

# PROBLEM SOLVED.

## Memory Channel Storage<sup>™</sup>

### Maher Amer

**CTO, Diablo Technologies**



## DIABLO TECHNOLOGIES HIGHLIGHTS





## IT'S ALL ABOUT THE **APPLICATIONS!**


LOW LATENCY APPLICATIONS

+ TRANSACTION LOGGING

+ TRANSACTION PLAYBACK

+ LOW LATENCY MESSAGING

+ ELECTRONIC TRADING



VIRTUAL DESKTOPS

+ DATA CACHING

+ SERVER CONSOLIDATION

+ VM GOLDEN IMAGES

+ VSAN

DATABASE / HYPERSCALE

+ FREQUENTLY ACCESSED TABLES

+ LOG FILES

+ TEMPDB FILES

+ INDEXING

+ PARTITION TABLES



BIG DATA ANALYTICS

+ LOG REDUCTION

+ LOG PARSING

+ EVENT LOGGING

+ MEMCACHED

+ IN MEMORY DATA GRIDS



SERVER VIRTUALIZATION

+ VSAN

+ SOFTWARE DEFINED STORAGE

+ VIRTUALIZED DATABASES

+ DATA CACHING



## THE PERFORMANCE TRADE-OFF

Traditionally, customers have faced a suboptimal trade-off in storage system design:



### A Painful Workaround...

- When SSD "IOPS vs. Latency" trade-offs are unacceptable, adding expensive RAM is a traditional recourse.
- However, adding RAM can create an imbalance between incremental performance requirements and rapidly growing solution cost.





## FLASH STORAGE EVOLUTION THUS FAR





### Enter Memory Channel Storage (MCS™)



Massive flash capacity exposed through the low-latency memory subsystem









### MCS IN SYSTEM MEMORY MAP





## HARDWARE ARCHITECTURE





## SOFTWARE ARCHITECTURE





## **DRIVER DETAILS**



### Plugs into the block layer:

- Bypasses SCSI/SATA on Linux
- Emulates SCSI on Windows and VMware

### Handles requests asynchronously:

- Kernel posts requests into the driver's incoming request queue.
- Driver thread generates commands, posts them to the device, checks status, and copies data.

### Handles data and control requests:

- 512B/4kB native atomics
- Up to 32kB atomics with firmware aid
- SMART logs, thermal data, statistics, events, etc.
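The asynchronous hand-off described above can be sketched as a producer/consumer queue. This is a minimal illustration only: `Request`, `FakeDevice`, and `driver_thread` are hypothetical names, and the in-memory device stands in for real hardware rather than modelling the actual MCS driver.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class Request:
    """A block request as the kernel might hand it over (hypothetical shape)."""
    op: str                      # "read" or "write"
    lba: int                     # logical block address
    data: bytes = b""            # payload for writes; filled in for reads
    status: str = "PENDING"
    done: threading.Event = field(default_factory=threading.Event)


class FakeDevice:
    """In-memory stand-in for a storage module; not the real device interface."""

    def __init__(self):
        self.blocks = {}

    def post(self, command):
        op, lba, data = command
        if op == "write":
            self.blocks[lba] = data
            return "OK", b""
        if op == "read":
            return "OK", self.blocks.get(lba, b"")
        return "ERROR", b""


def driver_thread(incoming, device):
    """Drain the incoming queue: build a command, post it, check status, copy data."""
    while True:
        req = incoming.get()
        if req is None:                          # sentinel: stop the thread
            break
        command = (req.op, req.lba, req.data)    # generate the command
        status, payload = device.post(command)   # post it to the device
        req.status = status                      # check status
        if status == "OK" and req.op == "read":
            req.data = payload                   # copy data back to the caller
        req.done.set()                           # signal completion


def run_demo():
    """Play the kernel's role: post a write and a read, then wait for the read."""
    incoming = queue.Queue()
    device = FakeDevice()
    worker = threading.Thread(target=driver_thread, args=(incoming, device))
    worker.start()
    write_req = Request("write", lba=7, data=b"hello")
    read_req = Request("read", lba=7)
    incoming.put(write_req)      # posting returns immediately
    incoming.put(read_req)
    read_req.done.wait()
    incoming.put(None)
    worker.join()
    return write_req.status, read_req.data
```

Because the caller only enqueues and later waits on `done`, it never blocks on the device itself, which is the point of the incoming request queue.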



### **EXAMPLE WRITE**



1. OS requests a write.
2. Driver writes data to a write buffer (4kB data plus optional metadata).
3. Driver constructs a Diablo MCS protocol command and writes it to a command buffer (encodes intent, LBA, buffer number, and E2E integrity metadata).
4. Driver checks status.
5. Driver completes the write.
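The write steps above can be sketched against a simulated module. Everything specific here is an assumption made for illustration: the command layout in `COMMAND_FORMAT`, the use of CRC32 as the end-to-end integrity metadata, and the `SimulatedModule` class are hypothetical and do not describe the actual Diablo MCS protocol.

```python
import struct
import zlib

BLOCK_SIZE = 4096
OP_WRITE = 1
# Hypothetical command layout (little-endian, unpadded):
# opcode (1 byte), buffer number (1 byte), LBA (8 bytes), CRC32 of data (4 bytes).
COMMAND_FORMAT = "<BBQI"


class SimulatedModule:
    """In-memory stand-in for a module with write buffers and a command buffer."""

    def __init__(self, num_buffers=4):
        self.write_buffers = [bytearray(BLOCK_SIZE) for _ in range(num_buffers)]
        self.command_buffer = b""
        self.flash = {}
        self.status = "IDLE"

    def execute(self):
        """Act on the command buffer, as the device would after a command lands."""
        op, buf_no, lba, crc = struct.unpack(COMMAND_FORMAT, self.command_buffer)
        data = bytes(self.write_buffers[buf_no])
        if op == OP_WRITE and zlib.crc32(data) == crc:
            self.flash[lba] = data
            self.status = "DONE"
        else:
            self.status = "ERROR"


def driver_write(module, lba, data, buf_no=0):
    """Follow the five write steps; step 1 is the caller requesting the write."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("this sketch only handles whole 4kB blocks")
    module.write_buffers[buf_no][:] = data                 # step 2: fill write buffer
    module.command_buffer = struct.pack(                   # step 3: build and post command
        COMMAND_FORMAT, OP_WRITE, buf_no, lba, zlib.crc32(data)
    )
    module.execute()                                       # simulated device work
    if module.status != "DONE":                            # step 4: check status
        raise OSError("device reported an error")
    return True                                            # step 5: complete the write
```

Note that both the data and the command are placed with ordinary stores into buffers, so the sketch has no separate DMA step.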



### **EXAMPLE READ**



1. OS requests a read.
2. Driver constructs a Diablo MCS protocol command and writes it to a command buffer (encodes intent, LBA, buffer number, and E2E integrity metadata).
3. Driver checks status.
4. Driver reads data from a read buffer and validates integrity (4kB data plus optional metadata).
5. Driver completes the read.
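The read steps above can be sketched the same way. As an illustration only, the command layout, the CRC32 integrity check, and the `SimulatedModule` class below are hypothetical and are not the actual Diablo MCS protocol.

```python
import struct
import zlib

BLOCK_SIZE = 4096
OP_READ = 2
# Hypothetical command layout: opcode (1 byte), buffer number (1 byte), LBA (8 bytes).
COMMAND_FORMAT = "<BBQ"


class SimulatedModule:
    """In-memory stand-in for a module with read buffers and a command buffer."""

    def __init__(self, num_buffers=4):
        self.read_buffers = [bytearray(BLOCK_SIZE) for _ in range(num_buffers)]
        self.read_metadata = [0] * num_buffers   # CRC32 kept alongside each block
        self.command_buffer = b""
        self.flash = {}                          # lba -> (data, crc)
        self.status = "IDLE"

    def preload(self, lba, data):
        """Seed the simulated flash so there is something to read."""
        self.flash[lba] = (bytes(data), zlib.crc32(data))

    def execute(self):
        """Act on the command buffer, as the device would after a command lands."""
        op, buf_no, lba = struct.unpack(COMMAND_FORMAT, self.command_buffer)
        if op == OP_READ and lba in self.flash:
            data, crc = self.flash[lba]
            self.read_buffers[buf_no][:] = data
            self.read_metadata[buf_no] = crc
            self.status = "DONE"
        else:
            self.status = "ERROR"


def driver_read(module, lba, buf_no=0):
    """Follow the five read steps; step 1 is the caller requesting the read."""
    module.command_buffer = struct.pack(COMMAND_FORMAT, OP_READ, buf_no, lba)  # step 2
    module.execute()                                       # simulated device work
    if module.status != "DONE":                            # step 3: check status
        raise OSError("device reported an error")
    data = bytes(module.read_buffers[buf_no])              # step 4: copy from read buffer
    if zlib.crc32(data) != module.read_metadata[buf_no]:   # step 4: validate integrity
        raise OSError("integrity check failed")
    return data                                            # step 5: complete the read
```

The integrity check runs in the driver after the copy, so corruption anywhere between flash and host memory would be caught before the read completes.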



### **CONFIGURABLE DEVICE GROUPS**



**NOTE 1:** Shown only for one NUMA node, but this pattern is replicated on each node.

**NOTE 2:** Devices can be combined in any combination.

Device Grouping:

+ Configurable CPU affinity
+ One thread round-robins between the active devices in a group
+ Efficiency through driver/device locality
+ Flexible prioritization of latency vs. CPU usage
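The grouping idea can be sketched as one service loop per group that visits its active devices in turn. `Device`, `DeviceGroup`, and `service_group` are hypothetical names chosen for this illustration, and the `cpu` field only records the intended affinity rather than pinning anything.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    pending: deque = field(default_factory=deque)   # requests waiting on this device


@dataclass
class DeviceGroup:
    cpu: int          # CPU the group's thread would be pinned to (not enforced here)
    devices: list


def service_group(group, max_passes=100):
    """One thread's loop: visit each active device in turn, one request per visit."""
    completed = []
    for _ in range(max_passes):
        active = [dev for dev in group.devices if dev.pending]
        if not active:
            break                                   # nothing left to service
        for dev in active:
            completed.append((dev.name, dev.pending.popleft()))
    return completed
```

Putting fewer devices in a group gives each device more of its thread's attention (lower latency), while larger groups use fewer threads (lower CPU usage), which is the trade-off the last bullet refers to.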



## TECHNOLOGY COLLABORATION TO CREATE THE FIRST MCS-ENABLED PRODUCT



- Reference architecture design
- DDR3 to SSD ASIC/firmware
- Kernel and application level software development
- OEM System Integration and enterprise application domain knowledge



# **SanDisk**<sup>®</sup>

- Guardian Technology for enterprise applications
- SSD controller & FTL firmware development and test
- Supply Chain and Manufacturing with flash partner
- System Validation



### REDUCED LATENCY ENABLES REAL-TIME ANALYTICS





### + THE APPLICATION HAS BECOME THE BOTTLENECK IN E-TRADING



## MEMORY MAPPED I/O ACCELERATION

### 10 million records (20GB mmap) using synchronous msync calls



mmap Random Write: Write Latency Histogram



#### mmap Random Write: Write Latency Percentiles

+ MCS 99th-percentile latency is 2x lower than Competitor 2 and 10x lower than Competitor 1
+ MCS has the tightest latency distribution
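A benchmark of this shape can be sketched in a few lines: write a random record through a memory mapping, force it to storage with a synchronous flush, and time each operation. This is a scaled-down illustration, not the harness behind the slide. The record size (one page), the tiny file, and the nearest-rank percentile are assumptions; on POSIX systems, Python's `mmap.flush` calls `msync`.

```python
import math
import mmap
import random
import tempfile
import time


def mmap_random_write_latencies(num_records=256, num_writes=200, seed=0):
    """Time random record writes through a mapping, each followed by a sync."""
    record_size = mmap.PAGESIZE          # assumption: one page per record
    rng = random.Random(seed)
    payload = b"x" * record_size
    latencies = []
    with tempfile.TemporaryFile() as f:
        f.truncate(num_records * record_size)
        with mmap.mmap(f.fileno(), num_records * record_size) as mm:
            for _ in range(num_writes):
                offset = rng.randrange(num_records) * record_size
                start = time.perf_counter()
                mm[offset:offset + record_size] = payload
                mm.flush(offset, record_size)    # synchronous msync on POSIX
                latencies.append(time.perf_counter() - start)
    return latencies


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Comparing `percentile(latencies, 99)` with `percentile(latencies, 50)` gives a rough view of how tight the distribution is on a given device.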



### **SYNCHRONOUS WRITES**



Throughput (16k random synchronous writes)

Storage: fusionio, udimm\_4, udimm\_8
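A workload of this kind can be sketched as random 16 KiB writes, each made durable before the next one is issued. This is a minimal illustration, not the tool used for the chart: the file size, the write count, and the choice of `fsync` after every write to get synchronous behaviour are assumptions.

```python
import os
import random
import tempfile
import time

IO_SIZE = 16 * 1024   # assumption: "16k" means 16 KiB per write


def sync_random_write_throughput(file_size=64 * IO_SIZE, num_writes=64, seed=0):
    """Issue random synchronous writes and return throughput in MB/sec."""
    rng = random.Random(seed)
    payload = os.urandom(IO_SIZE)
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, file_size)
        slots = file_size // IO_SIZE
        start = time.perf_counter()
        for _ in range(num_writes):
            os.pwrite(fd, payload, rng.randrange(slots) * IO_SIZE)
            os.fsync(fd)              # wait until this write is durable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return num_writes * IO_SIZE / elapsed / 1e6
```

Because every write waits for the previous one to reach stable storage, the result is dominated by per-operation device latency rather than by queue depth.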



### **SYNCHRONOUS READS**

Throughput (16k random synchronous reads), plotted in MB/sec over time (sec)

Storage: fusionio, udimm\_4, udimm\_8



### **SYNCHRONOUS MIXED**



Throughput (16k random synchronous reads and writes)

Legend: reads, writes



## Linkbench MySQL Load



- Linkbench is CPU bound with MCS: more than 70% of CPU time is spent in USR
- Linkbench is IO bound with Fusion: more than 70% of CPU time is spent in iowait

diablo 10/31/2013 | Diablo Technologies


## Linkbench



- MCS-based solution is not IO bound
- Adding more CPU power WILL increase server productivity


## MCS FOR VIRTUALIZATION

#### VIRTUAL MACHINE ACCELERATION

- Ultra low response times for virtualized applications
- Ideal as a datastore for VM swap files
- Fully certified technology for the ESXi/vSphere platform

#### PERFECT SOLUTION FOR VSAN

- MCS eliminates the need for external storage arrays
- Extremely fast commit to clustered nodes for HA
- Predictable IOPS and latency for heavy workloads

#### CACHING SOLUTION FOR VIRTUALIZED ENVIRONMENTS

- Primary cache for hot data stored externally
- Ideally suited for random, mixed virtualized workloads
- Strong solution for virtualized relational and in-memory databases

#### REDUCED TOTAL COST OF OWNERSHIP

- Requires 1/6 of the typical memory per VM, as MCS offsets DRAM
- Allows up to 4x VMs per host compared to PCIe-based flash
- Improved VDI user experience with lower Capex/Opex





## SOLVING THE I/O RESPONSE TIME ISSUE



- **K**<sub>AVG</sub>: Total Kernel Time
- **D**<sub>AVG</sub>: Device Latency
- **G**<sub>AVG</sub>: Total VM Latency
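These three labels match the latency counters reported by VMware's esxtop, where the latency seen by the guest is the sum of kernel time and device latency. Assuming the slide uses the same definitions, the relationship is:

```latex
G_{AVG} = K_{AVG} + D_{AVG}
```

As an illustration with made-up numbers (not taken from the slide): with a kernel time of 0.1 ms and a device latency of 0.4 ms, the VM sees 0.5 ms. Lowering device latency therefore lowers VM-visible latency one for one, as long as kernel time stays flat.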





## MCS AS FLASH TIER IN SPINDLE-BASED VSAN



VIRTUAL SAN

3 Cluster Nodes, 250 Linked Clones/Host, Full HA



## SUMMARY

### **Memory Channel Storage**

+ Leverages parallelism and scalability of the memory channel

+ Significantly reduces data persistence latencies and improves single-thread throughput





### **Benefits of MCS**

+ 200GB to tens of TBs of flash in standard DIMM form factor and DDR3-CPU interface

+ Disruptive performance accelerates existing applications and enables new flash use cases

+ Scalability facilitates economic, "right-sized" system solutions

+ Form factor enables high-performance flash in servers, blades, and storage arrays

+ Future-proofed, with the ability to utilize NAND flash and future non-volatile memories



## **THANK YOU!**

### mamer@diablo-technologies.com



## MCS SYSTEM VIEW



Leveraging the **Power of Parallelism**...

+ Massive Flash capacity exposed through the low-latency memory subsystem.



### **PROBLEM SOLVED.**

### High Quality User Experience in "Noisy Neighborhood" Environments

#### Performance Stability

### Cost-Effective Hardware

### Improved Scalability

### Efficient Consolidation



# **MEMORY CHANNEL STORAGE** ECOSYSTEM





# MEMORY CHANNEL STORAGE REFERENCE DESIGN KITS (RDKs)

Modular, Reference Solutions for Enablement/Evaluation by SSD manufacturers, OEMs, and ISVs

### Each RDK includes:

- MCS Chipset
  - Enables hardware interface via Memory Channel
  - Includes full firmware
- MCS Drivers
  - Manages communication between Host and MCS Module(s)
  - Diablo drivers for Windows, VMware ESXi and popular Linux distributions/kernels
- Storage Subsystem
  - Reference Non-Volatile Memory (NVM) solution
  - Final NVM solution will vary according to SSD Manufacturer/OEM preference

## MEMORY CHANNEL STORAGE **CARBON**<sub>1</sub>

- The First Commercialized MCS RDK
  - Enables NAND Flash to Directly Interface on the Memory Channel
- Presents as a Block I/O Device
  - Can be Managed just like Existing Storage Devices
- DDR3 Interface, Standard RDIMM Physical Form Factor
  - Plugs into Standard DIMM Slots
  - Self-contained, No External Connections Required





# MCS CARBON<sub>1</sub>: SYSTEM REQUIREMENTS & COMPATIBILITY

### Hardware and BIOS Requirements

- Server enabled with MCS UEFI BIOS modifications
- DDR3-compatible processor
  - Compatible with standard JEDEC-compliant 240-pin RDIMMs
  - Supports DDR3-800 through DDR3-1600
- 8GB of standard memory (RDIMM) installed in the system
- Follows standard server DIMM population rules

### Initial OS Support

- Linux (RHEL, SLES)
- Windows Server
- VMware ESXi





## ANATOMY OF AN ISV ENGAGEMENT: PERCONA

+ Percona Tested Memory Channel Storage Devices

- Percona is the oldest and largest independent MySQL provider
- Experts in MySQL and InnoDB Performance
- Serving more than 2,000 customers in 50+ countries
- Provides and supports the Percona Server MySQL distribution
- Performance Consulting
  - MySQL architecture and design reviews
  - Diagnosing and solving MySQL performance problems
  - Optimization of MySQL on SSD infrastructure
  - Performance Audits to identify performance improvements

+ Diablo ISV Partnering Analysis identified Percona as a critical partner



## ANATOMY OF AN ISV ENGAGEMENT: PERCONA

- Percona Memory Channel Storage Testing
  - Tested Carbon<sub>1</sub> reference design
  - Tested ULLtraDIMM Carbon<sub>1</sub> based product
- Benchmark Testing
  - Sysbench Benchmarks
  - Linkbench
- Metrics Measured
  - Reads/Writes/Mixed Workload
  - Throughput
  - Operations per Second
  - 95th Percentile Response Time



## NVMe\* vs. MCS: Write Request Flow

| NVMe Write Request Flow | Latency\*\* |
|---|---|
| Block layer provides driver with pointer | 1-2μs |
| Driver pushes command (includes pointer) into NVMe submission queue [Memory Transaction] | <1μs |

The remaining rows of this comparison, including the MCS column, are not recoverable from the slide. One later NVMe step is labeled "THIS STEP DOES NOT EXIST IN THE MCS FLOW."

• Memory transactions (and transactions occurring within device hardware) are very deterministic and faster than I/O DMAs

• I/O DMAs involve the I/O controller and are non-deterministic (subject to conflicts with other system I/O)

\*NVMe flow depicted since the current PCIe flow (through SCSI stack) is commonly accepted as inefficient.


## NVMe\* vs. MCS: Read Request Flow



• Memory transactions (and transactions occurring within device hardware) are very deterministic and faster than I/O DMAs

• I/O DMAs involve the I/O controller and are non-deterministic (subject to conflicts with other system I/O)

\*NVMe flow depicted since the current PCIe flow (through SCSI stack) is commonly accepted as inefficient.
