Exploring NVMe over Fabrics and Storage Class Memory

Hemant Rawat
7 min read · Jul 20, 2024


Data storage paradigms

Image generation: Copilot

In the ever-evolving world of data storage, the quest for speed, efficiency, and scalability never ceases. Two cutting-edge technologies that are making significant strides in this domain are NVMe over Fabrics (NVMe-oF) and Storage Class Memory (SCM).

Overview

Non-Volatile Memory Express (NVMe) is a high-performance, scalable host controller interface designed to accelerate the transfer of data between enterprise and client systems and solid-state drives (SSDs). NVMe offers significant advantages over traditional interfaces like SATA and SAS by providing lower latency, higher input/output operations per second (IOPS), and enhanced parallelism.

The NVM Express (NVMe) interface provides a fixed-block read/write I/O interface to PCIe SSDs (drives) and external storage systems. It applies to all non-volatile memory (media that persistently stores data and has no moving parts, e.g., flash and 3D XPoint), and it is especially important for SSDs built on the newer low-latency media.

Storage Class Memory refers to low-latency non-volatile memory devices (3D XPoint is the best-known example). They are faster than flash and have better endurance (number of writes). Compared with DRAM, they have similar or somewhat longer latencies and higher capacities, at least initially.

Basics

Server Architecture — Classic

The initial server architecture has two types of data storage: DRAM memory, which is fast, volatile, and offers byte-level access, and HDD disk ("spinning rust"), which is slow, non-volatile, and offers block access.

Classic Server Architecture

Data moves from disk to memory (fetched by the CPU). The CPU writes results to memory, and the data is stored back to disk for future use.
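As a simplified illustration of that round trip, the sketch below reads one block from disk into a DRAM buffer, lets the CPU transform it, and writes the result back; the file path and block size are hypothetical.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096   /* hypothetical block size */

    int main(void)
    {
        uint8_t buf[BLOCK_SIZE];                       /* DRAM: fast, volatile       */
        int fd = open("/data/records.bin", O_RDWR);    /* disk: slow, non-volatile   */
        if (fd < 0) { perror("open"); return 1; }

        /* 1. Fetch a block from disk into memory. */
        if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) { close(fd); return 1; }

        /* 2. CPU processes the data in memory. */
        for (size_t i = 0; i < sizeof(buf); i++)
            buf[i] ^= 0xFF;                            /* placeholder transformation */

        /* 3. Store the result back to disk for future use. */
        pwrite(fd, buf, sizeof(buf), 0);
        close(fd);
        return 0;
    }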

Server Architecture-Evolved

Evolved Server Architecture

NVM (initially NAND flash) reduces response time. First adoption was through SATA (and SAS), followed by PCIe/NVMe SSDs which exploit multi-core CPUs and avoid the need for separate drive controllers.

Initially, NVDIMMs used DRAM for speed, with data de-staged to the slower NVM; the DRAM is battery- or capacitor-backed.

NVMe

NVMe is the specification for SSD access via PCI Express (PCIe), initially for flash media and later extended to fabrics (e.g., InfiniBand, RDMA/Ethernet). It is designed to scale to any type of non-volatile memory, including storage class memory.

The design target for NVMe is to achieve high parallelism and low-latency SSD access. It does not rely on SCSI (SAS/FC) or ATA (SATA) interfaces and uses new host drivers and I/O stacks.

It has a modern command set of 64-byte commands (vs. a typical 16 bytes for SCSI). It separates administrative and I/O commands (control path vs. data path), and the small command set allows small, fast host and storage implementations.
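To make the 64-byte command format concrete, here is a sketch of the submission queue entry layout (and the 16-byte completion entry) as described in the NVMe base specification; field names are abbreviated and spec details such as SGL descriptors and reserved bits are glossed over.

    #include <stdint.h>

    /* Sketch of the 64-byte NVMe submission queue entry (SQE). */
    struct nvme_sqe {
        uint8_t  opcode;      /* CDW0: command opcode                        */
        uint8_t  flags;       /* CDW0: fused operation, PRP/SGL selector     */
        uint16_t cid;         /* CDW0: command identifier                    */
        uint32_t nsid;        /* namespace identifier                        */
        uint32_t cdw2;        /* command specific / reserved                 */
        uint32_t cdw3;
        uint64_t mptr;        /* metadata pointer                            */
        uint64_t dptr[2];     /* data pointer: PRP1/PRP2 or an SGL descriptor */
        uint32_t cdw10;       /* command-specific dwords 10..15              */
        uint32_t cdw11;
        uint32_t cdw12;
        uint32_t cdw13;
        uint32_t cdw14;
        uint32_t cdw15;
    };                        /* 64 bytes total */

    _Static_assert(sizeof(struct nvme_sqe) == 64, "SQE must be 64 bytes");

    /* The 16-byte completion queue entry (CQE) carries the result,
     * the SQ head pointer, the matching CID, and a status field. */
    struct nvme_cqe {
        uint32_t result;      /* command-specific result                      */
        uint32_t rsvd;
        uint16_t sq_head;     /* how far the controller has consumed the SQ   */
        uint16_t sq_id;       /* submission queue that issued the command     */
        uint16_t cid;         /* matches the submitted command                */
        uint16_t status;      /* phase tag + status code                      */
    };

    _Static_assert(sizeof(struct nvme_cqe) == 16, "CQE must be 16 bytes");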

NVMe standards are developed by the NVM Express working group.

How NVMe works:

NVMe provides deep memory-based queues (up to 64K commands per queue, up to 64K queues) and a simple command set (13 required commands). The command-completion interface is optimized for success (the common case). The NVMe controller is the SSD element that processes NVMe commands.

NVMe Queues

NVMe introduces a multi-queue mechanism. This allows each CPU core to use an independent hardware queue pair to interact with the SSD. A queue pair consists of a submission queue and a completion queue. The CPU places commands into a submission queue, and the SSD places completions into the associated completion queue. The SSD hardware and host driver software control the head and tail pointers of queues to complete the data interaction.
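Here is a minimal host-side sketch of one such queue pair, reusing the SQE/CQE layouts sketched earlier; the type names and doorbell handling are simplified assumptions, not any real driver's definitions.

    #include <stdint.h>

    #define QUEUE_DEPTH 1024  /* illustrative; NVMe allows up to 64K entries */

    /* Hypothetical host-side view of one submission/completion queue pair.
     * Real drivers allocate the rings in DMA-able memory and map the
     * controller's doorbell registers from a PCIe BAR. */
    struct nvme_queue_pair {
        struct nvme_sqe *sq;            /* submission ring, written by the host   */
        struct nvme_cqe *cq;            /* completion ring, written by the device */
        uint16_t sq_tail;               /* next free SQ slot (host-owned)         */
        uint16_t cq_head;               /* next CQE to consume (host-owned)       */
        uint8_t  cq_phase;              /* expected phase tag for new CQEs        */
        volatile uint32_t *sq_doorbell; /* MMIO: tell the device the new tail     */
        volatile uint32_t *cq_doorbell; /* MMIO: tell the device the new head     */
    };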

NVMe command flow (a host-side sketch follows the list):
  1. Queue Command(s)
  2. Ring doorbell (New tail)
  3. Fetch Command(s)
  4. Process Command(s)
  5. Queue Completion(s)
  6. Generate Interrupt
  7. Process Completion
  8. Ring Doorbell (New Head)
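The sketch below illustrates the host-visible half of this flow (steps 1, 2, 7, and 8), building on the hypothetical nvme_queue_pair above; steps 3 through 6 happen inside the controller. Memory barriers and interrupt handling are omitted for brevity.

    /* Steps 1-2: queue a command and ring the submission doorbell. */
    static void nvme_submit(struct nvme_queue_pair *qp, const struct nvme_sqe *cmd)
    {
        qp->sq[qp->sq_tail] = *cmd;                    /* 1. queue command        */
        qp->sq_tail = (qp->sq_tail + 1) % QUEUE_DEPTH;
        *qp->sq_doorbell = qp->sq_tail;                /* 2. ring doorbell (tail) */
    }

    /* Steps 7-8: process a completion (after the controller has fetched and
     * executed the command, posted a CQE, and raised an interrupt), then ring
     * the completion doorbell with the new head. */
    static int nvme_poll_completion(struct nvme_queue_pair *qp, struct nvme_cqe *out)
    {
        struct nvme_cqe *cqe = &qp->cq[qp->cq_head];

        if ((cqe->status & 1) != qp->cq_phase)         /* phase tag: nothing new  */
            return 0;

        *out = *cqe;                                   /* 7. process completion   */
        qp->cq_head = (qp->cq_head + 1) % QUEUE_DEPTH;
        if (qp->cq_head == 0)
            qp->cq_phase ^= 1;                         /* phase flips each wrap   */
        *qp->cq_doorbell = qp->cq_head;                /* 8. ring doorbell (head) */
        return 1;
    }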

Additional NVMe functionality

  • Namespace management: commands to create/delete and attach/detach namespaces (analogous to SCSI logical units). Namespaces use flat 32-bit numbering (not the SCSI hierarchy), and a host can access many namespaces; admin commands like these are issued as shown in the sketch after this list.
  • End-to-end data protection: same format as SCSI Protection Information (aka DIF, Data Integrity Field)
  • Support for multiple ports on drives and subsystems: most current NVMe drives (SSDs) are single port
  • TCG (Trusted Computing Group) data-at-rest encryption (self-encrypting drives), e.g., Opal, Opalite
  • Power management
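On Linux, admin commands such as Identify or the namespace-management commands can be sent to a controller through the NVMe admin passthrough ioctl. The sketch below issues Identify Controller; /dev/nvme0 is an assumed device node, and error handling is minimal.

    #include <fcntl.h>
    #include <linux/nvme_ioctl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Identify Controller is admin opcode 0x06 with CNS=1 in CDW10.
     * Namespace create/attach use the same mechanism with different opcodes. */
    int main(void)
    {
        int fd = open("/dev/nvme0", O_RDONLY);          /* assumed controller node */
        if (fd < 0) { perror("open /dev/nvme0"); return 1; }

        void *id = calloc(1, 4096);                     /* Identify data is 4 KiB  */
        struct nvme_admin_cmd cmd = {
            .opcode   = 0x06,                           /* Identify                */
            .nsid     = 0,
            .addr     = (unsigned long long)(uintptr_t)id,
            .data_len = 4096,
            .cdw10    = 1,                              /* CNS=1: Identify Controller */
        };

        int err = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
        if (err != 0) {
            fprintf(stderr, "Identify failed (status %d)\n", err);
        } else {
            /* Model number occupies bytes 24..63 of the Identify Controller data. */
            printf("model: %.40s\n", (char *)id + 24);
        }

        free(id);
        close(fd);
        return 0;
    }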

NVMe Over Fabrics

NVMe-oF is a thin encapsulation of the base NVMe protocol across a fabric, with no translation to another protocol (e.g., SCSI). NVMe over Fabrics extends the NVMe protocol beyond the confines of a single server, enabling the use of high-speed networking technologies to connect NVMe devices. This extension allows for the creation of a disaggregated storage architecture, where storage resources can be shared and managed more efficiently across a network.
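One way to picture this thin encapsulation is the capsule: the same 64-byte submission entry and 16-byte completion entry sketched earlier are carried inside fabric messages, optionally with immediate data, instead of being fetched from host memory. A simplified sketch (the actual framing, e.g., NVMe/TCP PDU headers, is transport-specific):

    #include <stdint.h>

    /* Simplified NVMe-oF capsule sketch, reusing struct nvme_sqe/nvme_cqe
     * from the earlier sketches. */
    struct nvmeof_command_capsule {
        struct nvme_sqe sqe;       /* unchanged 64-byte NVMe command             */
        uint8_t in_capsule_data[]; /* optional immediate data, transport-limited */
    };

    struct nvmeof_response_capsule {
        struct nvme_cqe cqe;       /* unchanged 16-byte NVMe completion          */
    };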

NVMe over Fabrics

NVMe over Fabrics gives servers access to external storage over one of the following transports (a minimal host-side connection sketch follows the list):

  1. IP/Ethernet RDMA: RoCEv2, iWARP (standardized RDMA transports)
  • RoCEv2: UDP/IP-based, requires a "lossless" Ethernet network (DCB/PFC)
  • iWARP: TCP/IP-based, better tolerance of network loss
  • Hardware (RNIC) implementation is preferred; RNICs also support other protocols, e.g., SMB Direct, iSCSI (via iSER)

  2. IP/Ethernet non-RDMA: TCP/IP
  • Software-based implementation; leverages TCP offloads in high-volume NICs

  3. Fibre Channel (NVMe-FC transport standard)
  • Largely compatible with current and future FC hardware (e.g., for data transfer), but new firmware/drivers are needed
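To show how a Linux host might initiate one of these transports, the sketch below writes a connect string to the kernel's fabrics control device, which is essentially what the nvme-cli "nvme connect" command does under the hood. The transport, addresses, and NQNs are placeholder values, and the nvme-fabrics/nvme-tcp kernel modules are assumed to be loaded.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Connect to a hypothetical NVMe/TCP target via /dev/nvme-fabrics. */
    int main(void)
    {
        const char *opts =
            "transport=tcp,"
            "traddr=192.0.2.10,"                         /* target IP (example) */
            "trsvcid=4420,"                              /* NVMe/TCP port       */
            "nqn=nqn.2024-07.io.example:subsystem1,"     /* subsystem NQN       */
            "hostnqn=nqn.2024-07.io.example:host1";      /* this host's NQN     */

        int fd = open("/dev/nvme-fabrics", O_RDWR);
        if (fd < 0) { perror("open /dev/nvme-fabrics"); return 1; }

        if (write(fd, opts, strlen(opts)) < 0)
            perror("connect");
        else
            printf("controller created; see /sys/class/nvme for the new device\n");

        close(fd);
        return 0;
    }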

The primary use case for NVMe PCIe SSDs accessed this way is the all-flash appliance, where hundreds or more SSDs may be attached.

NVMe in Fabric Environments

Differences from PCIe

  1. Queues: fabric transport connections replace shared memory
  • Commands and completions use messages (capsules) instead of shared memory
  • Data transfer is fabric-specific, e.g., RDMA

  2. Controller and queue initialization
  • A discovery controller replaces PCIe device enumeration (it may be complemented by fabric-specific mechanisms, e.g., the FC fabric name server)
  • New commands handle controller configuration and queue setup (each queue pair, admin and I/O, uses a separate fabric transport connection)

  3. Fabric security
  • Fabric transport connections can be authenticated (TCG secure messaging based on shared secrets)
  • The fabric may provide a secure channel, e.g., IPsec

End-to-end NVMe over Fabric

End-to-end NVMe Over Fabrics

Benefits of NVMe-oF
Scalability: NVMe-oF enables the creation of large-scale storage networks, allowing organizations to scale their storage resources dynamically.
Performance: By leveraging high-speed networking and the NVMe protocol, NVMe-oF significantly reduces latency and increases IOPS.
Resource Utilization: Disaggregated storage architectures improve resource utilization, allowing for more efficient allocation of storage and compute resources.

Storage Class Memory

Storage Class Memory (SCM) is a type of memory that bridges the gap between traditional DRAM and NAND flash storage. It offers a unique combination of low latency, high endurance, and persistence, making it an ideal solution for a variety of data-intensive applications.

Storage Class Memory (SCM) enables large memory-access capacities and addresses the latency, capacity, and $/GB gaps between DRAM and flash. It adds persistence to load/store memory (fast system recovery times, fast write commits) and enables new storage and network system/device options (low-latency block I/O devices such as SCM NVMe SSDs).

Storage Class Memory
Storage Class Memory Impact
Storage Class Memory Technology Stack

Types of SCM
There are several technologies under the SCM umbrella, including:

Intel Optane: Based on 3D XPoint technology, Intel Optane provides significantly lower latency and higher endurance than NAND flash.
Phase-Change Memory (PCM): Uses the physical state change of materials to store data, offering fast read and write speeds.
Magnetoresistive RAM (MRAM): Utilizes magnetic storage elements, providing non-volatility and high-speed access.

SCM Technologies

SCM Connection Technologies

SCM Connection Configurations

Possible Configurations:

  • Processor (or offload device) expanded local memory — DRAM+SCM
  • Lower latency SCM SSDs via NVMe or NVMe over Fabrics
  • Rack-level fabric — capacity and bandwidth scale independently of processor memory

Connection Technologies

  • GenZ — Memory semantic-based scalable fabric — plus storage and network gateway connectivity
  • CCIX (C6) — Offload engine connection fabric with support of memory and coherency
  • OpenCAPI — Open version of IBM CAPI (Coherent Accelerator Processor Interface)

GenZ

GenZ Architecture

GenZ provides the following benefits:

  • memory semantics — simple reads and writes
  • From tens to several hundred GB/s of bandwidth
  • Sub-100 ns load-to-use memory latency
  • Real-time analytics
  • Enables data centric and hybrid computing
  • Scalable memory pools for in-memory applications
  • Abstracts media interface from SoC to unlock new media innovation
  • Provides end-to-end secure connectivity from node level to rack scale
  • Supports unmodified OS for SW compatibility
  • Graduated implementation from simple, low cost to highly capable and robust
  • Leverages high-volume IEEE physical layers and broad, deep industry ecosystem

SCM Programming Model

SNIA Persistent Memory Open Programming Model: Defines four programming modes: Legacy block mode, Legacy file mode, PM-aware block mode, PM-aware file mode.
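As a minimal illustration of the PM-aware file mode, the sketch below memory-maps a file on a filesystem assumed to be DAX-mounted on SCM, updates it with ordinary loads and stores, and flushes explicitly to make the store durable. The path is hypothetical, and production code would typically use a persistent-memory library (e.g., PMDK's libpmem) rather than msync.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PM_SIZE 4096

    int main(void)
    {
        /* /mnt/pmem0 is assumed to be a DAX-mounted filesystem backed by SCM. */
        int fd = open("/mnt/pmem0/counter", O_CREAT | O_RDWR, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, PM_SIZE) != 0) { perror("ftruncate"); return 1; }

        /* Map the file: loads and stores now target the persistent media. */
        uint64_t *pm = mmap(NULL, PM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (pm == MAP_FAILED) { perror("mmap"); return 1; }

        /* Byte-addressable update: no block I/O, just a store... */
        pm[0] += 1;

        /* ...followed by an explicit flush so the store is durable.
         * (libpmem's pmem_persist() would use cache-line flush instructions here.) */
        if (msync(pm, PM_SIZE, MS_SYNC) != 0) perror("msync");

        printf("persistent counter = %llu\n", (unsigned long long)pm[0]);
        munmap(pm, PM_SIZE);
        close(fd);
        return 0;
    }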

SCM Implementations — DIMM form factor

  • NVDIMM-x devices defined by JEDEC (DIMM form factor, DDRx interface)
  • NVDIMM-N (DRAM backed up by flash)
  • NVDIMM-F (flash SSD on a DIMM)
  • NVDIMM-P

Summary

NVMe provides scalable performance (roughly 1 GB/s per lane with PCIe Gen3; more lanes mean more performance). It provides low latency through a direct CPU connection (via PCIe) and reduced software overhead (new OS drivers and I/O stacks). Its parallelism is matched to multi-core (and hyper-threaded) CPUs. Compared to SAS and SATA, it does not require a separate drive controller.
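As a rough sanity check on the per-lane figure, PCIe Gen3 runs at 8 GT/s per lane with 128b/130b encoding, which the short calculation below turns into approximate raw link bandwidths.

    #include <stdio.h>

    int main(void)
    {
        /* PCIe Gen3: 8 GT/s per lane, 128b/130b encoding. */
        double gtps = 8e9;                                      /* transfers/s per lane */
        double per_lane_bytes = gtps * (128.0 / 130.0) / 8.0;   /* ~0.985 GB/s          */

        for (int lanes = 1; lanes <= 16; lanes *= 2)
            printf("x%-2d link: ~%.1f GB/s raw bandwidth\n",
                   lanes, lanes * per_lane_bytes / 1e9);
        return 0;
    }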
