I/O Stack in Linux

Hemant Rawat
Jun 9, 2024


Image generation: DALL-E3
Block I/O Layer
- Block devices
- I/O Schedulers
- New Paradigm & blk-mq
- Drivers
File System Layer
- What is FS?
- FS Structure
- FS Resources
- FS Objects
- Journaling
- FS Comparison (ext2, ext3 and ext4)
Virtual File System Layer (VFS)
- Main components
- Important Structures
The I/O Stack

Before diving into the details of file systems, it is helpful to understand how storage performance is measured. The following parameters are commonly used:

  • IOPS: Number of I/O (read/write) operations performed within a second
  • Throughput: The amount of data transferred per unit of time (e.g. MB/s). It can also be seen as the product of the I/O size and the IOPS (Throughput = IO size * IOPS); for example, 4 KB * 25,000 IOPS ≈ 100 MB/s
  • Latency: The time (usually measured in ms) it takes for an I/O operation to complete
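As a rough, hedged illustration of how these numbers are obtained in practice, a tool such as fio can drive a synthetic read-only workload and report IOPS, throughput and latency together (the device path below is only a placeholder):

# fio --name=randread --filename=/dev/xvdf --direct=1 --rw=randread \
      --bs=4k --ioengine=libaio --iodepth=32 --runtime=30 --time_based \
      --group_reporting

The summary printed at the end contains the IOPS, the bandwidth (throughput) and the completion latency percentiles for the run.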

Block Devices

  • Magnetic Rotational (HDD): HDDs are made up of one or more rotating disks (platters) paired with magnetic heads, usually arranged on a moving actuator arm, which read and write data to the platter surfaces. Although random I/O operations can be performed, due to its mechanical nature those operations have a higher latency than sequential ones.
  • Solid State Drives (SSD): An SSD is solid-state, non-volatile memory. Its performance is typically much better than that of a rotational disk; it is consistent across different offsets (no seek latency) and predictable for a given I/O size. Whether the workload is random or sequential matters much less than with rotational disks.
  • NVMe: NVM Express is an interface for non-volatile media (usually flash memory) attached via a PCI Express (PCIe) bus. It allows host hardware and software to fully exploit the levels of parallelism possible in modern SSDs, reducing I/O overhead and latency.
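A quick way to check which category a given disk falls into is the rotational flag the kernel exposes per block device; a hedged example (device names are placeholders):

# lsblk -d -o NAME,ROTA,SIZE,TYPE
# cat /sys/block/nvme0n1/queue/rotational

ROTA (and the rotational file) prints 1 for spinning disks and 0 for SSDs and NVMe devices.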

The Old paradigm

Historically, one of the biggest challenges for I/O schedulers was to put random requests "in sequence", since for magnetic disks (also called rotational devices) random access was expensive in terms of latency: it required the arm to move back and forth to locate a specific cylinder/sector. Due to its mechanical nature, the performance of a rotational device is not as predictable as that of an SSD, and therefore the number of IOPS you can get depends on the workload and is generally lower than what you can achieve with a solid-state drive.

Access time = seek time + rotational delay

Avg. access time (10K rpm) = 9 ms + 3 ms = 12 ms

(The rotational delay is on average half a revolution, i.e. 60 s / 10,000 / 2 ≈ 3 ms for a 10K rpm disk. For an SSD drive, the average access time is < 0.1 ms.)

I/O Schedulers

I/O scheduling usually has to work with HDDs that have long access times for requests placed far away from the current position of the disk head. To minimize the effect this has on system performance, most I/O schedulers implement a variant of the elevator algorithm, which reorders the incoming random requests so that the associated data is accessed with minimal arm/head movement.

I/O schedulers can have many purposes depending on the goals that they are pursuing; some of them are:

  • minimize time wasted by HDD seeks
  • prioritize certain processes' I/O requests
  • give a share of the disk bandwidth to each running process
  • guarantee that certain requests will be issued before a particular deadline

Types of I/O Schedulers

  1. Noop: The noop scheduler is the simplest I/O scheduler in the Linux kernel. Since it does not perform any scheduling work (it is just a FIFO queue), it is the recommended option when the overhead of scheduling is deemed unnecessary. The noop scheduler inserts all incoming I/O requests into a simple FIFO queue and implements request merging. It is useful when it has been determined that the host should not attempt to re-order requests based on the sector numbers contained in them; in other words, the host is assumed to be unaware of how to productively re-order requests. There are generally three situations where this is desirable: (i) I/O scheduling is handled at a lower layer of the I/O stack, for example at the block device, by an intelligent RAID controller, by Network Attached Storage, or by an externally attached controller such as a storage subsystem accessed through a switched Storage Area Network (SAN); since requests are potentially re-scheduled at that lower level, re-sequencing them on the host spends CPU time on work that will simply be undone further down, increasing latency and decreasing throughput for no productive reason. (ii) Accurate details of sector position are hidden from the host system; an example would be a RAID controller that performs no scheduling of its own. Even though the host can re-order requests and the RAID controller cannot, the host lacks the visibility to re-order them in a way that actually lowers seek time. Since the host has no way of knowing what a more streamlined queue would "look" like, it can only pass requests on to the device that is (theoretically) more aware of such details. (iii) Movement of the read/write head has been determined not to impact application performance enough to justify the additional CPU time spent re-ordering requests; this is usually the case with non-rotational media such as flash drives or solid-state drives. This is not to say noop is necessarily the preferred I/O scheduler for these scenarios: as with any performance tuning, guidance should be based on observed workload patterns, which undermines simplistic rules of thumb. If there is contention for available I/O bandwidth from other applications, other schedulers may still perform better by carving up that bandwidth more intelligently for the applications deemed most important. For example, with an LDAP directory server a user may want deadline's read preference and latency guarantees; a desktop system running many different applications may want CFQ's tunables or its ability to prioritize bandwidth for particular applications over others (ionice).
  2. Deadline: The main goal of the deadline scheduler is to guarantee a start service time for a request. It does so by imposing a deadline on all I/O operations to prevent starvation of requests. In addition to the sorted queues (one for reads, one for writes), it maintains two deadline queues; the deadline queues are sorted by expiration time, while the sorted queues are ordered by sector number. Before serving the next request, the deadline scheduler decides which queue to use. Read queues are given higher priority, because processes usually block on read operations. Next, the scheduler checks whether the first request in the deadline queue has expired; if it has, that request is served immediately, otherwise the scheduler serves a batch of requests from the sorted queue. In both cases, the scheduler also serves a batch of requests following the chosen request in the sorted queue. By default, read requests expire after 500 ms and write requests after 5 seconds (these expirations are tunable; see the example after this list).
  3. Anticipatory: an enhanced version of deadline, with heuristics that anticipate further I/O in order to improve global throughput. These can include pausing for a few milliseconds after a read rather than immediately servicing writes, on the prediction that another read request for a nearby disk location may arrive during that time.
  4. CFQ: The Completely Fair Queueing scheduler allocates I/O time slices to processes, similar to CPU scheduling, for fair usage of disk resources. It also allows priorities and classes to be set for user processes via the ionice command (see the example after this list). CFQ places synchronous requests submitted by processes into a number of per-process queues and then allocates time slices for each queue to access the disk. The length of the time slice and the number of requests a queue is allowed to submit depend on the I/O priority of the given process. Asynchronous requests for all processes are batched together into fewer queues, one per priority. While CFQ does not do explicit anticipatory I/O scheduling, it achieves the same effect of good aggregate throughput for the system as a whole by allowing a process to idle at the end of synchronous I/O, thereby anticipating further nearby I/O from that process. It can be considered a natural extension of granting I/O time slices to a process.
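The deadline expirations and the CFQ classes/priorities mentioned above are exposed at runtime; a hedged sketch (the device name, PID and script path are placeholders, and the iosched/ directory only exists while that scheduler is selected for the device):

# cat /sys/block/xvda/queue/iosched/read_expire
500
# echo 250 > /sys/block/xvda/queue/iosched/read_expire
# ionice -c 2 -n 0 -p 1234
# ionice -c 3 /usr/local/bin/batch-job.sh

The first two commands inspect and lower the deadline scheduler's read expiration (in milliseconds); ionice sets the CFQ class and priority of an existing PID, or starts a command in the idle class.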
# cat /sys/block/xvda/queue/scheduler
noop deadline [cfq]
# echo "noop" > /sys/block/xvda/queue/scheduler
# cat /sys/block/xvda/queue/scheduler
[noop] deadline cfq

The new paradigm

Virtualization: In a virtualized environment there are other schedulers besides the one in your OS, and therefore requests can be rescheduled by another I/O scheduler. Also, when you have several layers of virtualization, concepts such as "cylinders" become a little fuzzy.

SSD drives/NVMe: For an SSD drive, random access is not as expensive as on rotational devices, since there is no arm, cylinder or platter. The I/O performance of storage devices has increased dramatically.
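The kernel has moved with the hardware: the multi-queue block layer (blk-mq) replaces the legacy single-queue schedulers above with its own set (none, mq-deadline, kyber, bfq). A hedged check on a recent kernel (the NVMe device name is an example, and the exact list depends on which scheduler modules are built or loaded):

# cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq

On fast NVMe devices "none" (the blk-mq counterpart of noop) is often the default, for the same reasons given for noop above.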

Filesystem Layer

A filesystem is a structure that groups data (using different filesystem objects) and defines how it will be stored and retrieved.

Filesystem

Providing the file system is one of the most important roles of the operating system.

UNIX File System Layout (ext2/3)

Filesystem Resources

Inodes: the basic structure that contains the information about a single physical file. A file can be a directory, a socket, a named pipe, a character or block device, a symbolic link or just a regular file. An inode is a block of information related to an entity, describing its location on disk, its size, its owner and other metadata.
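Inode metadata is easy to inspect from user space; a hedged example (the paths are placeholders):

# stat /etc/fstab
# ls -i /etc/fstab
# df -i /

stat prints the inode number, size, block count, mode, owner and a/m/c times, i.e. the fields stored in the inode rather than in the file's data blocks, while df -i reports per-filesystem inode usage.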

Extent

The traditional indirect block scheme is efficient for small or sparse files, but it causes high metadata overhead and poor performance when dealing with large files (especially for delete operations). To address this, ext4 uses extent mapping as an efficient way of representing large files. An extent is a single descriptor that covers a range of contiguous blocks. A single extent can represent up to 128MB, and up to 4 extents can be stored directly in the inode structure. For larger or highly fragmented files, an extent tree is created.
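To see how a particular file is mapped onto extents, filefrag can dump the extent list; a hedged example (the file path is a placeholder and the output format depends on the e2fsprogs version):

# filefrag -v /var/log/syslog

Each row of the output is one extent: a logical offset, the physical block range it maps to and its length, so a mostly contiguous file needs only a handful of extents.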

Filesystem Objects

Filesystem Objects

Directory

Directory (ext2/3 structure)

Journaling

Filesystem operations aren't atomic. For example, "rm /etc/file" removes the directory entry, releases the inode (back to the free inode pool) and releases the data blocks (back to the free block pool).

The journal keeps track of changes not yet committed to the filesystem in order to protect FS consistency.

Changes can be:

  • Metadata: FS consistency, potential data corruption
  • Metadata & Data: FS and data consistency

Disk space can be allocated for the journal as a file, a reserved area or a separate drive. The journal is usually implemented as a circular log with checkpoints and transactions. Recovery consists of replaying the journal transactions that have not yet been checkpointed to the filesystem.

# dumpe2fs /dev/xvda1 | grep -i jour
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Journal inode: 8
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Journal size: 128M
Journal length: 32768
Journal sequence: 89

FS comparison — extX

ext2:

Here disk space is split up into block groups, each with its own data blocks and inodes. One can pick the block size. It performs few extra write operations (no journal), and undelete is possible.

ext2

ext3

ext3 adds a journal, with three modes: "data=[journal | ordered | writeback]" (see the mount example below). It supports online resizing and is backward compatible with ext2. Undelete is NOT possible, as inodes are wiped on delete.

ext3
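A hedged sketch of selecting the journaling mode (device and mount point are placeholders; the data mode cannot be changed on a plain remount):

# mount -o data=journal /dev/xvdb1 /mnt/data
# tune2fs -o journal_data_ordered /dev/xvdb1

The mount option sets the mode for this mount only, while tune2fs -o records a default data mode in the superblock so that later mounts pick it up automatically.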

ext4

ext4 adds journal checksums. Extents replace the block mapping scheme. Pre-allocation, delayed allocation and a multiblock allocator are supported. It has a fixed block size (page-sized) and is backward compatible with ext2/ext3.

ext4

Virtual File System (VFS)

VFS is a kernel interface for abstracting file system types from system calls. VFS introduces a common file model to represent all supported filesystems. Syscalls don't care about (and don't know) the implementation details of the underlying filesystem. All filesystems must map their own concepts onto the common file model (e.g. the FAT filesystem does not have inodes or superblocks). All file access goes through VFS, including ramfs, tmpfs and pseudo-filesystems like /proc, /sys, etc.

Linux Filesystem Architecture

The main components of the common file model are:

  • Superblock (information about mounted filesystem)
  • Inode (information about a specific file)
  • File (information about an open file)
  • Dentry (information about directory entry)

Other components include:

  • file_system_type: describes a FS and its capabilities, used to register a FS in the Kernel
  • vfsmount: contains information like mount point, location, flags, etc.

(vfsmount exists to allow Linux to mount the same filesystem at different mount points. The same superblock can correspond to many vfsmounts.)
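This is easy to observe from user space; in the hedged sketch below (device and directories are placeholders) the same block device is mounted twice, producing one superblock and two vfsmounts, which findmnt then lists:

# mount /dev/xvdb1 /mnt/a
# mount /dev/xvdb1 /mnt/b
# findmnt -S /dev/xvdb1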

struct file_system_type

It represents an abstract file system of a specific type, NOT a specific instance of that file system. It describes the file system name and flags, and provides methods to get/destroy a superblock.

# lsmod | grep xfs
# ls /sys/fs
bpf cgroup ext4 pstore
# mount -t xfs /dev/xvdj /rescue
# ls /sys/fs
bpf cgroup ext4 pstore xfs
# lsmod | grep xfs
xfs 1400832 1
libcrc32c 16384 3 nf_conntrack,nf_nat,xfs

This struct represents the file system and is used to register the file system in the kernel with the register_filesystem() function.

It is created when the specific file system type module is loaded. Its flags indicate, for example, whether a device must be specified or whether the filesystem can be mounted in a user-visible namespace. It is then the file system driver's job to return a superblock that conforms to the VFS standards.

# cat /proc/filesystems
nodev sysfs
nodev tmpfs
nodev bdev
nodev proc
nodev cgroup
nodev cgroup2
nodev cpuset
nodev devtmpfs
nodev binfmt_misc
nodev debugfs
nodev tracefs
nodev sockfs
nodev bpf
nodev pipefs
nodev ramfs
nodev hugetlbfs
nodev rpc_pipefs
nodev devpts
ext3
ext2
ext4
nodev nfs
nodev nfs4
xfs

struct super_block

It keeps high level metadata of a mounted file system:

  • available inodes, free blocks, etc.
  • super_block structure is populated with the superblock from disk (if any)
  • pointer to the root inode of the file system (necessary for lookup operations)

This structure keeps s_op, the superblock operations table:

  • pointer to FS specific functions that operate on the filesystem and its inodes. Example: read_inode(), statfs(), etc.
  • Some operations are optional (NULL)

All super_block structs are part of a linked list. s_op is a pointer to a structure that holds the FS-specific superblock operations, for example read_inode(inode) or statfs() (see the example below).

superblock
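The superblock-level counters surface through the statfs() operation mentioned above; a hedged check (the mount point is an example):

# stat -f /

stat -f reports the filesystem type, block size and the total/free block and inode counts, i.e. data held in the in-memory super_block rather than in any particular file's inode.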

struct inode

It keeps all kernel required information to manipulate a file or directory:

  • represents a file in the file system
  • Inode number uniquely identifies a file within the FS
  • Mode, owner, a/m/ctime, super_block pointer

On Unix-like filesystems the structure is populated with the on-disk inode information. The set of operations VFS can run on a given inode is provided by the i_op pointer to an inode operations table.

struct dentry

It is an abstraction for a directory entry. Each component in a path becomes a dentry; dentries describe the tree structure of the FS. All dentries have a parent, except the root dentry. Dentry objects exist only in memory and are not stored on disk; they are used to improve the performance of FS operations. For this purpose they are kept in the dcache, a reclaimable slab cache (see the snippet below). Accessing the file /etc/fstab requires the dentries for "/", "/etc" and "/etc/fstab"; these dentries need to be created if they aren't already in the cache in order to reach the fstab file. Inodes are also cached in memory.

struct dentry
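The dcache can be observed, and for cold-cache experiments dropped, through procfs; a hedged sketch:

# cat /proc/sys/fs/dentry-state
# grep dentry /proc/slabinfo
# echo 2 > /proc/sys/vm/drop_caches

The first two fields of dentry-state are the total and unused (reclaimable) dentry counts; writing 2 to drop_caches asks the kernel to reclaim dentries and inodes.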

VFS — File Object

This struct represents a file opened by a process. It is the in-memory representation of an open file and doesn't correspond to any on-disk data. The object is created as a result of an open() syscall. It represents a process's view of an open file (multiple processes can open the same file). The struct file keeps (among other things):

  • a pointer to the related dentry that points to the inode backing the file
  • the current offset within the file
  • access mode

(In newer kernels the file object has a pointer to struct path, and only from there does one get to struct dentry; it also has a cached inode pointer. The chain is: struct task_struct -> struct files_struct -> struct fdtable (fd array) -> struct file ("context": mode, position) -> struct path -> struct dentry; see include/linux/fs.h)
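The per-process file objects are visible through procfs; in the hedged sketch below $$ is the current shell and fd 0 is just an example descriptor:

# ls -l /proc/$$/fd
# cat /proc/$$/fdinfo/0

fd/ lists the open descriptors as symlinks to their backing paths (dentries), and fdinfo/<fd> shows the struct file state discussed above: the current offset (pos), the open flags (access mode) and the mount ID.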

There are some filesystem limits that need to be considered:

Max number of opened files system wide:

  • file-max
  • file-nr
  • Error “VFS: file-max limit <number> reached”
# cat /proc/sys/fs/file-max
9223372036854775807
# cat /proc/sys/fs/file-nr
1696 0 9223372036854775807

Max number of files a process can open (the actual limit depends on the RLIMIT_NOFILE resource limit; see the ulimit example below):

  • nr_open
# cat /proc/sys/fs/nr_open
1048576
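The per-process limit is governed by RLIMIT_NOFILE, which can be read with ulimit or prlimit (the PID is a placeholder):

# ulimit -n
# ulimit -Hn
# prlimit --nofile --pid 1234

ulimit -n and -Hn print the soft and hard limits for the current shell; prlimit can inspect (or change) the limit of another running process.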

Max number of inodes the kernel can allocate

  • inode-state
  • inode-max
  • inode-nr (first two items of inode-state)
# cat /proc/sys/fs/inode-state
30155 0 0 0 0 0 0
