Linux Memory Management — All you need to know
What is memory?
Fundamentally, memory is just a form of storage. Computers have many layers of storage: CPU registers, CPU cache, RAM, and disk. Accessing each layer takes time; the fewer cycles an access needs, the faster it is.
The current 64-bit Intel architecture has 16 general-purpose CPU registers (plus additional FPU and MMX registers). A memory address is the size of a CPU register, so 32-bit systems can address up to 4GB of memory (2³² different addresses) and 64-bit systems can in theory address up to 16EB of memory (2⁶⁴ different addresses).
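The arithmetic behind those limits can be checked with a short sketch. The function name and the GiB/EiB constants are illustrative only, and the figures assume byte-addressable memory.

```python
# Number of distinct byte addresses for a given address width.
def addressable_bytes(address_bits: int) -> int:
    return 2 ** address_bits

GIB = 2 ** 30  # bytes in one gibibyte
EIB = 2 ** 60  # bytes in one exbibyte

print(addressable_bytes(32) // GIB, "GiB")  # 4 GiB
print(addressable_bytes(64) // EIB, "EiB")  # 16 EiB
```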
Early computers gave programs direct access to physical memory.
Linux divides memory into “zones”; 32-bit and 64-bit systems have different memory zones.
32-bit
- ZONE_DMA: The low 16MB, for DMA-suitable memory used by ancient ISA devices
- ZONE_NORMAL: RAM from 16MB up to 896MB
- ZONE_HIGHMEM: All RAM above 896MB
64-bit
- ZONE_DMA: The low 16MB, for DMA-suitable memory used by ancient ISA devices
- ZONE_DMA32: From 16MB to 4GB for DMA-suitable memory in a 32-bit addressable area
- ZONE_NORMAL: All RAM above 4GB
- ZONE_MOVABLE: To deal with fragmentation — map of movable pages
Memory allocations may come from a more restrictive zone if that is where free memory is available.
Zones and memory are attached to a ‘node’. Large servers may have more than one node (use /proc/zoneinfo to find information about each memory zone). Each zone has specific attributes that dictate whether memory can be allocated from it (thresholds/watermarks) and how memory reclaim will behave.
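As a rough sketch, the per-zone free page count and watermarks can be pulled out of /proc/zoneinfo-style text like this. The sample text is made up for illustration and heavily trimmed (the real file has many more fields), and the function names are hypothetical.

```python
import re

ZONE_HEADER = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")

def parse_zoneinfo(text):
    """Return {(node, zone): {"free", "min", "low", "high"}} in pages."""
    zones = {}
    current = None
    for line in text.splitlines():
        header = ZONE_HEADER.match(line)
        if header:
            key = (int(header.group(1)), header.group(2))
            current = zones.setdefault(key, {})
            continue
        if current is None:
            continue
        fields = line.split()
        if len(fields) == 3 and fields[:2] == ["pages", "free"]:
            current.setdefault("free", int(fields[2]))
        elif (len(fields) == 2 and fields[0] in ("min", "low", "high")
              and fields[1].isdigit()):
            # Keep only the first occurrence per zone (the watermark).
            current.setdefault(fields[0], int(fields[1]))
    return zones

def zones_below_low(zones):
    """Zones whose free page count is under the 'low' watermark."""
    return [key for key, z in zones.items()
            if "free" in z and "low" in z and z["free"] < z["low"]]

# Illustrative sample only, not captured from a real machine.
SAMPLE = """\
Node 0, zone      DMA
  pages free     3840
        min      16
        low      20
        high     24
Node 0, zone   Normal
  pages free     1200
        min      1400
        low      1750
        high     2100
"""

zones = parse_zoneinfo(SAMPLE)
print(zones_below_low(zones))
```

On a Linux machine the same functions could be fed the contents of /proc/zoneinfo instead of the sample string.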
Memory (de)allocation may differ for entities such as user-mode processes and drivers. To satisfy these different requirements, Linux uses different memory allocators. The kernel checks the free-memory thresholds (watermarks) in the required zone; if the allocation fails, it goes through page reclaim and tries to free memory.
Buddy System Allocator: Each zone's free memory is tracked in blocks of 11 orders: 2⁰, 2¹, 2², …, 2¹⁰ pages. The largest block of contiguous memory is 2¹⁰ × page size, or 4MB with 4KB pages (see /proc/buddyinfo).
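A small sketch of the order arithmetic, assuming 4KB pages and a maximum order of 10 as described above. The helper names are illustrative and are not kernel functions.

```python
PAGE_SIZE = 4096   # assumed 4KB pages
MAX_ORDER = 10     # highest order described in the text

def block_bytes(order: int, page_size: int = PAGE_SIZE) -> int:
    """Size of one buddy block of the given order."""
    return (2 ** order) * page_size

def smallest_order(nbytes: int, page_size: int = PAGE_SIZE) -> int:
    """Smallest order whose block can hold nbytes of contiguous memory."""
    pages = max(1, -(-nbytes // page_size))   # ceiling division
    order = (pages - 1).bit_length()          # round up to a power of two
    if order > MAX_ORDER:
        raise ValueError("request is larger than the largest buddy block")
    return order

print(block_bytes(MAX_ORDER) // 2 ** 20, "MiB")  # 4 MiB
print(smallest_order(12 * 1024))                 # 3 pages round up to order 2
```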
Nowadays main memory is not allocated directly, only virtual memory is.
Virtual Memory
The kernel has full access to the system’s memory and must allow processes to safely access this memory as they require it. Often the first step in doing this is virtual addressing, usually achieved by paging and/or segmentation.
Virtual addressing allows the kernel to make a given physical address appear to be another address, the virtual address. Virtual address spaces may be different for different processes; the memory that one process accesses at particular (virtual) address may be different memory from what another process accesses at the same address. This allows every program to behave as if it is the only one (apart from the kernel) running, and thus prevents applications from crashing each other.
Virtual addressing also allows creation of virtual partitions of memory in two disjoint areas, one being reserved for the kernel (kernel space) and the other for the applications (user space). The applications are not permitted by the processor to address kernel memory, thus preventing an application from damaging the running kernel.
Virtual memory is mapped to RAM and is allocated to processes. The Linux kernel creates a layer of abstraction and indirection:
- Physical (main) memory is provided as a map to the OS
- This map is divided into page frames of 4KB each
- Each GB of memory is ~262K (2¹⁸) page frames
Main memory is never allocated directly; only virtual memory, which is mapped to main memory, is allocated to processes.
The virtual memory is mapped back to real memory via page tables. A page table is a per-process data structure.
These virtual addresses are translated to physical addresses by the Memory Management Unit (MMU). When accessing a page, the processor sends the virtual address to the MMU, and the MMU fetches the page table entry.
To keep the translation table to a manageable size, memory is divided into fixed-size chunks, pages (virtual) and frames (physical), and a lookup table maps one to the other.
Page & Frame size are identical. Page points to physical memory frames. Each page/frame = 4KB.
Page Table maps page number to frame number. Page Table is stored inside the physical memory (in kernel space).
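A toy model of that lookup, assuming 4KB pages and a flat, single-level table held in a dictionary. Real page tables are multi-level structures walked by hardware; the table contents below are made up.

```python
PAGE_SIZE = 4096  # assumed 4KB pages and frames

def translate(virtual_address: int, page_table: dict) -> int:
    """Translate a virtual address using a flat page -> frame mapping."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        # No mapping for this page: a real system would raise a page fault.
        raise LookupError(f"page fault on page {page_number}")
    frame_number = page_table[page_number]
    return frame_number * PAGE_SIZE + offset

# Hypothetical mapping: virtual page 0 -> frame 5, virtual page 1 -> frame 2.
page_table = {0: 5, 1: 2}

print(translate(4100, page_table))  # page 1, offset 4 -> 2 * 4096 + 4 = 8196
```

The offset within the page is carried over unchanged; only the page number is replaced by a frame number.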
Problems with Page Table:
- Each Page translation requires a page table lookup
- Two memory accesses are required for each data access
- Two memory accesses are required for each instruction (because programs are stored in main memory)
To speed up memory access, the CPU implements an in-CPU cache called the Translation Lookaside Buffer (TLB). It is associative memory inside the CPU with constant-time lookup. The TLB maps virtual page numbers to physical frame numbers, but because it is a cache it holds only the most recently accessed page table entries. It is fast, expensive, and limited in size (typically 8–4096 entries).
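The caching behaviour can be sketched with a toy TLB in front of a page table. The least-recently-used eviction here is an assumption made for illustration; real TLB replacement policies vary by processor, and the class name is hypothetical.

```python
from collections import OrderedDict

class ToyTLB:
    """Small cache of page -> frame translations with LRU eviction."""

    def __init__(self, capacity: int, page_table: dict):
        self.capacity = capacity
        self.page_table = page_table
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number: int) -> int:
        if page_number in self.entries:
            self.entries.move_to_end(page_number)  # mark as recently used
            self.hits += 1
            return self.entries[page_number]
        # TLB miss: fall back to the (slower) page table lookup.
        self.misses += 1
        frame = self.page_table[page_number]
        self.entries[page_number] = frame
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)       # evict least recently used
        return frame

tlb = ToyTLB(capacity=2, page_table={0: 7, 1: 3, 2: 9})
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page)
print(tlb.hits, tlb.misses)  # 1 hit, 4 misses
```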
Virtual Memory Allocation:
The kernel allocates some memory to processes as soon as they are created. This is the process's virtual memory area (process address space); on Linux its layout can be inspected in /proc/<pid>/maps.
User processes can also request memory dynamically during execution using malloc(), which is backed by the brk() and mmap() system calls.
Virtual memory allocation is carried out through paging, more specifically on-demand paging. Overcommit allows more memory to be allocated than is physically available, which may result in anonymous paging and swapping. If memory cannot be allocated due to memory pressure, the OOM Killer is triggered. Let's explore each of these in detail.
Paging
Paging is the movement of pages between main memory and storage. It allows partially loaded programs, and programs larger than main memory, to execute.
Unlike swapping, which moves an entire program in and out, paging moves only pages, which are relatively small (e.g., 4KB).
There are several types of paging, such as file system paging and anonymous paging. Let's look into them.
File System Paging
It involves reading and writing pages of memory-mapped files (mmap()) and of files on file systems that use the page cache.
If a file system page has been modified in main memory, it is a “dirty” page and must be written to disk before it can be freed. If it has not been modified, it is a “clean” page, and the page-out simply frees the memory immediately.
Anonymous Paging
Anonymous paging is private to processes (process’ heaps and stacks). It is called anonymous due to lack of a named location in the operating system, such as file system path.
Anonymous page-outs require the data to be moved to a physical swap device or swap file; this is what Linux calls ‘swapping’. Anonymous paging, or swapping, hurts performance and is thus considered “bad” paging.
An application that accesses anonymous pages that have been paged out triggers an anonymous page-in, which requires a blocking I/O call to the disk.
Page outs themselves may not negatively affect performance as they can be done asynchronously, while page ins are synchronous.
On Demand Paging
On-demand paging is the act of mapping pages of virtual memory to main memory on demand. It defers the CPU overhead of creating the mappings until they are needed and accessed, instead of when memory is first allocated.
A page fault occurs when a page is accessed that has no mapping from virtual memory to main memory.
If the mapping can be satisfied from another page in memory, it is called a minor fault which may occur for mapping a new page from available memory. It can also occur when mapping to an existing page, such as reading a page from a shared library.
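One way to observe this is to count the process's minor faults around an anonymous mapping. This is a sketch for Unix-like systems only (it relies on Python's resource module); the exact counts depend on the kernel, the page size, and whether huge pages are in use, so no specific numbers are claimed.

```python
import mmap
import resource

def minor_faults() -> int:
    """Minor page faults taken by this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

def count_faults_while_touching(size: int):
    """Map `size` bytes anonymously, then write one byte per page.

    Returns (faults during mmap, faults during the writes).
    """
    start = minor_faults()
    region = mmap.mmap(-1, size)          # allocated, but not yet populated
    after_map = minor_faults()
    for offset in range(0, size, mmap.PAGESIZE):
        region[offset] = 1                # first touch faults the page in
    after_touch = minor_faults()
    region.close()
    return after_map - start, after_touch - after_map

during_map, during_touch = count_faults_while_touching(16 * 1024 * 1024)
print("faults while mapping: ", during_map)
print("faults while touching:", during_touch)
```

On a typical Linux system most of the faults are expected to appear in the second figure, since the mapping itself does not populate any pages.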
The UNIX virtual memory has 4 states for a page:
1. Unallocated
2. Allocated, unmapped (Unpopulated and not yet faulted)
3. Allocated, mapped to main memory
4. Allocated, mapped to swap device
State 2 is the default state. The transition to state 3 is a page fault: if it requires disk I/O it is a major page fault, otherwise it is a minor page fault.
State 4 is reached if the page is paged out due to the system memory pressure.
We can define two memory sets from these states:
- Resident set size (RSS): the size of allocated main memory pages (state 3)
- Virtual memory size: the size of all allocated pages (states 2 + 3 + 4)
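The two definitions can be written down as a toy model of the four page states. The enum names and the 4KB page size are illustrative assumptions; this is not how the kernel represents pages.

```python
from enum import Enum

PAGE_SIZE = 4096  # assumed 4KB pages

class PageState(Enum):
    UNALLOCATED = 1
    ALLOCATED_UNMAPPED = 2
    MAPPED_MAIN_MEMORY = 3
    MAPPED_SWAP = 4

def rss_bytes(pages) -> int:
    """Resident set size: only pages mapped to main memory (state 3)."""
    return PAGE_SIZE * sum(1 for p in pages
                           if p is PageState.MAPPED_MAIN_MEMORY)

def virtual_bytes(pages) -> int:
    """Virtual memory size: every allocated page (states 2, 3 and 4)."""
    return PAGE_SIZE * sum(1 for p in pages
                           if p is not PageState.UNALLOCATED)

pages = [
    PageState.ALLOCATED_UNMAPPED, PageState.ALLOCATED_UNMAPPED,
    PageState.MAPPED_MAIN_MEMORY, PageState.MAPPED_MAIN_MEMORY,
    PageState.MAPPED_MAIN_MEMORY, PageState.MAPPED_SWAP,
    PageState.UNALLOCATED,
]
print(rss_bytes(pages), virtual_bytes(pages))  # 12288 24576
```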
Overcommit
Overcommit allows more memory to be allocated than is physically available (more than main memory + swap). It depends on on-demand paging and on applications using only part of the memory they allocate.
It allows malloc() requests to succeed instead of failing, as the system will rarely decline requests for virtual memory.
The consequences of overcommit depend on tunables and on how the kernel manages memory pressure; most frequently you'll see processes killed by the OOM (out of memory) Killer.
Swapping
As previously discussed, swapping is the act of moving an entire process between main memory and the swap device or file. The thread structures, heap, and stack must all be swapped. Data from file systems that is unmodified can simply be dropped.
Processes that are swapped out are still known by the kernel as metadata is still resident in kernel memory.
The kernel prioritizes processes for swapping based on factors such as thread priority, wait time, and process size. The longer a process has been waiting and the smaller it is, the higher in the queue it will be.
Modern Linux does not perform traditional swapping at all; it uses paging to a swap device or file instead. Some UNIX systems still perform actual swapping.
In Linux, the kernel uses various caches to optimize performance.
File Systems Cache
Truly free memory does nothing useful, so the OS attempts to use spare memory to cache the file system. The kernel can also quickly free memory from the file system cache when it is needed. This process is transparent to applications. Logical I/O latency is much lower, as requests are served from main memory.
Cache grows over time and “free” memory shrinks. Regular caching is used to improve read performance and buffering inside the cache is used to improve write performance.
Page cache
In modern Linux the buffer cache is stored in the page cache and is used to buffer disk I/O writes.
The page cache is dynamic, and its current size can be checked in /proc/meminfo. It is used to speed up directory and file I/O; both virtual memory pages and file system pages are stored in it. Dirty file system pages are written out by per-device flusher threads (flush).
Flushing happens after:
- An interval (default 30s)
- sync(), fsync(), msync() system calls
- Too many dirty pages (dirty_ratio)
- No available page cache pages
If the system runs short of memory, kswapd will look for dirty pages to write to disk. All I/O goes through the page cache unless it is explicitly told not to (direct I/O). As a result, all writes can block if the page cache has filled completely, and when all writes are blocked the operating system tends to grind to a halt.
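From an application's point of view, forcing dirty pages out looks roughly like the sketch below, which writes a file and then calls fsync() rather than waiting for the flusher threads. The helper name and the temporary file are illustrative only.

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the kernel to flush it to storage before returning."""
    with open(path, "wb") as f:
        f.write(data)          # data sits in Python's userspace buffer
        f.flush()              # handed to the kernel: now dirty page cache pages
        os.fsync(f.fileno())   # request that the dirty pages reach the device

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.dat")
    durable_write(target, b"hello page cache")
    with open(target, "rb") as f:
        print(f.read())
```

Without the fsync() call the write would still succeed, but the data could remain only in the page cache until one of the flush conditions listed above is met.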
Dropping Cache
It is possible to drop the page, dentry (directory entry cache), and inode caches in Linux, either to forcefully free up memory, or to test file system performance prior to anything being cached.
To drop the page cache, use “echo 1 > /proc/sys/vm/drop_caches”
To drop the dentry and inode caches, use “echo 2 > /proc/sys/vm/drop_caches”
To drop both, use “echo 3 > /proc/sys/vm/drop_caches”
OOM Killer
Linux has quite a few methods to manage memory — paging, shrinking caches, removing things from caches, and so on.
Sometimes this isn’t enough — enter OOM Killer.
The OOM Killer will sacrifice processes to keep the system online. This also kills any processes that share the same mm_struct as the selected process.
A process can be made immune by setting /proc/<pid>/oom_score_adj to -1000, while +1000 puts a target on the process's head. The process with the highest OOM score (/proc/<pid>/oom_score) is sacrificed. The OOM score calculation is basically “How much of the memory available to the process is actually in use?”; 100% would result in a score of 1000.
Root owned processes get a slight handicap — 30 is subtracted from the OOM score.
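The scoring rules described above can be sketched as a small function. This is a simplified model of the description in this article, not the kernel's actual scoring code; the parameter names, the rounding, and the clamping to the 0 to 1000 range are assumptions made for illustration.

```python
def oom_score(used_bytes: int, available_bytes: int,
              is_root: bool = False, adjustment: int = 0) -> int:
    """Simplified OOM score: share of available memory in use, scaled to 1000."""
    if adjustment <= -1000:
        return 0                      # treated as immune in this model
    score = 1000 * used_bytes / available_bytes
    if is_root:
        score -= 30                   # the root handicap described above
    score += adjustment               # -1000..+1000 user-supplied bias
    return max(0, min(1000, round(score)))

GIB = 2 ** 30
print(oom_score(4 * GIB, 8 * GIB))                    # 500
print(oom_score(4 * GIB, 8 * GIB, is_root=True))      # 470
print(oom_score(4 * GIB, 8 * GIB, adjustment=-1000))  # 0
```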
It is only triggered for low order allocations, e.g., 2³ or less
- Pages are allocated in powers of 2 — so a 3rd order allocation would be 2³ (8) pages, with the total size being determined by your page size
What causes this?
Most often, the system really is out of memory. If /proc/meminfo shows SwapFree and MemFree at ~1% of their totals or lower, this is likely the case.
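A rough sketch of that check, parsing /proc/meminfo-style text and reporting free memory and free swap as percentages of their totals. The sample values are invented for illustration and the function names are hypothetical.

```python
def parse_meminfo(text: str) -> dict:
    """Return {field: value in kB} for 'Name:  123 kB' style lines."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info

def percent_free(info: dict, prefix: str):
    """Percentage free for 'Mem' or 'Swap'; None if the total is zero."""
    total = info.get(prefix + "Total", 0)
    if total == 0:
        return None                   # e.g. no swap configured
    return 100 * info[prefix + "Free"] / total

# Illustrative sample only, not captured from a real machine.
SAMPLE_MEMINFO = """\
MemTotal:       16000000 kB
MemFree:          120000 kB
SwapTotal:       2000000 kB
SwapFree:          10000 kB
"""

info = parse_meminfo(SAMPLE_MEMINFO)
print(percent_free(info, "Mem"), percent_free(info, "Swap"))  # 0.75 0.5
```

On a Linux machine the text could come from reading /proc/meminfo; values at or below roughly 1% match the situation described above.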
(Much) more rarely, a kernel data structure or a memory leak can be the culprit. Check /proc/meminfo for SwapFree and MemFree, and then /proc/slabinfo. A telltale sign is a high count of task_struct objects, which could indicate that the system forked so many processes that it ran out of kernel memory. /proc/slabinfo also shows which objects are using the most memory.
SwapFree can be misleading when a program uses mlock() or HugeTLB, because memory held that way cannot be swapped. SwapFree is also not relevant on most instances with default setups, as very few have swap enabled.
Again, most cases are the system actually running out of memory. Tracking process memory usage and finding the offender is important.
The OOM Killer can also be triggered by specific memory allocation requirements:
- Specific memory zone
- Specific GFP Flag
- Specific allocation Order
Segmentation Fault (segfault)
Segfaults are access violations. Hardware with memory protection will notify the OS that a memory access violation has occurred. This might be caused by trying to read a part of memory that the application is not allowed to access, or trying to use a section of memory in a way that is not allowed, such as trying to write to read-only memory.
Segfaults are ultimately caused by software errors, most often seen in C programs where pointers reference a portion of virtual memory they are not allowed to access.
Some programs have exception handling built in for segfaults, but most do not, and a segfault will result in the process crashing and potentially generating a core dump.
- Core dumps are files containing a process’s memory address space at a specific time — in this case, the time of the crash. In practice, you often see other pieces of the program state also dumped, such as processor registers, which often include the program counter and stack pointer, general memory management information, and other processor and operating system flags.
Page Allocation Failure
It implies that the system has failed to allocate a page.
It can be caused by memory fragmentation: the available memory is so fragmented that there is not enough contiguous space for allocations that require physically contiguous pages.
It can also be caused by a general lack of memory. As discussed earlier, the OOM Killer doesn't trigger on high-order allocations, so if you try to allocate a larger set of pages than would trigger the OOM Killer while low on memory, you might see a page allocation failure instead.
Null Pointer Dereference
A pointer is a variable whose value is a memory address, typically the address of another variable.
A null pointer dereference occurs when a pointer holding the NULL value is used under the assumption that it points to a valid memory address.
It almost always results in the process crashing, unless exception handling is built in, as with segfaults.
Machine Check Exceptions
MCEs are a hardware error, thrown when the CPU detects a hardware problem. Main potential causes are errors with the system bus, memory, and CPU cache.
Huge pages & Transparent Huge Pages
As discussed earlier, pages are generally 4KB in size; however, you can change this. Huge pages allow for pages that are 2MB or 1GB in size.
Modern processors can cache only a limited number of page table entries in the TLB. With larger pages, the same number of entries covers more memory, so the processor falls back to the slower page table lookup less often.
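The effect can be put into numbers as "TLB reach", the amount of memory a fully populated TLB can cover. The 1024-entry figure below is a hypothetical example, not a property of any particular CPU.

```python
KIB, MIB, GIB = 2 ** 10, 2 ** 20, 2 ** 30

def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_size

ENTRIES = 1024  # hypothetical TLB size, for illustration only

print(tlb_reach(ENTRIES, 4 * KIB) // MIB, "MiB with 4KB pages")   # 4 MiB
print(tlb_reach(ENTRIES, 2 * MIB) // GIB, "GiB with 2MB pages")   # 2 GiB
print(tlb_reach(ENTRIES, 1 * GIB) // GIB, "GiB with 1GB pages")   # 1024 GiB
```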
It requires applications to be aware and coded to take advantage of them.
Transparent Huge Pages are an attempt to abstract this so that everything can take advantage of huge pages. However, this can cause odd behavior in some applications, and some vendors such as Red Hat explicitly state that they are problematic in certain workloads such as databases.
Setting /sys/kernel/mm/transparent_hugepage/enabled to never will disable them.
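Reading that file back shows every available mode, with the active one in square brackets (for example "always madvise [never]"). A minimal sketch of extracting the active mode follows; the function name is illustrative and the strings are examples rather than output from a specific system.

```python
import re

def active_thp_mode(text: str):
    """Return the bracketed (active) mode from the 'enabled' file contents."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None

print(active_thp_mode("always madvise [never]"))   # never
print(active_thp_mode("[always] madvise never"))   # always
```

On a Linux machine the text would come from reading /sys/kernel/mm/transparent_hugepage/enabled.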
Changing overcommit settings
You can modify the system overcommit behavior by modifying /proc/sys/vm/overcommit_memory
(You generally should not modify this setting.)
- 0: Heuristic overcommit. Ensures “crazy” allocations fail while allowing more normal overallocation. (Default)
- 1: Always allow overcommit
- 2: Never allow overcommit. Total address space is limited to swap + a configurable percentage of physical RAM. The percentage defaults to 50% and is set in /proc/sys/vm/overcommit_ratio
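For mode 2, the limit described above works out as in the sketch below. It follows the simplified formula in the list (swap plus a percentage of RAM) and ignores refinements the kernel may apply, such as memory reserved for huge pages; the function name and the example sizes are illustrative.

```python
GIB = 2 ** 30

def commit_limit(ram_bytes: int, swap_bytes: int,
                 overcommit_ratio: int = 50) -> int:
    """Address space limit in mode 2: swap + overcommit_ratio% of RAM."""
    return swap_bytes + ram_bytes * overcommit_ratio // 100

# Hypothetical machine: 16 GiB of RAM, 4 GiB of swap, default ratio of 50.
print(commit_limit(16 * GIB, 4 * GIB) // GIB, "GiB")  # 12 GiB
```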
File System repair and Memory Requirements
When checking or repairing a file system, you can see fairly extreme memory requirements, particularly when a large file system is involved. Specifics vary, but XFS, for example, is particularly onerous: SGI recommends 2GB of RAM per TB of space, and 200MB of RAM per million inodes. This can be worked around with a large swap partition or, with e2fsck, a scratch file (scratch files can be configured in /etc/e2fsck.conf).
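As a back-of-the-envelope sketch, the XFS guideline quoted above can be turned into an estimate. Adding the two terms together is an assumption (the guideline does not say whether they combine), 1GB is treated as 1000MB for simplicity, and the function name is hypothetical.

```python
def xfs_repair_ram_estimate_mb(size_tb: int, millions_of_inodes: int) -> int:
    """Rough RAM estimate in MB: 2GB per TB plus 200MB per million inodes."""
    return 2000 * size_tb + 200 * millions_of_inodes

# Hypothetical 10TB file system holding 50 million inodes.
print(xfs_repair_ram_estimate_mb(10, 50), "MB")  # 30000 MB
```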
When in doubt, it is safer to side with more RAM and swap — repairs failing due to a lack of memory can be damaging to the data you are trying to save.
Tools
vmstat -Sm 1
ps aux | head -n 5
atop
sar
slabtop
References