Understanding Container Checkpointing

Hemant Rawat
Jul 4, 2024

Containers are designed to be stateless. However, some workloads, such as HPC and AI/ML jobs, run for a long time and may need their state preserved.

Container checkpointing is the process of saving the state of a running container to disk, so that the state does not have to be externalized to caches or databases by the application itself. This includes the container’s memory, process states, network connections, and other essential data.

Checkpointing can be applied to:

  • Back up the runtime state of stateful containers
  • Speed up containers with long start-up / preparation times
  • Store incremental progress for containers running long computations
  • Live-migrate containers from one host to another

How is container checkpointing done?

On Linux, multiple components are essential to the checkpoint and restore process:

  • CRIU: provides checkpoint / restore capabilities for Linux processes
  • runc: a binary that gives users easy access to cgroups. It also provides built-in checkpoint / restore capabilities for containers
  • cgroup v2: is a foundation for containers in the Linux kernel and offers the freeze and unfreeze capabilities that container checkpointing relies on
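
To make the freeze / unfreeze part concrete, here is a minimal sketch of how a cgroup v2 group can be frozen and thawed through its control files. cgroup v2 exposes a `cgroup.freeze` file (write 1 to freeze, 0 to thaw) and reports the result in `cgroup.events`. The helper names `set_frozen` and `is_frozen` are made up for illustration, and the directory you pass in (typically somewhere under /sys/fs/cgroup) is an assumption about your system. This is not how runc itself is implemented.

```python
from pathlib import Path


def set_frozen(cgroup_dir: str, frozen: bool) -> None:
    """Freeze or thaw every process in a cgroup v2 group (hypothetical helper)."""
    # Writing "1" to cgroup.freeze freezes the group, writing "0" thaws it.
    Path(cgroup_dir, "cgroup.freeze").write_text("1" if frozen else "0")


def is_frozen(cgroup_dir: str) -> bool:
    """Report whether the kernel says the group is fully frozen."""
    # cgroup.events holds "key value" lines such as "populated 1" and "frozen 0".
    for line in Path(cgroup_dir, "cgroup.events").read_text().splitlines():
        key, _, value = line.partition(" ")
        if key == "frozen":
            return value.strip() == "1"
    return False
```

Freezing is asynchronous, so a caller would normally write to `cgroup.freeze` and then poll `is_frozen` until it returns True. Writing to these files on a real system requires sufficient privileges.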

CRIU (Checkpoint/Restore in Userspace)

Checkpoint/Restore In Userspace, or CRIU, is an open-source project that can checkpoint and restore Linux processes and has built-in container support. It can freeze a running container (or an individual application) and checkpoint its state to disk. The saved data can be used to restore the application and run it exactly as it was at the time of the freeze. CRIU relies on existing Linux kernel interfaces (such as ptrace and /proc) to serialize dumps of memory, metadata and network connections into an image. (CRIU is written in C and compiles to a single binary.)

  • CRIU has built-in de-duplication functionality to reduce image size
  • The image created doesn’t include the RootFS (Container image)
  • It provides a built-in “page-server” to support “live-migration” and “lazy-migration” by sending memory pages over the network instead of serializing them to disk
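
As a rough illustration of how these features surface on the command line, the sketch below only builds argument lists for the `criu` CLI. It does not run anything. The function names are made up for this example, and the paths, PID, host and port are placeholders. The flags used (`--tree`, `--images-dir`, `--leave-running`, `--page-server`, `--address`, `--port`) are standard CRIU options, but check `criu --help` on your version before relying on them.

```python
def criu_dump_cmd(pid, images_dir, leave_running=False, page_server=None):
    """Build a `criu dump` command for the process tree rooted at pid."""
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if leave_running:
        # Keep the process running after the dump instead of killing it.
        cmd.append("--leave-running")
    if page_server is not None:
        # Send memory pages to a remote page server instead of local disk.
        host, port = page_server
        cmd += ["--page-server", "--address", host, "--port", str(port)]
    return cmd


def criu_page_server_cmd(images_dir, port):
    """Build the command run on the destination host to receive pages."""
    return ["criu", "page-server", "--images-dir", images_dir, "--port", str(port)]


def criu_restore_cmd(images_dir):
    """Build a `criu restore` command for a previously dumped image."""
    return ["criu", "restore", "--images-dir", images_dir]
```

Each list can be handed to `subprocess.run(cmd, check=True)`. CRIU generally needs root (or the relevant capabilities) to dump and restore processes.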

Process:

  1. Seize the process using ptrace()
  2. Collect the process tree and freeze it, reading details from /proc/<PID>/*
  3. Inject parasite code and collect process resources
  4. Serialize data into binary files (images) with protobuf
  5. Clean up: remove the parasite code and detach from the process
  6. Restoring: re-create each PID/TID with clone() and restore its state from the images
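
The "collect the process tree" step can be sketched in a few lines. On Linux, /proc/<PID>/task/<TID>/children lists the child PIDs of each thread (when the kernel exposes that file), so walking it recursively yields the whole tree. The function below is a simplified illustration, not CRIU's actual code: CRIU seizes and freezes tasks as it goes, whereas this sketch just reads files and can race with processes that fork or exit. The `proc_root` parameter is an assumption added so the walk can also be pointed at a test directory.

```python
from pathlib import Path
from typing import List


def collect_tree(pid: int, proc_root: str = "/proc") -> List[int]:
    """Return pid followed by all of its descendants, parents before children."""
    found = [pid]
    task_dir = Path(proc_root, str(pid), "task")
    if not task_dir.is_dir():
        # The process has already exited (or never existed).
        return found
    # Every thread (TID) of the process has its own list of children.
    for thread in sorted(task_dir.iterdir(), key=lambda p: int(p.name)):
        children_file = thread / "children"
        if not children_file.exists():
            continue
        for child in children_file.read_text().split():
            found.extend(collect_tree(int(child), proc_root))
    return found
```

Calling `collect_tree(1234)` on a live system returns the PIDs that a checkpoint of process 1234 would need to cover, assuming nothing forks or exits during the walk.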

RUNC

runc is an Open Container Initiative (OCI) project that runs containers on Linux by setting up namespaces and cgroups (cgroup v2 in this article). It assumes that the RootFS is already mounted. All runc does is create the namespaces and invoke the process that needs to be run. Think of runc as a pass-through layer that lets higher-level components communicate with cgroup v2. It provides built-in checkpoint and restore capabilities. (It is written in Go and compiles to a single binary.)

When using runc to checkpoint a container, runc needs to be able to find the CRIU binary: older releases accept its path as an argument, while newer ones look it up on the PATH. runc invokes CRIU to checkpoint the parent process of the container (which includes all child processes). runc is responsible for making the system calls that freeze and resume the container process.
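
For reference, here is a small sketch of what the corresponding runc invocations look like. As before, it only builds argument lists. The helper names, the container ID, the image path and the bundle directory are placeholders made up for this example. `runc checkpoint` and `runc restore` both accept `--image-path`, and `runc restore` takes `--bundle` for the OCI bundle directory, but confirm the flags against `runc checkpoint --help` for your version.

```python
def runc_checkpoint_cmd(container_id, image_path, leave_running=False):
    """Build a `runc checkpoint` command that writes CRIU images to image_path."""
    cmd = ["runc", "checkpoint", "--image-path", image_path]
    if leave_running:
        # Keep the container running after the checkpoint is taken.
        cmd.append("--leave-running")
    cmd.append(container_id)
    return cmd


def runc_restore_cmd(container_id, image_path, bundle):
    """Build a `runc restore` command that re-creates the container from images."""
    return ["runc", "restore", "--image-path", image_path,
            "--bundle", bundle, container_id]
```

Note that the bundle passed to restore must still contain the RootFS, since the checkpoint image does not include it.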

Summary

CRIU is very stable when it comes to process serialization and memory dumps. However, checkpointing network connections is extremely unstable.

Even after de-duplication, the size of checkpoint images is still highly dependent on the RAM used. For an application using GBs of RAM, the image will likely be GBs in size. The page server and lazy migration help slightly, but the end-to-end time taken by a migration is still measured in seconds, not milliseconds. The container image (RootFS) needs to be transmitted separately.
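
A back-of-the-envelope calculation shows why migration time lands in the seconds range. The sketch below estimates only the raw transfer time of the image over a network link. The 4 GiB image size and the 10 Gbit/s link are illustrative assumptions, not measurements, and real migrations add freeze, dump and restore time on top.

```python
def transfer_seconds(image_bytes: int, link_bits_per_second: float) -> float:
    """Lower bound on the time needed to move a checkpoint image over a link."""
    return image_bytes * 8 / link_bits_per_second


GIB = 1024 ** 3

# Assumed example: a 4 GiB checkpoint image over a 10 Gbit/s link.
estimate = transfer_seconds(4 * GIB, 10e9)
print(f"{estimate:.2f} s")  # prints "3.44 s"
```

Even on a fast link with no protocol overhead, moving a multi-gigabyte image takes seconds.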

Docker supports checkpointing (as an experimental feature), and Kubernetes has introduced native integration of container checkpointing. This enables dynamic relocation, scaling-out, and load-balancing of microservices, as well as fast startup times, forensic analysis, and fault-tolerance of stateful applications.
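
To give a feel for how these integrations are reached, the sketch below builds the Docker CLI commands and the kubelet checkpoint endpoint URL. Docker exposes `docker checkpoint create` and `docker start --checkpoint` when experimental features are enabled, and the kubelet serves checkpoint requests as a POST to /checkpoint/{namespace}/{pod}/{container}. The function names, the node name and the default port 10250 are assumptions for illustration, and a real request also needs kubelet client credentials, which are not shown.

```python
def docker_checkpoint_cmds(container, checkpoint):
    """Build the Docker CLI commands to checkpoint and later resume a container."""
    create = ["docker", "checkpoint", "create", container, checkpoint]
    resume = ["docker", "start", "--checkpoint", checkpoint, container]
    return create, resume


def kubelet_checkpoint_url(node, namespace, pod, container, port=10250):
    """Build the kubelet URL that a checkpoint request is POSTed to."""
    return f"https://{node}:{port}/checkpoint/{namespace}/{pod}/{container}"
```

In both cases the heavy lifting still goes through the container runtime and CRIU underneath.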

Checkpointing accelerated containers requires additional considerations: GPUs work in passthrough mode, so their state is out of reach of the Linux kernel features that checkpointing relies on.

References:

https://criu.org/Main_Page
