Kubernetes Topology Manager & CPU Manager
Some workloads, such as telco, high-performance computing (HPC), and machine learning (ML), require dedicated CPUs, devices, and huge pages memory on the same NUMA node for optimal, low-latency execution.
ML training workloads, for example, can end up blocked on CPU performance because of heavy I/O and data processing, leaving GPU utilization lower than it should be.
Let’s explore what options Kubernetes (K8s) provides for scheduling such demanding workloads.
For workloads where CPU cache affinity and scheduling latency are important, the kubelet allows alternative CPU management policies to determine placement preferences on the node. The K8s CPU Manager is designed to allocate exclusive CPUs to containers in Guaranteed Pods with integer CPU requests.
CORE CONCEPTS:
NUMA: in a non-uniform memory access (NUMA) architecture, each pairing of CPUs and their local memory is called a NUMA node; accessing local memory is faster than accessing memory attached to another node.
K8s CPU Manager
The hardware requirements for a Pod in K8s are defined in the Pod’s specification, including minimum (request) and maximum (limit) resources for CPU and memory.
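For illustration, a minimal Pod spec with requests and limits might look like the sketch below; the name, image, and resource values are arbitrary placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                                 # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/demo:latest      # placeholder image
    resources:
      requests:          # minimum guaranteed to the container
        cpu: "500m"
        memory: "256Mi"
      limits:            # maximum the container may consume
        cpu: "1"
        memory: "512Mi"
```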
When a new Pod is requested, it is placed in a “pending” state and the scheduler assigns it to a worker node based on the Pod’s constraints, the scheduler configuration, and the worker’s available capacity.
Once the Pod has been assigned to a worker node by the scheduler, the local kubelet will create the Pod, and will allocate the requested resources. The kubelet uses the Pod’s specification and its own local config to determine how to allocate the resources.
By default, Pod containers run under the Completely Fair Scheduler (CFS), the Linux kernel’s default scheduler for normal tasks. CFS aims to maximize CPU utilization by sharing CPU time among processes; however, this approach can negatively impact the performance of applications that require exclusive CPU allocation. To address this, K8s offers an alternative: using the CPU Manager to assign exclusive CPUs to Pods.
CPU Manager is a kubelet feature that provides finer-grained control over how CPUs are allocated to workloads. Its behavior is set by assigning one of two values to the kubelet’s cpuManagerPolicy:
- none: the default policy, which keeps the default CFS-based sharing described above
- static: allows the assignment of exclusive CPUs to containers. To qualify for exclusive CPUs, the Pod must be in the Guaranteed QoS class and its container’s CPU request must equal its limit and be an integer value.
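For example, the following container (placeholder names and image) would qualify for exclusive CPUs under the static policy: its CPU request equals its limit and is an integer, and memory request equals limit, which puts the Pod in the Guaranteed QoS class.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-workload                              # placeholder name
spec:
  containers:
  - name: worker
    image: registry.example.com/hpc-app:latest       # placeholder image
    resources:
      requests:
        cpu: "4"          # integer CPU count, equal to the limit
        memory: "8Gi"
      limits:
        cpu: "4"          # request == limit -> Guaranteed QoS, exclusive CPUs
        memory: "8Gi"
```

If the CPU request were fractional (for example 3500m), the Pod would still be Guaranteed but its container would run in the shared pool rather than on exclusive CPUs.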
To maximize the benefits of CPU Manager, it is important to properly isolate the CPUs intended for exclusive use. This can be done either by:
- Isolating those CPUs from the general Linux scheduler using the isolcpus kernel argument
- Using the reservedSystemCPUs kubelet config directive; Pods that do not require guaranteed, exclusive CPUs can be scheduled onto CPUs from this reserved list
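A minimal kubelet configuration fragment combining the static policy with a reserved CPU set might look like the sketch below; the CPU IDs are placeholders and should match the node’s actual topology.

```yaml
# Fragment of a KubeletConfiguration file (values are illustrative)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# CPUs 0 and 1 are kept for the OS, the kubelet, and non-exclusive pods;
# exclusive allocations are handed out from the remaining CPUs.
reservedSystemCPUs: "0,1"
# Alternatively (or additionally), isolcpus=<cpu-list> can be set on the
# kernel command line to isolate CPUs from the general Linux scheduler.
```

Note that changing cpuManagerPolicy on an existing node typically requires draining the node, removing the kubelet’s cpu_manager_state file, and restarting the kubelet.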
The static policy additionally supports three policy options:
- full-pcpus-only: guarantees that CPU threads which share the same physical core on simultaneous multi-threading (SMT, or hyper-threading) systems will not be assigned to different containers (or pods, depending on topology-manager-scope). Sibling threads share the same L1 and L2 caches, so scheduling two isolated processes onto threads of the same physical core could lead to a high cache-miss rate.
- distribute-cpus-across-numa: aims to evenly distribute CPUs across [N+1] NUMA nodes whenever the workload cannot fit within N NUMA nodes, starting from N=1. By default, the static policy tends to fully utilize one NUMA node before moving on to the next, which can leave the workload running on an unbalanced NUMA topology. With this option enabled, if the Pod cannot fit into one NUMA node, its CPUs are distributed evenly across the NUMA nodes it spans.
- align-by-socket: some CPU architectures expose multiple NUMA nodes per socket. By default, the CPU Manager aligns CPU allocation by NUMA boundary and tries to place the workload in as few NUMA nodes as possible. The align-by-socket option broadens the set of preferred CPUs to the socket boundary, while still ensuring that the CPU Manager does not allocate CPUs from different sockets.
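These options are passed through the kubelet’s cpuManagerPolicyOptions map, as sketched below. Depending on the Kubernetes version, some of them may also require feature gates such as CPUManagerPolicyBetaOptions or CPUManagerPolicyAlphaOptions to be enabled.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"              # never split SMT siblings across containers
  distribute-cpus-across-numa: "true"  # spread CPUs evenly when spilling over NUMA nodes
  # align-by-socket: "true"            # alternatively, align allocations by socket boundary
```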
CPU Manager policies are local to each worker node; the K8s scheduler is unaware of the CPU policy configured on a worker, so a Pod can be scheduled onto a specific node and then fail or be terminated because of the CPU policy configured there. There is also no interface that lets a Pod choose the CPU policy it wants. It is therefore important to combine the following:
- Label the worker nodes with their configured CPU policy and use those labels as node selectors in the Pods (see the sketch after this list)
- Use a topology-aware scheduler
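One way to implement the first item is a node label chosen by the cluster administrator, combined with a nodeSelector in the Pod. The label key below is an example convention, not a built-in Kubernetes label.

```yaml
# Label the node first, for example:
#   kubectl label node worker-1 example.com/cpu-policy=static
apiVersion: v1
kind: Pod
metadata:
  name: pinned-workload
spec:
  nodeSelector:
    example.com/cpu-policy: "static"    # schedule only onto nodes labelled this way
  containers:
  - name: worker
    image: registry.example.com/hpc-app:latest   # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: "8Gi"
      limits:
        cpu: "4"
        memory: "8Gi"
```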
K8s Topology Manager
K8s Topology Manager provides a mechanism to coordinate fine-grained hardware resource assignments across the different components in Kubernetes.
Topology Manager allows users to align their CPU and peripheral device allocations by NUMA node. This requires device plugins to be extended to integrate with the Topology Manager.
Topology Manager scope can be at the pod level or the container level (default), and there are four allocation policies. These are controlled via the kubelet flags topology-manager-scope and topology-manager-policy:
- none: the default policy, which does not perform any topology alignment
- best-effort: attempts to align resources optimally on NUMA nodes, but admits the pod even if alignment fails
- restricted: attempts to align resources optimally on NUMA nodes, or pod admission fails
- single-numa-node: attempts to align all resources on a single NUMA node, or pod admission fails
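In the kubelet configuration this is expressed roughly as follows (a sketch using the pod scope and the strictest policy):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerScope: pod                 # container (default) | pod
topologyManagerPolicy: single-numa-node   # none | best-effort | restricted | single-numa-node
```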
A Hint Provider is a component internal to the kubelet that coordinates and aligns resource allocations with the Topology Manager.
To support the Topology Manager, the device plugin API was extended to include new types such as TopologyInfo and NUMANode.
A device plugin that wishes to leverage the Topology Manager can send back a populated TopologyInfo struct as part of device registration. The device manager then uses this information to consult the Topology Manager and make resource assignment decisions.
Memory Manager: the policy can be None (default) or Static. The Static policy can be used to make the following memory reservations and to manage a NUMA-aware memory map:
- Kube-reserved
- System-reserved
- Eviction-threshold
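A sketch of a Static Memory Manager configuration is shown below; the sizes are placeholders, and the per-NUMA reservedMemory entries must add up to the total memory reserved through kubeReserved, systemReserved, and the hard eviction threshold.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
memoryManagerPolicy: Static
kubeReserved:
  memory: "500Mi"
systemReserved:
  memory: "500Mi"
evictionHard:
  memory.available: "100Mi"
# 500Mi + 500Mi + 100Mi = 1100Mi, all reserved on NUMA node 0 in this example
reservedMemory:
- numaNode: 0
  limits:
    memory: "1100Mi"
```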
K8s CPU Manager and Topology Manager can be used together for latency-sensitive, high-performance workloads. Note that the align-by-socket option is not compatible with the Topology Manager’s single-numa-node policy; if the latter is used, the more restrictive alignment is enforced. The full-pcpus-only and distribute-cpus-across-numa options can be combined, and similarly full-pcpus-only and align-by-socket can be used together on the same node.
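Putting it together, a node dedicated to such workloads might combine a compatible set of policies along the lines of the sketch below (not a drop-in configuration; the values depend on the node’s hardware and Kubernetes version):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
reservedSystemCPUs: "0,1"
cpuManagerPolicyOptions:
  full-pcpus-only: "true"
  distribute-cpus-across-numa: "true"    # compatible with full-pcpus-only
topologyManagerScope: pod
topologyManagerPolicy: single-numa-node  # hence align-by-socket is left disabled
memoryManagerPolicy: Static              # also requires the reservedMemory block shown earlier
```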