Xen Virtualization — time to revisit the fundamentals

Hemant Rawat
10 min read · Jun 16, 2024


Image generation: DALL E 3

Virtualization is the process of running a virtual instance of a computer system in a layer abstracted from the actual hardware. These virtual instances go by many names — guests, virtual machines, domains, instances, etc. They run on the same underlying hardware, referred to as the host, and are completely isolated from and unaware of one another.

Virtualization Types

  • OS Virtualization: Containers, Jails, Zones
  • Full Virtualization: the hypervisor completely emulates the behavior of real hardware
  • Hardware Virtual Machine (HVM): uses hardware extensions (with QEMU for device emulation) to implement full virtualization
  • Paravirtualization (PV): the hypervisor provides an API and the guest OS calls that API; it requires guest OS modifications.

The hypervisor is responsible for providing the abstraction layer between the hardware and the virtual instances. It is the software that governs virtualization — it controls the hardware resources and allocates them to each virtual instance as needed, in a coordinated manner.

Generally there are two types of hypervisors:

  1. Type-1: referred to as ‘bare metal’ hypervisors (VMware ESXi, Xen)

They run directly on the host’s hardware to manage guest OSes and do not require a base server OS. They provide direct access to hardware resources and usually offer better performance, scalability, and stability. However, they have more limited hardware support.

  2. Type-2: referred to as ‘hosted’ hypervisors (VMware Workstation, VirtualBox)

Such hypervisors run as software installed on top of a main (host) OS. The host OS makes the hardware calls on their behalf, so they offer better hardware compatibility. However, the extra OS layer adds overhead that affects performance.

This classification is not always clear-cut for modern virtualization systems. It is further muddled by technologies such as KVM and bhyve, kernel modules that turn Linux and FreeBSD into hypervisors resembling Type-1 systems, yet they still run on a general-purpose OS, which could arguably shift them into the Type-2 category.

Types of Virtualization

In the original 32-bit x86 architecture, CPUs had four protection rings (0 through 3). In x86_64, AMD effectively removed two of them. Virtualization, however, needs three privilege levels (user mode, guest kernel, hypervisor), since the guest kernel must still be protected from user mode. Without a spare ring, this means separate address spaces: every system call has to bounce through the hypervisor, which then context-switches into the guest kernel. That context switch requires flushing the TLB, and while the TLB refills, execution is slower, which hurts CPU performance.

To assist virtualization, Intel VT-x and AMD-V introduced a new privileged mode, often described as ring -1, in which the hypervisor lives, allowing the guest kernel to remain in ring 0 and userland in ring 3. These extensions make virtualizing the CPU more efficient — CPU emulation/binary translation is no longer needed, which makes things faster.
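Whether a machine offers these extensions can be checked directly from userspace. The sketch below uses GCC's <cpuid.h> helper and the documented feature bits (leaf 1, ECX bit 5 for Intel VMX; extended leaf 0x80000001, ECX bit 2 for AMD SVM); on Linux you could equally grep /proc/cpuinfo for the vmx or svm flags.

```c
/* Minimal check for hardware virtualization extensions on x86. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Intel VT-x: CPUID leaf 1, ECX bit 5 (VMX). */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5)))
        puts("Intel VT-x (VMX) supported");

    /* AMD-V: CPUID extended leaf 0x80000001, ECX bit 2 (SVM). */
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 2)))
        puts("AMD-V (SVM) supported");

    return 0;
}
```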

XEN Hypervisor

Xen is a Type-1 hypervisor, functioning as a microkernel that enables running multiple instances of different OSes in parallel on a single host. The hypervisor itself is very thin and specialized, performing only core tasks such as partition isolation and memory management, without including an I/O stack or device drivers. The virtualization stack and the hardware-specific device drivers live in a specialized, privileged domain (the control domain, described below).

Xen has a small footprint and a minimal interface to the guest. It doesn’t run POSIX software and has its own system call (hypercall) API. (I/O virtualization is achieved using custom guest OS device drivers.)

Paravirtualization (PV) requires guest operating systems that are aware of Xen and run a PV-enabled kernel with PV drivers. This awareness allows the guest to interact with the hypervisor and run efficiently without emulation or emulated virtual hardware. Paravirtualization provides the following:

  • Disk and Network drivers — PV net and block drivers
  • PV interrupt controllers and timers
  • No emulated hardware
  • Guest boots directly to the kernel, skipping the 16-bit BIOS boot path
  • Privileged instructions are replaced with hypercalls (a hypercall is to a hypervisor what a syscall is to a kernel, i.e. a software trap from Dom0 or a DomU into the hypervisor; a minimal sketch follows this list)
  • Access to page tables is virtualized — the guest’s page tables are kept read-only and validated by Xen (direct paging, in contrast to the shadow page tables used for unassisted full virtualization)
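To make the hypercall idea concrete, here is a rough sketch of the x86-64 PV calling convention: the guest kernel calls into a hypercall page that Xen maps at boot, where entry N lives at byte offset N*32, with arguments in registers and the result in RAX. The macro and wrapper names are invented for illustration; real guests use the wrappers generated from Xen's public headers.

```c
/* Illustrative sketch of an x86-64 PV hypercall, not the exact Linux
 * implementation. Entry N of the Xen-provided hypercall page sits at
 * offset N * 32; arguments travel in RDI, RSI, ... and the result
 * comes back in RAX, much like a syscall into an ordinary kernel. */
#define __HYPERVISOR_sched_op 29        /* hypercall number from xen.h */
#define SCHEDOP_yield          0        /* "give the CPU back for now" */

extern char hypercall_page[];           /* populated by Xen at guest boot */

#define XEN_HYPERCALL2(nr, a1, a2)                                   \
({                                                                   \
    long __res;                                                      \
    register unsigned long __a1 asm("rdi") = (unsigned long)(a1);    \
    register unsigned long __a2 asm("rsi") = (unsigned long)(a2);    \
    asm volatile("call hypercall_page + %c[offset]"                  \
                 : "=a" (__res), "+r" (__a1), "+r" (__a2)            \
                 : [offset] "i" ((nr) * 32)                          \
                 : "memory", "rcx", "r11");                          \
    __res;                                                           \
})

/* Example: politely hand the (virtual) CPU back to the Xen scheduler. */
static inline long xen_yield(void)
{
    return XEN_HYPERCALL2(__HYPERVISOR_sched_op, SCHEDOP_yield, 0);
}
```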

Trap and emulate architecture.

Xen runs in ring 0, and the guest OS in ring 1. Xen sits in the top 64MB of each guest’s address space, and the guest OS traps to Xen to perform privileged operations.

Xen introduces the following two main concepts:

  • Guest Domains (DomU) are virtualized environments, each running its own OS and applications, with no privilege to access hardware or I/O functionality directly; they are therefore also called unprivileged domains.
  • Control Domain (Dom0) is a specialized virtual machine with special privileges: it can access the hardware directly, handles all access to the system’s I/O functions, and interacts with the other virtual machines. Dom0 runs the control/management software.

Xen Architecture

Privileged Domain (Dom0): This is the first domain (virtual machine) started after the hypervisor. As the privileged (control) domain, Dom0 has direct access to the hardware, handles all access to the system’s I/O functions, and interacts with the other, unprivileged domains (DomUs). Dom0 hosts the physical device drivers, providing native hardware support for the Xen system, and it also contains the virtual device drivers (also called backends). Additionally, Dom0 provides:

  • Toolstack (TS) exposing a user interface to a Xen based system
  • XenStore/XenBus (XS) for managing settings
  • Device Emulation (DE) which is based on QEMU in XEN based systems

Toolstack

The Xen Project software employs various toolstacks, each exposing an API that supports a different set of tools or user interfaces. Historically, the default toolstack was xend, but it was deprecated in Xen 4.1. Since then, the default toolstack has been xl, a lightweight command-line toolstack built on libxenlight (libxl).

The Xen toolstack is used to manage guest lifecycles, as well as other things such as:

  • Creation, shutdown, reboot, pause, unpause, migration, and termination
  • Listing and details
  • Management of guest devices — network and block
  • Virtual framebuffer for keyboard and mouse
  • PCI, VGA, SCSI LUN, USB pass-through
  • Management of CPU resources between guests, including CPU pinning, CPU scheduling, and the CPUID features exposed to guests.

XenStore

XenStore is a shared information storage space between domains (Dom0 and DomUs), managed by xenstored, which runs in Dom0. It functions as a filesystem-like database used by Xen applications and drivers to communicate and to store configuration information. The database is managed by Dom0, and guests are given access to it via shared memory. When a guest boots, its start info page contains the address of the shared memory page used to communicate with XenStore. All updates to XenStore are fully transactional and atomic, making it suitable for configuration and status information rather than for large data transfers.

Each domain gets its own path in the store, and the appropriate drivers are notified of changes. Interdomain communication through XenStore is very low level, using virtual IRQs and shared memory segments. XenStore thereby speeds up interdomain communication and stores the relevant configuration; XenBus is used to communicate with it, and it holds information about the status, memory, and devices of each DomU.
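As a rough illustration, XenStore can be read from Dom0 (or another suitably privileged domain) with the libxenstore client library. The key used below is just an example, and the header name varies across Xen releases (<xenstore.h> in recent ones, <xs.h> in older ones):

```c
/* Sketch: reading a key from XenStore via libxenstore (link with -lxenstore). */
#include <xenstore.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct xs_handle *xsh = xs_open(0);          /* connect to xenstored */
    if (!xsh) {
        perror("xs_open");
        return 1;
    }

    unsigned int len = 0;
    /* XBT_NULL: no explicit transaction for a single read. */
    char *value = xs_read(xsh, XBT_NULL, "/local/domain/0/name", &len);
    if (value) {
        printf("domain 0 name: %.*s\n", (int)len, value);
        free(value);
    }

    xs_close(xsh);
    return 0;
}
```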

XenBus and Xenconsoled

Xenified Linux uses XenBus to refer to its specific interface for accessing XenStore, while Xen more generally uses XenBus to refer to the protocol that device drivers use to connect to one another through XenStore. XenBus provides a bus abstraction for paravirtualized (PV) drivers to communicate between domains. In practice, this bus is used for configuration negotiation, with most data transfer done via an interdomain channel consisting of a shared page and an event channel.

Xenconsoled: for PV guests, this daemon provides the ‘console’ used to access the instance (keyboard and display). Its entry can be found in XenStore at:

/local/domain/$DOMID/device/console/$DEVID

If QEMU is used instead, the console information is found in XenStore at:

/local/domain/$DOMID/serial/$SERIAL_NUM/tty
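As a small follow-on to the XenStore example above, resolving that tty path for a given domain could look like the sketch below (serial device 0 is assumed; the caller frees the returned string):

```c
/* Sketch: look up the pty device behind a guest's QEMU serial console
 * by substituting the domain ID into the XenStore path shown above. */
#include <xenstore.h>
#include <stdio.h>

char *console_tty(struct xs_handle *xsh, unsigned int domid)
{
    char path[64];
    unsigned int len = 0;

    snprintf(path, sizeof(path), "/local/domain/%u/serial/0/tty", domid);
    return xs_read(xsh, XBT_NULL, path, &len);   /* NULL if the key is absent */
}
```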

CPU virtualization in Xen

The guest OS voluntarily invokes Xen to perform privileged operations, similar to how system calls are made from a user process to the kernel. The guest pauses while Xen services the hypercall. The guest OS code is modified so that it does not execute privileged instructions directly; any attempt to perform a privileged operation traps to Xen in ring 0.

Communication from Xen to a domain works much like interrupts from hardware to a kernel. This mechanism is used to deliver hardware interrupts and other notifications to the domain, and the domain registers event-handler callback functions to receive them.
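Conceptually, the guest's event upcall is a small dispatcher: it scans the pending-event bitmaps that Xen sets in the shared info page and invokes the handler registered for each port. The sketch below is heavily simplified (the per-VCPU pending selector and the event-mask bitmap from Xen's real shared_info layout are omitted, and the names here are illustrative):

```c
/* Simplified sketch of a PV guest's event-channel upcall. Xen sets bits
 * in a pending bitmap and raises the upcall; the guest clears each bit
 * and runs the handler registered for that port (e.g. netfront/blkfront
 * completion work). Real guests also consult per-VCPU selectors and a
 * mask bitmap, omitted here for brevity. */
#define NR_EVENT_CHANNELS 1024                     /* illustrative sizing */
#define BITS_PER_WORD     (8 * sizeof(unsigned long))

typedef void (*evtchn_handler_t)(unsigned int port);

static evtchn_handler_t handlers[NR_EVENT_CHANNELS];
static unsigned long evtchn_pending[NR_EVENT_CHANNELS / BITS_PER_WORD];

void evtchn_upcall(void)
{
    for (unsigned int word = 0; word < NR_EVENT_CHANNELS / BITS_PER_WORD; word++) {
        /* Atomically grab-and-clear this word of pending events. */
        unsigned long pending =
            __atomic_exchange_n(&evtchn_pending[word], 0UL, __ATOMIC_ACQ_REL);

        while (pending) {
            unsigned int bit  = __builtin_ctzl(pending);
            unsigned int port = word * BITS_PER_WORD + bit;

            pending &= pending - 1;                /* clear lowest set bit */
            if (handlers[port])
                handlers[port](port);
        }
    }
}
```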

Trap handling in Xen

When a trap or interrupt occurs, Xen copies the trap frame onto the guest OS kernel stack and invokes the guest’s interrupt handler. The guest registers an interrupt descriptor table with Xen to handle traps, and Xen validates these interrupt handlers to ensure that no privileged segments are loaded. The guest trap handlers work from the information on the kernel stack, requiring no modifications to the guest OS, except for the page fault handler. The page fault handler needs to read the CR2 register to find the faulting address, which is a privileged operation, so it is modified to read the faulting address from the kernel stack, where Xen places it.

If an interrupt handler itself invokes a privileged operation, it traps to Xen again. Xen detects this “double fault” (a trap followed by another trap from the interrupt-handler code) and terminates the misbehaving guest.
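In code terms, the registration step looks roughly like the sketch below: the guest hands Xen an array of trap descriptors instead of loading a real IDT. The struct layout mirrors Xen's public x86 trap_info; the code-segment value and the hypercall wrapper are placeholders.

```c
/* Sketch: registering guest trap handlers with Xen rather than loading
 * a hardware IDT. Xen validates each entry so that no privileged
 * segment can be installed; the table ends with a zeroed entry. */
#include <stdint.h>

struct trap_info {
    uint8_t       vector;    /* exception vector number             */
    uint8_t       flags;     /* privilege level in the low bits     */
    uint16_t      cs;        /* guest kernel code-segment selector  */
    unsigned long address;   /* handler entry point                 */
};

#define GUEST_KERNEL_CS 0xe030   /* placeholder; a real guest uses its own selector */

extern void page_fault_handler(void);    /* reads the faulting address from the stack */
extern void general_protection(void);
extern long xen_set_trap_table(const struct trap_info *table);  /* hypercall wrapper, assumed */

static const struct trap_info trap_table[] = {
    { 14, 0, GUEST_KERNEL_CS, (unsigned long)page_fault_handler },  /* #PF */
    { 13, 0, GUEST_KERNEL_CS, (unsigned long)general_protection },  /* #GP */
    { 0, 0, 0, 0 }                                                  /* terminator */
};

void install_traps(void)
{
    xen_set_trap_table(trap_table);   /* Xen validates every entry */
}
```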

Memory Virtualization in Xen

One copy of the combined GVA -> HPA page table is maintained by the guest OS, with CR3 pointing to this page table. Unlike shadow page tables, this page table resides in the guest’s memory rather than in the VMM.

The guest is given read-only access to its “RAM” mappings (GPA->HPA). Using these, the guest can construct the combined GVA->HPA mapping.

The guest page table, located in guest memory, is validated by Xen. The guest OS marks its page table pages as read-only, preventing direct modification. When updates are necessary, the guest makes a hypercall to Xen to request updates to the page table. Xen validates these updates, ensuring the guest is only accessing its allocated slice of RAM, and applies them. For enhanced performance, updates are batched.
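A batched page-table update looks roughly like the sketch below: the guest fills an array of update requests (the machine address of each page-table entry plus its new value) and submits them in a single hypercall, which Xen validates before applying. The request layout follows Xen's public interface; the wrapper name is assumed.

```c
/* Sketch: batching PTE updates through Xen. 'ptr' is the machine address
 * of the page-table entry to change (its low bits select the update type)
 * and 'val' is the new entry. Xen checks every update before applying it,
 * so a guest can never map memory it does not own. */
#include <stdint.h>

struct mmu_update {
    uint64_t ptr;   /* machine address of the PTE to update */
    uint64_t val;   /* new PTE contents                     */
};

#define DOMID_SELF 0x7FF0U

/* Hypercall wrapper assumed to exist in the guest kernel. */
extern long xen_mmu_update(struct mmu_update *reqs, unsigned int count,
                           unsigned int *done, unsigned int foreigndom);

int set_ptes(struct mmu_update *reqs, unsigned int count)
{
    unsigned int done = 0;

    /* One trap into Xen covers the whole batch, amortizing the
     * hypercall cost across many page-table writes. */
    if (xen_mmu_update(reqs, count, &done, DOMID_SELF) != 0 || done != count)
        return -1;
    return 0;
}
```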

Segment descriptor tables are maintained similarly: a read-only copy lives in guest memory, and updates are validated and applied by Xen. Segments are truncated to exclude the top 64MB of address space occupied by Xen.

I/O Virtualization in Xen

Beyond QEMU-based device emulation, I/O in Xen involves two primary models:

  1. In the PV split driver model, Xen uses a front-end driver in DomU to send network or block I/O to a back-end driver in Dom0. The back-end driver then communicates with the actual device driver to talk to the physical device. This setup minimizes performance overhead compared to full device emulation with QEMU.
  2. The passthrough model implements PCI passthrough, which assigns a PCI device (NIC, disk controller, HBA, USB controller, FireWire controller, sound card, etc.) directly to a virtual machine guest. This provides the guest with full and direct access to the PCI device. Xen supports a number of flavors of PCI passthrough, including VT-d passthrough and SR-IOV.

PV Drivers & SR-IOV

(Traditional emulation typically involves binary translation, in which a software-based process within the VMM traps hardware calls from guest OS and translates them to make them compatible with the host OS. That translation requires computation that can introduce substantial processing overhead and decrease the overall performance and scalability of the environment. Paravirtualization removes the need for binary translation by building a software interface into the VMM that presents the virtual machines (VMs) with appropriate drivers and other elements that take the place of the dynamically emulated hardware. While paravirtualization typically requires modification of guest operating systems, Intel VT enables Xen, VMware, and other virtualization environments to run many unmodified guest OSs.)

PV Drivers

In Xen’s paravirtualized (PV) environment, a shared I/O ring mechanism facilitates communication between frontend and backend drivers. The frontend driver resides in the guest domain, while the backend driver operates in Dom0.

Here’s how it works: I/O requests initiated by the guest domain are placed into a shared queue. Xen or Dom0 processes these requests, and the corresponding responses are placed back into the shared ring. Descriptors within the queue include pointers to request data, such as DMA buffers containing data for write operations or empty DMA buffers for read operations. This setup efficiently manages data transfer between the guest and Dom0, leveraging shared memory and ring buffers for streamlined communication.
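In code, frontends and backends use the ring macros from Xen's public io/ring.h. The sketch below shows the frontend side placing a request on a shared ring; the request/response structures are invented for illustration (real drivers use e.g. blkif_request_t/blkif_response_t), and the shared page, grant, and event channel are assumed to be set up already.

```c
/* Sketch of a frontend enqueueing a request on a Xen split-driver ring,
 * using the macros from Xen's public io/ring.h (which also expects the
 * surrounding build to supply the usual memory-barrier definitions). */
#include <stdint.h>
#include <xen/io/ring.h>

struct demo_request  { uint64_t id; uint64_t sector; };   /* illustrative */
struct demo_response { uint64_t id; int16_t  status;  };  /* illustrative */

DEFINE_RING_TYPES(demo, struct demo_request, struct demo_response);

void submit_request(demo_front_ring_t *ring, uint64_t sector,
                    void (*kick_backend)(void))
{
    int notify;
    struct demo_request *req =
        RING_GET_REQUEST(ring, ring->req_prod_pvt);    /* next free slot */

    req->id = ring->req_prod_pvt;
    req->sector = sector;
    ring->req_prod_pvt++;

    /* Publish the request; only raise an event if the backend is not
     * already working through the ring, avoiding needless wakeups. */
    RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(ring, notify);
    if (notify)
        kick_backend();    /* e.g. send an event-channel notification */
}
```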

This model has a higher packet-processing overhead than SR-IOV. Shared memory rings alone would require a lot of polling, which is not always efficient, although it can be fast when a large percentage of the polled cases have pending data. The Xen event mechanism removes this need by allowing asynchronous notifications: it is used to tell the back end that a request is waiting to be processed, or to tell a front end that a response is waiting.

Summary

Xen is a paravirtualization-based hypervisor in which the guest operating system is modified to support virtualization. It operates on a trap-and-emulate model through the Virtual Machine Monitor (VMM): the guest runs in ring 1 while the VMM runs in ring 0, and the guest traps to the VMM for privileged operations. Page tables mapping guest virtual addresses (GVA) to host physical addresses (HPA) live in guest memory as read-only copies, and updates to them are made via hypercalls to Xen, which validates them. I/O operations are managed through shared rings between guests and Xen/Dom0.
