Live Migration of Virtual Machines: Seamless OS Migration with Minimal Downtime

Paper: “Live Migration of Virtual Machines” by Christopher Clark et al. (NSDI, 2005)

TL;DR

This paper presents a technique for migrating running virtual machines across physical hosts with remarkably low downtime – as little as 60 milliseconds. Using a pre-copy approach that iteratively transfers memory pages while the VM continues running, combined with dynamic rate limiting and Xen’s shadow page tables for dirty-page tracking, the system enables transparent fault management, load balancing, and system maintenance in data center environments.

Summary

The paper discusses live migration – moving operating system instances across distinct physical hosts – and explains why it is valuable for fault management, load balancing, and low-level system maintenance. It describes how live migration works, focusing on memory migration using a pre-copy technique that transfers memory pages from the source to the destination before the final switchover, as well as how local resources are migrated.
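
To make the pre-copy idea concrete, here is a minimal simulation sketch, not the paper's implementation. Every name and number in it (page counts, write rate, stop threshold, the uniform "hot set" of written pages) is an assumption chosen for illustration. It shows why the final stop-and-copy phase can be short: each round only resends pages dirtied during the previous round, and shorter rounds leave less time for dirtying.

```python
import random

def simulate_precopy(num_pages=10_000, hot_pages=500, pages_per_sec=2_000,
                     writes_per_sec=300, stop_threshold=50, max_rounds=30, seed=0):
    """Toy model of iterative pre-copy (all parameters are hypothetical).

    Returns (pages_sent_per_round, pages_left_for_stop_and_copy, downtime_sec).
    """
    rng = random.Random(seed)
    to_send = num_pages          # round 1 transfers every page
    rounds = []
    while len(rounds) < max_rounds and to_send > stop_threshold:
        rounds.append(to_send)
        duration = to_send / pages_per_sec
        # The guest keeps running during the round; assume its writes land
        # uniformly on a small hot set, so repeated writes hit the same pages.
        writes = int(writes_per_sec * duration)
        dirty = {rng.randrange(hot_pages) for _ in range(writes)}
        to_send = len(dirty)     # only dirtied pages are resent next round
    # Stop-and-copy: the VM is paused while the last dirty pages are sent.
    downtime = to_send / pages_per_sec
    return rounds, to_send, downtime

if __name__ == "__main__":
    rounds, remaining, downtime = simulate_precopy()
    print("pages sent per live round:", rounds)
    print("pages sent while paused:", remaining, "downtime (s):", downtime)
```

The model ignores details a real system must handle, such as pages dirtied again before they are sent within the same round, but the shrinking per-round page counts capture the core of the technique.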

The paper explains the distinction between managed migration and self-migration, then provides a thorough evaluation of OS migration built on top of the Xen VMM. For a well-connected data center or cluster with network-accessed storage, the authors achieved downtime as low as 60 milliseconds for the migration of entire OS instances on commodity hardware. Live migration has since become a standard capability of hypervisors and cloud platforms.

Strengths

Using Xen’s shadow page tables to track dirtying statistics across all pages used by the OS was a clever approach. Rather than building a new tracking mechanism, the authors reused existing infrastructure to monitor writable working sets.
The dynamic rate-limiting technique is well-designed: it avoids saturating the network during the initial transfer of large memory regions, while still allowing maximum bandwidth during the final rounds to minimize stop-and-copy time.
The evaluation is strong. The results and graphs cover several different workloads and effectively demonstrate the low overhead of live migration.
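
The rate-limiting strength above can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes a rule in which each round's bandwidth limit is the dirtying rate observed in the previous round plus a constant increment, bounded by administrator-chosen minimum and maximum limits. The function names and the specific numbers are hypothetical.

```python
def next_round_rate(dirty_rate_mbps, min_mbps, max_mbps, increment_mbps=50.0):
    """Pick the bandwidth limit for the next pre-copy round.

    Returns (rate_mbps, is_final_round). The increment value is an
    illustrative assumption, not a tuned constant.
    """
    wanted = max(min_mbps, dirty_rate_mbps + increment_mbps)
    if wanted > max_mbps:
        # Pre-copying can no longer keep ahead of the dirtying rate within
        # the allowed budget: send the remainder at full speed and stop.
        return max_mbps, True
    return wanted, False

def plan_rates(dirty_rates_mbps, min_mbps=100.0, max_mbps=500.0):
    """Return the bandwidth limit used in each round.

    dirty_rates_mbps[i] is the dirtying rate measured during round i + 1.
    """
    rates = [min_mbps]  # the first round runs at the minimum limit
    for observed in dirty_rates_mbps:
        rate, final = next_round_rate(observed, min_mbps, max_mbps)
        rates.append(rate)
        if final:
            break
    return rates

if __name__ == "__main__":
    print(plan_rates([80.0, 200.0, 600.0]))
```

With the hypothetical measurements above, the plan is 100, 130, 250, and finally 500 Mbit/s: the limit stays low while the guest dirties memory slowly and jumps to the maximum only for the last round.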

Weaknesses

The paper should have explored the after-effects of live migration in more technical depth. It mentions that the VM resumes at the destination and stops at the source, but more information about the challenges of adjusting a migrated VM to its new physical host would have been valuable.
In Section 5.5 on stunning rogue processes, the authors note that one must be careful not to stun important interactive services. However, they provide no criteria for classifying a service as “important.” Figure 7 shows the improvement from this technique, but no data is given about which services were stunned and which were not.
The results are for a 1 Gbps LAN, but the paper includes no experiments on what happens during live migration under adverse network conditions. Data on latency and behavior with low bandwidth, lossy networks, or high data transfer volumes would have strengthened the evaluation.

Discussion Questions

Is live migration possible between physical machines with different ISAs or different hardware configurations (RAM, local disk, network cards, I/O devices)? If Xen is installed on one machine and VMware on another, can self-migration work across them? Must both participating machines be identical?
How is the notion of time maintained between two physical machines? If one machine runs at 1 GHz and the other at 1.5 GHz, the time granularity changes, which may affect timer interrupt frequency. How is this handled during live migration?
Are there any guarantees on CPU time after migration? If a VM is the sole occupant of machine A and receives full CPU utilization, but after migration to machine B it must coexist with three other VMs, will the user experience noticeable delay or slowness?
What is the process for IP address management after live migration when multiple VMs run on the destination machine? Does the destination machine’s IP address change? If so, what are the consequences for VMs already running there?
To log dirtied pages, Xen inserts shadow page tables underneath the running OS. With newer hardware virtualization support and Extended Page Tables (EPT), can the shadow page table method be improved? Could EPT be leveraged to make dirty-page tracking faster?
The rate-limiting technique uses minimum bandwidth to copy memory pages in the first round. If the first round takes 60 seconds at minimum bandwidth but only 10 seconds at maximum bandwidth, would the latter be preferable? A faster initial transfer would leave less time for pages to be dirtied, reducing the amount of data to resend in subsequent rounds.
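
As a rough way to reason about the last question, the sketch below bounds how many distinct pages could be dirtied during the first round. Only the 60-second and 10-second durations come from the question; the write rate and working-set sizes are made-up numbers, and the model optimistically assumes every write touches a new page until the writable working set is exhausted.

```python
def pages_dirtied_in_first_round(round_seconds, writes_per_sec, working_set_pages):
    """Upper bound on distinct pages dirtied while the first round runs.

    The writable working set caps the result: once every page in it has
    been dirtied, a longer round adds nothing more to resend.
    """
    return min(working_set_pages, int(round_seconds * writes_per_sec))

if __name__ == "__main__":
    # Large working set: the faster first round leaves fewer pages to resend.
    print(pages_dirtied_in_first_round(60, 200, 5000))  # 5000
    print(pages_dirtied_in_first_round(10, 200, 5000))  # 2000
    # Small working set: both durations saturate it, so speed gains nothing.
    print(pages_dirtied_in_first_round(60, 200, 1500))  # 1500
    print(pages_dirtied_in_first_round(10, 200, 1500))  # 1500
```

Under these assumptions the answer depends on the workload: a faster first round helps only when the writable working set is too large to be fully dirtied in the shorter window, and it has to be weighed against the impact on running services from saturating the network.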

This review was written as part of CMU’s 18-845: Internet Services course.