Paxos Made Live: Bridging Theory and Production Systems
Paper: “Paxos Made Live - An Engineering Perspective” by Tushar Chandra, Robert Griesemer, and Joshua Redstone (PODC, 2007)
TL;DR
This paper chronicles the hard-won engineering lessons from building a production fault-tolerant database using the Paxos consensus algorithm inside Google’s Chubby lock service. The key takeaway is that the gap between a theoretical algorithm and a working production system is enormous – requiring solutions for state management, testing, performance, and numerous edge cases that the original Paxos paper never addressed.
Summary
The paper describes the authors’ experience building a fault-tolerant database system using the Paxos consensus algorithm. Paxos is employed in Chubby – a fault-tolerant system at Google that provides a distributed locking mechanism and stores small files. Specifically, Paxos is used to achieve fault tolerance through replication within Chubby. The paper discusses the engineering challenges, design decisions, implementation details, testing strategies, and evaluation of the complete system.
The authors ultimately conclude that converting pseudocode into a working production system in the domain of fault-tolerant computing is indeed a formidable challenge.
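To make the gap concrete, here is a minimal sketch of the single-decree Paxos protocol the paper builds on. This is illustrative only: the class and function names are hypothetical, and everything the paper is actually about (networking, disk corruption, leader leases, Multi-Paxos, snapshots) is omitted.

```python
# Minimal single-decree Paxos sketch (illustrative; names are hypothetical).
# Omits networking, persistence, failures, and Multi-Paxos optimizations.

class Acceptor:
    def __init__(self):
        self.promised = 0      # highest proposal number promised so far
        self.accepted = None   # (proposal_number, value) or None

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    # Phase 1: gather promises from a majority of acceptors.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None  # no quorum; retry with a higher proposal number
    # Safety rule: if any acceptor already accepted a value, adopt the
    # value with the highest proposal number instead of our own.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: ask acceptors to accept the (possibly adopted) value.
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes > len(acceptors) // 2 else None
```

Even this toy version shows why later proposers converge on an already-chosen value; almost everything else the paper describes is what it takes to turn this core into a production system.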
Strengths
- The authors provide multiple concrete reasons why building a production system based on a research paper is difficult. These reasons help the reader appreciate the need for a paper like this that documents the engineering perspective.
- The focus on testing is excellent. The testing framework is impressive, as it takes careful design and significant effort to create test cases for such a large distributed fault-tolerant system.
- The clean separation of the design into three main parts – Chubby, the fault-tolerant database, and fault-tolerant logging – is well-motivated. This separation means each component can be reused in other systems.
Weaknesses
- The paper focuses more on design and implementation challenges than on Paxos itself. A title like “Fault Tolerance Using Replication” might have been more accurate, though less recognizable.
- The catch-up mechanism for lagging replicas needs more detail. The authors are not sufficiently clear on how this is achieved in Multi-Paxos.
- There is very little information about the design of the locking mechanism. Given that the system relies heavily on multi-threading, more detail on the concurrency control aspects would have been valuable.
- The term “machine years” is used without a footnote explaining what it means, which could confuse readers unfamiliar with the term.
Discussion Questions
- Who maintains the lease on the replicas? Does the master control the lease? What happens when a replica holding a lease goes down and comes back up? Does it request a lease renewal? How does the renewal process work, and what is the time granularity?
- Regarding group memberships: can there be multiple groups, and if so, can a replica belong to multiple groups simultaneously?
- Since each snapshot is taken independently at each replica, can a replica end up replaying a snapshot from another replica even when some of the operations in that snapshot are already recorded locally? For example, if replica A has records up to sequence 9 and replica B has a snapshot containing sequences 5 through 15, will replica A replay sequences 5 through 15 from B’s snapshot, or only those after 9? What happens when a replica fails while replaying a snapshot? How does it roll back?
- Does the database described in the paper follow the ACID properties (Atomicity, Consistency, Isolation, and Durability)?
- What is the memory overhead of maintaining an additional table for checksums? Is there a mechanism to validate the consistency of the checksum table itself?
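The snapshot question above can be made concrete with a hedged sketch. The paper does not specify the replay behavior; the common design assumed here installs a peer's snapshot wholesale (replacing local state) whenever it is ahead of the locally applied point, so operations covered by the snapshot are never re-executed one by one, and only log entries after the snapshot are replayed. All names here are hypothetical, and state is modeled as a simple key-value dict.

```python
# Hedged sketch of snapshot-based catch-up (hypothetical design, not the
# paper's documented mechanism). State is a key-value dict; each log entry
# is a (key, value) put keyed by its sequence number.

def catch_up(state, local_applied, snapshot_state, snapshot_seq, log):
    """Catch a lagging replica up from a peer's snapshot plus log tail."""
    if snapshot_seq > local_applied:
        # Install the snapshot atomically: local state is replaced, so
        # entries <= snapshot_seq are not replayed individually, even if
        # some of them were already applied locally.
        state = dict(snapshot_state)
        applied = snapshot_seq
    else:
        # Snapshot is behind us; ignore it and keep local state.
        applied = local_applied
    # Replay only log entries strictly after the applied point.
    for seq in sorted(s for s in log if s > applied):
        key, value = log[seq]
        state[key] = value
        applied = seq
    return state, applied
```

In the example from the question, replica A (applied through 9) would install B's snapshot covering through 15 and then replay only entries after 15, sidestepping the double-apply issue; a crash mid-install is typically handled by keeping the old state until the new snapshot is fully written, rather than by rolling back.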
This review was written as part of CMU’s 18-845: Internet Services course.