Paxos Made Live: Bridging Theory and Production Systems
Paper: “Paxos Made Live - An Engineering Perspective” by Tushar Chandra, Robert Griesemer, and Joshua Redstone (PODC, 2007)
TL;DR
This paper chronicles the hard-won engineering lessons from building a production fault-tolerant database using the Paxos consensus algorithm inside Google’s Chubby lock service. The key takeaway is that the gap between a theoretical algorithm and a working production system is enormous – requiring solutions for state management, testing, performance, and numerous edge cases that the original Paxos paper never addressed.
Summary
The paper describes the authors’ experience building a fault-tolerant database system using the Paxos consensus algorithm. Paxos is employed in Chubby – a fault-tolerant system at Google that provides a distributed locking mechanism and stores small files. Specifically, Paxos is used to achieve fault tolerance through replication within Chubby. The paper discusses the engineering challenges, design decisions, implementation details, testing strategies, and evaluation of the complete system.
The authors ultimately conclude that converting pseudocode into a working production system in the domain of fault-tolerant computing is indeed a formidable challenge.
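To make the gap concrete, here is a minimal sketch of the single-decree Paxos protocol the paper builds on. This is illustrative only: the class and function names are hypothetical, and everything the paper is actually about (networking, disk corruption, leader leases, Multi-Paxos, snapshots) is omitted.

```python
# Minimal single-decree Paxos sketch (illustrative; names are hypothetical).
# Omits networking, persistence, failures, and Multi-Paxos optimizations.

class Acceptor:
    def __init__(self):
        self.promised = 0      # highest proposal number promised so far
        self.accepted = None   # (proposal_number, value) or None

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    # Phase 1: gather promises from a majority of acceptors.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None  # no quorum; retry with a higher proposal number
    # Safety rule: if any acceptor already accepted a value, adopt the
    # value with the highest proposal number instead of our own.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: ask acceptors to accept the (possibly adopted) value.
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes > len(acceptors) // 2 else None
```

Even this toy version shows why later proposers converge on an already-chosen value; almost everything else the paper describes is what it takes to turn this core into a production system.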
Strengths
- The authors provide multiple concrete reasons why building a production system based on a research paper is difficult. These reasons help the reader appreciate the need for a paper like this that documents the engineering perspective.
- The focus on testing is excellent. The testing framework is impressive, as it takes careful design and significant effort to create test cases for such a large distributed fault-tolerant system.
- The clean separation of the design into three main parts – Chubby, the fault-tolerant database, and fault-tolerant logging – is well-motivated. This separation means each component can be reused in other systems.
Weaknesses
- The paper focuses more on design and implementation challenges than on Paxos itself. A title like “Fault Tolerance Using Replication” might have been more accurate, though less recognizable.
- The catch-up mechanism for lagging replicas needs more detail. The authors are not sufficiently clear on how this is achieved in Multi-Paxos.
- There is very little information about the design of the locking mechanism. Given that the system relies heavily on multi-threading, more detail on the concurrency control aspects would have been valuable.
- The term “machine years” is used without a footnote explaining what it means, which could confuse readers unfamiliar with the term.
Discussion Questions
- Who maintains the lease on the replicas? Does the master control the lease? What happens when a replica holding a lease goes down and comes back up? Does it request a lease renewal? How does the renewal process work, and what is the time granularity?
- Regarding group memberships: can there be multiple groups, and if so, can a replica belong to multiple groups simultaneously?
- Since each snapshot is taken independently at each replica, can a replica end up replaying a snapshot from another replica even when some of the operations in that snapshot are already recorded locally? For example, if replica A has records up to sequence 9 and replica B has a snapshot containing sequences 5 through 15, will replica A replay sequences 5 through 15 from B’s snapshot, or only those after 9? What happens when a replica fails while replaying a snapshot? How does it roll back?
- Does the database described in the paper follow the ACID properties (Atomicity, Consistency, Isolation, and Durability)?
- What is the memory overhead of maintaining an additional table for checksums? Is there a mechanism to validate the consistency of the checksum table itself?
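The snapshot question above can be made concrete with a hedged sketch. The paper does not specify the replay behavior; the common design assumed here installs a peer's snapshot wholesale (replacing local state) whenever it is ahead of the locally applied point, so operations covered by the snapshot are never re-executed one by one, and only log entries after the snapshot are replayed. All names here are hypothetical, and state is modeled as a simple key-value dict.

```python
# Hedged sketch of snapshot-based catch-up (hypothetical design, not the
# paper's documented mechanism). State is a key-value dict; each log entry
# is a (key, value) put keyed by its sequence number.

def catch_up(state, local_applied, snapshot_state, snapshot_seq, log):
    """Catch a lagging replica up from a peer's snapshot plus log tail."""
    if snapshot_seq > local_applied:
        # Install the snapshot atomically: local state is replaced, so
        # entries <= snapshot_seq are not replayed individually, even if
        # some of them were already applied locally.
        state = dict(snapshot_state)
        applied = snapshot_seq
    else:
        # Snapshot is behind us; ignore it and keep local state.
        applied = local_applied
    # Replay only log entries strictly after the applied point.
    for seq in sorted(s for s in log if s > applied):
        key, value = log[seq]
        state[key] = value
        applied = seq
    return state, applied
```

In the example from the question, replica A (applied through 9) would install B's snapshot covering through 15 and then replay only entries after 15, sidestepping the double-apply issue; a crash mid-install is typically handled by keeping the old state until the new snapshot is fully written, rather than by rolling back.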
This review was written as part of CMU’s 18-845: Internet Services course.