In Search of an Understandable Consensus Algorithm
Diego Ongaro and John Ousterhout • USENIX ATC 2014
The Raft consensus algorithm - a more understandable alternative to Paxos
A curated collection of research papers I find fascinating and insightful. Each paper has shaped my understanding of various technical domains.
Inspired by Arpit Bhayani's paper collection
Papers about distributed systems, consensus algorithms, and scalable architectures
Jeffrey Dean and Sanjay Ghemawat • OSDI 2004
Introduces the MapReduce programming model for processing large data sets on clusters of commodity machines
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung • SOSP 2003
Foundational paper describing GFS - Google's distributed file system
Giuseppe DeCandia, et al. • SOSP 2007
Describes Dynamo, Amazon's highly available key-value store, which popularized eventual consistency and influenced many NoSQL databases
Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed • USENIX ATC 2010
Describes the coordination service used by many distributed systems
James C. Corbett, et al. • OSDI 2012
First globally-distributed database with external consistency guarantees
Jeff Terrace and Michael J. Freedman • USENIX ATC 2009
Extends chain replication with apportioned queries for improved read throughput while maintaining strong consistency
Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel • OSDI 2010
Describes Facebook's efficient photo storage system that handles billions of uploads
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica • NSDI 2012
Introduces RDDs (Resilient Distributed Datasets) - the core abstraction in Apache Spark that enables efficient fault-tolerant distributed computation
Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica • SOSP 2013
Introduces Spark Streaming's micro-batch architecture for scalable stream processing
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, et al. • OSDI 2016
Describes the architecture and implementation of TensorFlow, Google's distributed system for training and serving machine learning models
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, et al. • OSDI 2018
Introduces Ray, a distributed system designed for emerging AI applications requiring flexible distributed computation
Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski • INRIA TR 2011
Introduces CRDTs (Conflict-free Replicated Data Types) that guarantee eventual consistency in distributed systems without coordination
Papers about serverless computing, also known as Function-as-a-Service (FaaS)
Marc Brooker, Mike Danilov, Chris Greenwood, and Phil Piwonka • USENIX ATC 2023
Describes how AWS Lambda loads container images on demand, using block-level lazy loading, deduplication, and tiered caching to keep cold starts fast
Alexandru Agache, Marc Brooker, Andreea Florescu, et al. • NSDI 2020
Describes AWS's innovative microVM technology that powers Lambda and Fargate, focusing on fast startup times and security isolation