Articles on Distributed systems
Last updated: 2023/01/25
Top deep-dives on Distributed systems
ZooKeeper: Wait-free coordination for Internet-scale systems [pdf]
Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed present "a [replicable and centralized] service for coordinating processes of distributed applications" that "incorporates elements from group messaging, shared registers, and distributed lock services".
The Internet Was Designed With a Narrow Waist
Andy Chu explores how the narrow waist concept aims to reduce interoperability issues (using the internet as an example), and elaborates on why it should be applied to the shell and distributed systems.
Distributed System Models in the Real World
Kevin Sookocheff discusses the different components of a system model for a distributed algorithm. These components are processes, communication links, and timing assumptions. Kevin explains how these components can be varied to represent different operating environments, then highlights the benefits of working with a system model.
Resiliency in Distributed Systems
Gergely Orosz discusses two chapters on resiliency from the book "Understanding Distributed Systems" by Roberto Vitillo. The book covers topics such as communication, coordination, scalability, resiliency, and maintainability in large-scale systems.
- Ideally, a network call should be wrapped within a library function that sets a timeout and monitors the request so that we don’t have to remember to do this for each call
- Retrying needs to be slowed down with increasingly longer delays between the individual retries until either a maximum number of retries is reached or enough time has passed since the initial request (can use exponential backoff)
- Circuit breakers are mechanism that detect long-term degradations of downstream dependencies and stop new requests from being sent downstream in the first place
- Load shedding is when a server fails new requests quickly to prevent overloading
- Load leveling is when you transform time-variable inbound requests to a consistent flow to the service using something like a queue
"When multiple nodes share a state, the state needs to be synchronized". Unmesh Joshi's article demonstrates how this can be achieved by "using a write-ahead log that is replicated to all the cluster nodes".
Everything you want to know about the CAP theorem
"CAP in CAP theorem (or Brewer's theorem) stands for Consistency, Availability and Partition tolerance.". Vaibhav Rabber describes the theorem and applies to examples.
- Designing distributed systems, like most engineering, requires considering tradeoffs between, mostly between availability and consistency
Unmesh Joshi provides a solution to keeping clocks in sync for distributed key value stores where versioning is based off of timestamps.
Guide to Projections and Read Models in Event Driven Architecture
Oskar Dudycz presents how projections can be used in event sourcing.
- Event sourcing is when you store events and read models in the same relational database
- Projections can be used to interpret the events and read models
- Projections can be used to find gaps, build a recommendation engine, and more
Database optimisation, analytics and burnout
Mark Veidemanis explains his work on Pathogen, a data analytics pipeline. Mark also mentions Sandstorm, which is a cool open source platform for self-hosting utility apps.
- Concurrency and threads are hard
- Finding the correct tool/library is hard
- "There's always another millisecond to shave off the execution time, but how long are you going to spend doing it?"
A fairly complicated problem with distributed systems is keeping track of where all of the data is stored, while also keeping things performant. This becomes an even bigger issue when there's the potential for a lot of data to be added in the system's lifetime. Unmesh Joshi presents a method of setting up a distributed system with data mapped to node (logical) partitions, keeping the necessary copying of data to a minimum when new nodes are added.
Some key takeaways are:
- Look up keys are the hash of the data item key and its partition
- Instantiate the system with a large number of partitions (3-10x more than the number of nodes)
- Data storage/retrieval is a two step process: find partition of data item -> find cluster node storing partition
- Key hash function should be runtime independent (MD5 or Murmur)
- Redistributing data among the nodes becomes a matter of just copying relatively small partitions
- Use a "dedicated Consistent Core as a coordinator which keeps track of all nodes in the cluster and maps partitions to nodes"
- Meta data should be kept on all cluster nodes so the Consistent Core controller isn't the bottleneck for querying
Troubleshooting Kafka for 2000 Microservices at Wix
Natan Silnitsky underlines three must-have features and two remediation tools for troubleshooting and fixing event-streaming related production issues for microservices using Kafka.
- Features are: trace events flow, easily lookup a specific event payload, investigate “Slow” Consumers root cause, easy events skip/replay, and redistribution of single partition lags
- Trace, Lookup, Longest-Running, Skip, Redistribute, form the Acronym TLLSR
- Distributed systems are hard