Articles on Distributed systems

Last updated: 2023/01/25

Top deep-dives on Distributed systems

ZooKeeper: Wait-free coordination for Internet-scale systems [pdf]

Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed present "a [replicable and centralized] service for coordinating processes of distributed applications" that "incorporates elements from group messaging, shared registers, and distributed lock services".

The Internet Was Designed With a Narrow Waist

Andy Chu explores how the narrow waist concept aims to reduce interoperability issues (using the internet as an example), and elaborates on why it should be applied to the shell and distributed systems.

Distributed System Models in the Real World

Kevin Sookocheff discusses the different components of a system model for a distributed algorithm. These components are processes, communication links, and timing assumptions. Kevin explains how these components can be varied to represent different operating environments, then highlights the benefits of working with a system model.

Resiliency in Distributed Systems

Gergely Orosz discusses two chapters on resiliency from the book "Understanding Distributed Systems" by Roberto Vitillo. The book covers topics such as communication, coordination, scalability, resiliency, and maintainability in large-scale systems.
Some highlights:

  • Ideally, a network call should be wrapped within a library function that sets a timeout and monitors the request so that we don’t have to remember to do this for each call
  • Retrying needs to be slowed down with increasingly longer delays between the individual retries until either a maximum number of retries is reached or enough time has passed since the initial request (can use exponential backoff)
  • Circuit breakers are mechanism that detect long-term degradations of downstream dependencies and stop new requests from being sent downstream in the first place
  • Load shedding is when a server fails new requests quickly to prevent overloading
  • Load leveling is when you transform time-variable inbound requests to a consistent flow to the service using something like a queue

Replicated Log

"When multiple nodes share a state, the state needs to be synchronized". Unmesh Joshi's article demonstrates how this can be achieved by "using a write-ahead log that is replicated to all the cluster nodes".

Everything you want to know about the CAP theorem

"CAP in CAP theorem (or Brewer's theorem) stands for Consistency, Availability and Partition tolerance.". Vaibhav Rabber describes the theorem and applies to examples.
Some highlights:

  • Designing distributed systems, like most engineering, requires considering tradeoffs between, mostly between availability and consistency

Clock-Bound Wait

Unmesh Joshi provides a solution to keeping clocks in sync for distributed key value stores where versioning is based off of timestamps.

Guide to Projections and Read Models in Event Driven Architecture

Oskar Dudycz presents how projections can be used in event sourcing.
Some highlights:

  • Event sourcing is when you store events and read models in the same relational database
  • Projections can be used to interpret the events and read models
  • Projections can be used to find gaps, build a recommendation engine, and more

Database optimisation, analytics and burnout

Mark Veidemanis explains his work on Pathogen, a data analytics pipeline. Mark also mentions Sandstorm, which is a cool open source platform for self-hosting utility apps.
Some highlights:

  • Concurrency and threads are hard
  • Finding the correct tool/library is hard
  • "There's always another millisecond to shave off the execution time, but how long are you going to spend doing it?"

Fixed Partitions

A fairly complicated problem with distributed systems is keeping track of where all of the data is stored, while also keeping things performant. This becomes an even bigger issue when there's the potential for a lot of data to be added in the system's lifetime. Unmesh Joshi presents a method of setting up a distributed system with data mapped to node (logical) partitions, keeping the necessary copying of data to a minimum when new nodes are added.
Some key takeaways are:

  • Look up keys are the hash of the data item key and its partition
  • Instantiate the system with a large number of partitions (3-10x more than the number of nodes)
  • Data storage/retrieval is a two step process: find partition of data item -> find cluster node storing partition
  • Key hash function should be runtime independent (MD5 or Murmur)
  • Redistributing data among the nodes becomes a matter of just copying relatively small partitions
  • Use a "dedicated Consistent Core as a coordinator which keeps track of all nodes in the cluster and maps partitions to nodes"
  • Meta data should be kept on all cluster nodes so the Consistent Core controller isn't the bottleneck for querying

Troubleshooting Kafka for 2000 Microservices at Wix

Natan Silnitsky underlines three must-have features and two remediation tools for troubleshooting and fixing event-streaming related production issues for microservices using Kafka.
Some highlights:

  • Features are: trace events flow, easily lookup a specific event payload, investigate “Slow” Consumers root cause, easy events skip/replay, and redistribution of single partition lags
  • Trace, Lookup, Longest-Running, Skip, Redistribute, form the Acronym TLLSR
  • Distributed systems are hard


Want to see more in-depth content?

subscribe to my newsletter!

Other Articles