Articles on Distributed systems

Last updated: 2022/11/28

Top deep-dives on Distributed systems

ZooKeeper: Wait-free coordination for Internet-scale systems [pdf]

Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed present "a [replicable and centralized] service for coordinating processes of distributed applications" that "incorporates elements from group messaging, shared registers, and distributed lock services".

Database optimisation, analytics and burnout

Mark Veidemanis explains his work on Pathogen, a data analytics pipeline. Mark also mentions Sandstorm, which is a cool open source platform for self-hosting utility apps.
Some highlights:

  • Concurrency and threads are hard
  • Finding the correct tool/library is hard
  • "There's always another millisecond to shave off the execution time, but how long are you going to spend doing it?"

The Internet Was Designed With a Narrow Waist

Andy Chu explores how the narrow waist concept aims to reduce interoperability issues (using the internet as an example), and elaborates on why it should be applied to the shell and distributed systems.

Resiliency in Distributed Systems

Gergely Orosz discusses two chapters on resiliency from the book "Understanding Distributed Systems" by Roberto Vitillo. The book covers topics such as communication, coordination, scalability, resiliency, and maintainability in large-scale systems.
Some highlights:

  • Ideally, a network call should be wrapped within a library function that sets a timeout and monitors the request so that we don’t have to remember to do this for each call
  • Retrying needs to be slowed down with increasingly longer delays between the individual retries until either a maximum number of retries is reached or enough time has passed since the initial request (can use exponential backoff)
  • Circuit breakers are mechanisms that detect long-term degradations of downstream dependencies and stop new requests from being sent downstream in the first place
  • Load shedding is when a server fails new requests quickly to prevent overloading
  • Load leveling is when you transform time-variable inbound requests to a consistent flow to the service using something like a queue
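
The retry and circuit breaker highlights can be sketched in a few lines of Python. This is a minimal illustration and not code from the book or the article: the names retry_with_backoff and CircuitBreaker, the default thresholds, and the added random jitter are all assumptions made for the example.

```python
import random
import time


def retry_with_backoff(call, max_retries=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Call `call()` and retry failures with capped exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted, surface the last error
            # The delay ceiling doubles per attempt, capped at max_delay.
            # Random jitter (an addition here) spreads out client retries.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down has passed."""

    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: request not sent")
            # Cool-down elapsed: allow one trial request (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Wrapping a network call as breaker.call(lambda: retry_with_backoff(fetch)) would combine both ideas, where fetch stands for whatever hypothetical request function sets its own timeout.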

Replicated Log

"When multiple nodes share a state, the state needs to be synchronized". Unmesh Joshi's article demonstrates how this can be achieved by "using a write-ahead log that is replicated to all the cluster nodes".
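
As a rough sketch of the idea, the snippet below has a leader append each entry to its own log, forward it to followers, and treat the entry as committed once a majority of the cluster holds it. The Node and Leader classes are hypothetical and kept in memory; leader election, durable storage, and catching up lagging followers, which a real implementation needs, are left out.

```python
class Node:
    """A follower holding an append-only log (in memory for this sketch)."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.log = []

    def append(self, index, entry):
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        # Accept only the next index so the log never has gaps.
        if index != len(self.log):
            raise ValueError("log gap")
        self.log.append(entry)


class Leader:
    def __init__(self, followers):
        self.log = []
        self.followers = followers
        self.commit_index = -1  # highest index stored on a majority

    def replicate(self, entry):
        index = len(self.log)
        self.log.append(entry)  # the leader writes to its own log first
        acks = 1                # and counts as one copy
        for follower in self.followers:
            try:
                follower.append(index, entry)
                acks += 1
            except (ConnectionError, ValueError):
                pass  # a real system would retry and catch the follower up
        cluster_size = len(self.followers) + 1
        if acks > cluster_size // 2:
            self.commit_index = index
            return True
        return False
```

With one leader and two followers, an entry commits as long as at least one follower acknowledges it, so the cluster tolerates a single unreachable node.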

Everything you want to know about the CAP theorem

"CAP in CAP theorem (or Brewer's theorem) stands for Consistency, Availability and Partition tolerance." Vaibhav Rabber describes the theorem and applies it to examples.
Some highlights:

  • Designing distributed systems, like most engineering, requires considering tradeoffs, mostly between availability and consistency

Clock-Bound Wait

Unmesh Joshi presents a way to cope with clock differences between nodes in distributed key-value stores where versioning is based on timestamps.
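
One way to picture the technique is a store that stamps each write with its local clock and then waits out the largest possible clock offset before making the value visible, so that no other node can still be "behind" that timestamp. The sketch below is an assumption-laden illustration, not the article's code: VersionedStore, MAX_CLOCK_OFFSET, and the 10 ms bound are made up for the example.

```python
import time

# Assumed upper bound on clock difference between any two nodes, in seconds.
MAX_CLOCK_OFFSET = 0.010


class VersionedStore:
    def __init__(self, now=time.time, sleep=time.sleep,
                 max_offset=MAX_CLOCK_OFFSET):
        self.now = now
        self.sleep = sleep
        self.max_offset = max_offset
        self.versions = {}  # key -> list of (timestamp, value)

    def put(self, key, value):
        write_ts = self.now()
        # Wait until even the slowest clock in the cluster must have
        # passed write_ts, then make the version visible.
        self._wait_until(write_ts + self.max_offset)
        self.versions.setdefault(key, []).append((write_ts, value))
        return write_ts

    def get(self, key):
        # Return the value with the highest timestamp.
        return max(self.versions[key], key=lambda tv: tv[0])[1]

    def _wait_until(self, deadline):
        remaining = deadline - self.now()
        if remaining > 0:
            self.sleep(remaining)
```

The cost of the wait grows with the assumed clock bound, which is why the approach relies on that bound being small.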

System Design: Domain Name System (DNS), Load Balancing & Clustering.

Nandan Kumar does a deep dive into the titular topics, covering what each one is, how it works, why it matters, and examples.
Some highlights:

  • DNS is a decentralized naming system that translates human-understandable domain names to machine-understandable Internet Protocol addresses
  • Load balancing lets us distribute incoming network traffic across multiple resources, ensuring high availability and reliability by sending requests only to resources that are available and running
  • A computer cluster is a group of two or more computers, or nodes, that run in parallel to achieve a common goal
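
To make the load balancing highlight concrete, here is a small round-robin balancer that skips backends marked as unavailable. It is only a sketch: RoundRobinBalancer and the IP addresses are hypothetical, and round robin is just one of several possible routing strategies.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            backend = next(self._ring)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")
```

A health checker (not shown) would call mark_down and mark_up as backends fail and recover, so requests only reach resources that are available and running.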

Fixed Partitions

A fairly complicated problem with distributed systems is keeping track of where all of the data is stored, while also keeping things performant. This becomes an even bigger issue when there's the potential for a lot of data to be added in the system's lifetime. Unmesh Joshi presents a method of setting up a distributed system in which data is mapped to logical partitions and partitions are mapped to nodes, keeping the necessary copying of data to a minimum when new nodes are added.
Some key takeaways are:

  • A data item's partition is found by hashing its key and taking the result modulo the fixed number of partitions
  • Instantiate the system with a large number of partitions (3-10x more than the number of nodes)
  • Data storage/retrieval is a two step process: find partition of data item -> find cluster node storing partition
  • Key hash function should be runtime independent (MD5 or Murmur)
  • Redistributing data among the nodes becomes a matter of just copying relatively small partitions
  • Use a "dedicated Consistent Core as a coordinator which keeps track of all nodes in the cluster and maps partitions to nodes"
  • Metadata should be kept on all cluster nodes so the Consistent Core controller isn't the bottleneck for querying
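
The two-step lookup and the cheap rebalancing can be sketched as follows. This is an illustration under assumptions, not the article's implementation: the function names, the modulo mapping, and the simple "move partitions from the fullest owners" rule in add_node are chosen for the example, and the coordinator is reduced to a plain dictionary.

```python
import hashlib
from collections import Counter


def partition_for(key, num_partitions):
    """Step 1: key -> partition, using a runtime-independent hash (MD5)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions


def assign_partitions(num_partitions, nodes):
    """Initial partition -> node map, spread evenly across the nodes."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}


def node_for(key, assignment):
    """Step 2: partition -> node, via the partition table."""
    return assignment[partition_for(key, len(assignment))]


def add_node(assignment, new_node):
    """Move just enough whole partitions to new_node to even out load."""
    counts = Counter(assignment.values())
    target = len(assignment) // (len(counts) + 1)
    moved = []
    for p in sorted(assignment):
        if len(moved) == target:
            break
        owner = assignment[p]
        if counts[owner] > target:
            assignment[p] = new_node
            counts[owner] -= 1
            moved.append(p)
    return moved  # only these partitions need their data copied
```

Because the number of partitions never changes, a key always hashes to the same partition; adding a node only changes which node owns a few partitions.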

Migrating to a Multi-Cluster Managed Kafka with 0 Downtime

Natan Silnitsky discusses how his team migrated Wix’s 2000 microservices from self-hosted Kafka clusters to a multi-cluster managed cloud platform. Natan includes key design decisions, best practices and tips.
Some highlights:

  • Wix had dramatic increases in the number of topics, partitions, and records being generated (~400-500%) within a year
  • Split Kafka clusters by different service-level agreements
  • The migration was done with 0 downtime

