Why DS are challenging

Distributed systems are non-deterministic.

Unlike deterministic systems (same input same output), DS involve multiple independent components communicating over unreliable networks.

This means:

  • Network delays and faults are inevitable
  • Not all system components are reliable
  • Concurrency adds unpredictability

Despite this, reliable systems can be built from unreliable components through careful design of communication patterns (e.g. nuclear power systems).

Problems

1. Network issues

The Problem: Networks are unreliable.

There is no protocol that can guarantee information exchange under all conditions (Two Generals Problem).

Solutions:

  • Monitoring
  • Health checks and heartbeat
  • Timeout mechanisms
  • Retry logic

Idempotent operations - operations that produce the same result when repeated multiple times. Example: HTTP DELETE requests. Safe retries without duplicating state.

2. Trust issues

The Problem: Nodes may fail or misbehave. How do we ensure all nodes agree on system state? (N-Generals Problem)

Consensus: Every node must reach agreement on what has happened in the system.

This is harder than network reliability because some participants may be adversaries (baddies, he said)

Common failure scenario: Leader-based replication

  • Leader is declared dead due to network issues
  • New leader is elected
  • Now you have two leaders with replicated data. Now we have data corruption, first leader is the baddie.

Solutions:

  • Leaderless replication
  • Consensus algorithms

3. Time & Ordering Issues

The Problem: Clocks across different machines are not synchronized.

Ordering events correctly is important for cases like:

  • Auction systems (eBay)
  • Banking transactions
  • Financial records

Clock sync approaches:

  • NTP (Network Time Protocol) - nodes fetch a central time source
  • Atomic clocks - most accurate. Measures time by monitoring the resonant frequency of atoms. It is based on the fact that atoms have quantised energy levels, and transitions between such levels are driven by very specific frequencies of electromagnetic radiation.
  • Time-of-day clocks - system clocks (can drift, can be adjusted)
  • Monotonic clocks - track ordering within a single machine. Always increases, never goes backward. First date, 1 Jan 1970. Example: Unix time on Linux

For distributed ordering:

  • Lamport clocks - logical clocks that provide causal ordering across machines

    • Leslie Lamport also created LaTeX
  • Vector clocks - track causality more precisely than Lamport clocks; useful for detecting concurrent events.

A vector clock is a mapping where the fields represent the parties that have proposed changes, known as “actors”.

  • When a change occurs, a new vector clock is created that acts as a successor to all previously seen vector clocks.
  • If two vector clocks are compared and neither “descends” from the other, they are considered to be in conflict

When using client identifiers, the vector clocks will grow proportionally with the number of clients over time, potentially becoming very large. To control the unbounded growth, vector clocks are often “pruned”.

TIP

Watch Tom Scott’s video on timezones