Introduction

Distributed systems can be classified into three main categories based on how data is processed and the timing of that processing.

The classification depends on:

  • Latency requirements
  • Input patterns - Is data event-driven or scheduled?
  • Processing model - Synchronous interaction vs. asynchronous batch processing

Type 1: Online Systems

Systems designed to respond to user queries in real-time with synchronous request-response patterns.

Characteristics

  • Latency: Low latency required (milliseconds to seconds)
  • Throughput: Variable, depends on user load
  • Concurrency: High - many simultaneous users
  • Input: User-initiated requests (synchronous)
  • Coupling: Tightly coupled request-response

Examples

  • Web servers and APIs (REST, GraphQL)
  • E-commerce platforms
  • Real-time collaboration tools
  • Chat applications

Key Considerations

  • Responsiveness: Must handle user expectations for fast responses
  • Scalability: Need to handle concurrent users (see partitioning for horizontal scaling)
  • Reliability: User-facing, so failures are immediately visible
  • Network resilience: See ds-problems-pad-5nov for handling network delays and failures

Common Patterns

  • Request-response RPC (Remote Procedure Call)
  • REST API design
  • Load balancing across multiple instances

Type 2: Offline Systems

Systems that process large batches of data on a schedule or on-demand, without direct user interaction.

Characteristics

  • Latency: High latency acceptable (minutes to hours)
  • Throughput: High - process large volumes efficiently
  • Concurrency: N/A - single or few batch jobs at a time
  • Input: Scheduled data (does not depend on online user activity)
  • Coupling: Loosely coupled, fire-and-forget

Examples

  • Data warehouse ETL (Extract, Transform, Load)
  • Analytics pipelines and reporting
  • Machine learning model training
  • Data aggregation and cleaning jobs
  • Scheduled backups and maintenance

Key Considerations

  • Throughput optimization: Maximize processing efficiency
  • Resource efficiency: Use resources effectively since jobs run infrequently
  • Data consistency: Can enforce strong guarantees since no concurrent access
  • Data partitioning: See partitioning for distributing large datasets across nodes

Common Patterns

  • MapReduce and batch processing frameworks
  • Scheduled cron jobs
  • Data lakes and warehouses
  • Eventual consistency models

Type 3: Near Real-time Systems

Systems that process continuous streams of data with low-to-moderate latency requirements, often event-driven.

Characteristics

  • Latency: Moderate (seconds to minutes)
  • Throughput: Continuous, unbounded stream
  • Concurrency: Multiple events processing in parallel
  • Input: Continuous stream of events
  • Coupling: Loosely coupled via message brokers

Examples

  • IoT sensor data processing
  • Video stream processing and analytics
  • Real-time event aggregation (clickstreams, logs)
  • Fraud detection systems
  • Real-time dashboards and monitoring

Key Considerations

  • Ordering guarantees: May need to preserve event causality
  • Backpressure handling: What happens when events arrive faster than processing?
  • Fault tolerance: Missed events can cause data loss
  • Exactly-once semantics: Balance between at-least-once and at-most-once delivery

Message Brokers: The Core of Near Real-time Systems

Purpose

intermediaries that manage asynchronous communication between producers (senders) and consumers (receivers). They decouple components so they don’t need direct knowledge of each other.

Architecture Pattern: Pub/Sub

Publisher-Subscriber: Publishers emit messages; brokers route them to interested subscribers.

Benefits:

  • Decoupling: Publishers don’t know about subscribers
  • Scalability: Add new subscribers without changing publishers
  • Flexibility: Subscribers can process messages at their own pace

Message Broker Types

1. Traditional Message Brokers (RabbitMQ)

Architecture: Queue-based

  • Messages are stored in queues
  • Consumer pulls (or broker pushes) messages
  • Message is deleted after consumer acknowledges

How it works:

  1. Producer sends message to queue
  2. Consumer fetches message from queue
  3. Consumer processes and acknowledges
  4. Message is removed from queue

Characteristics:

  • Built-in acknowledgment mechanisms
  • Support for explicit message routing
  • Generally lower storage overhead
  • Good for short-lived, transient messages

Pros:

  • Simple, well-understood model
  • Good acknowledgment guarantees
  • Efficient for low-latency requirements

Cons:

  • Cannot replay past messages (not persisted long-term)
  • Message loss if broker crashes before acknowledgment
  • Less suitable for long-term event streaming

2. Log-based Message Brokers (Kafka)

Architecture: Append-only log

  • Messages are written to an immutable log
  • Consumers maintain offset pointers to track their position
  • Messages persist indefinitely (or until retention period expires)

How it works:

  1. Producer appends message to log
  2. Consumer reads from their last offset
  3. Consumer processes message
  4. Consumer commits their offset (persists progress)
  5. Message remains in log for other consumers or replay

Characteristics:

  • All messages permanently recorded (until deletion)
  • Consumers track their own progress via offsets
  • Multiple consumers can read same message independently
  • Supports message replay

Pros:

  • Replay messages from any point in time
  • Multiple independent consumers
  • Durable, fault-tolerant persistence
  • Better for long-term event sourcing and analytics

Cons:

  • Higher storage overhead (keeping all messages)
  • More complex offset management
  • Slightly higher latency due to persistent writes

Comparison: Traditional vs. Log-based

AspectTraditional (RabbitMQ)Log-based (Kafka)
PersistenceTransient (deleted after consumption)Permanent (append-only log)
ConsumptionMessage deleted after ackMultiple independent consumers
ReplayNot possibleSupported via offset
Use CaseTask queues, command deliveryEvent streaming, analytics
StorageLowerHigher
SemanticsAt-most-once to exactly-onceAt-least-once to exactly-once

Message Broker Enhancements

1. Backpressure

Problem: Producer generates messages faster than consumer can process them. Queue grows unboundedly → memory exhaustion → system crash.

Definition: Mechanism to slow down producers when consumers fall behind.

Implementation patterns:

  • Reactive backpressure: Consumer signals to broker when overwhelmed
  • Rate limiting: Producer throttles message production
  • Queue depth monitoring: Stop accepting messages when queue exceeds threshold

Best practice: Design systems to handle backpressure gracefully rather than just dropping messages.


2. Load Balancing

Purpose: Distribute work across multiple consumer instances.

Strategies:

  • Round-robin: Send messages to consumers in sequence
  • Least-loaded: Send to consumer with fewest pending messages
  • Sticky routing: Keep related messages going to same consumer (preserves ordering)

Example: Multiple service instances consuming from same Kafka topic with different consumer group IDs.


3. Re-delivery

Problem: Consumer crashes after processing but before acknowledging message. Message must be reprocessed.

Implementation:

  • Broker doesn’t remove message until consumer acknowledges
  • Consumer crashes: message automatically re-delivered to another consumer
  • Idempotent operations (see ds-problems-pad-5nov) ensure safe retries

Risk: Duplicate processing if consumer fails after processing but before acknowledgment. Use idempotent operations to make this safe.


4. Dead-letter Channel

Problem: Message causes consumer to crash repeatedly (poison message). Can’t be processed, blocks other messages.

Solution: Route problematic messages to separate “dead-letter” queue.

Process:

  1. Message fails to process N times
  2. Moved to dead-letter queue
  3. Manual inspection and remediation later
  4. Prevents infinite retry loops

Message Broker Patterns

1. Supervisor Pattern

Purpose: Monitor worker processes and handle failures intelligently.

How it works:

  • Supervisor watches multiple workers
  • If worker crashes, supervisor can:
    • Restart it (Erlang “let it crash” philosophy)
    • Route work to healthy workers
    • Escalate if repeated failures

Related: See PAD-intro for process and thread models.


2. Worker Pool Pattern

Purpose: Distribute work across multiple identical workers.

Architecture:

  • Queue of tasks
  • Fixed number of worker processes/threads
  • Each worker pulls task, processes, returns result

Benefit: Bounded concurrency - never overwhelm system with unlimited threads.


3. Aggregator Pattern

Purpose: Combine results from multiple workers into single output.

Example:

  • Scatter request to 10 data sources
  • Each returns partial result
  • Aggregator combines results
  • Returns unified response

4. Batcher Pattern

Purpose: Accumulate stream of individual events into batches, then process batch.

Why useful:

  • Database writes: Inserting 100,000 rows one at a time is slow. Batch insert is much faster.
  • Network efficiency: Send 100 small messages vs. 1 large batch
  • Processing efficiency: Process items together rather than individually

Implementation:

  • Collect events until batch size reached OR timeout expires
  • Whichever comes first
  • Process batch as atomic unit

5. Event Log Pattern

Purpose: Keep immutable record of all significant events that occur in system.

Key characteristics:

  • Append-only: Events never modified or deleted
  • Ordered: Events recorded in sequence
  • Comprehensive: Captures all state changes

Uses:

  • Audit trails: Compliance and forensics
  • Replay: Rebuild system state from events
  • Analytics: Analyze event stream

Related: Log-based message brokers like Kafka are natural fit for event logging.


6. Derived Data Pattern

Purpose: Transform raw event stream into different, optimized views.

Example:

  • Raw event: User clicked “buy” button
  • Derived data 1: Per-user purchase count (for recommendations)
  • Derived data 2: Revenue by category (for analytics)
  • Derived data 3: Fraud risk score (for security)

Advantage: Single source of truth (event log) can feed multiple derived systems without duplication.


Concurrency & Safety: CSP (Communicating Sequential Processes)

1. Fan-in

Definition: Multiple input sources → Single processor.

pad

Pattern:

Producer A ─┐
Producer B ├─→ Aggregator → Output
Producer C ─┘

Use case: Merge multiple data streams into single stream.


2. Fan-out

Definition: Single input → Multiple processors.

Pattern:

                ┌─→ Processor A
Input ──────────┼─→ Processor B
                └─→ Processor C

Use case: Broadcast message to multiple independent consumers.


3. Actor Model

Key principle: Each actor has isolated memory - no shared state.

  • Each actor is independent unit
  • Actors communicate via message passing (not shared variables)
  • By design, eliminates race conditions

Advantage over threads: See PAD-intro for race conditions, mutexes, semaphores.

  • Traditional threads: Shared memory → race conditions → need locks → deadlock risk
  • Actors: Message passing → no shared state → no races → no need for locks

Languages/frameworks:

  • Erlang/Elixir: Built on actor model
  • Akka (JVM): Actor library
  • Go: Goroutines with channels

Why this matters for distributed systems:

  • Scales naturally to distributed network (message passing works remotely too)
  • Fault isolation: One actor crash doesn’t bring down others
  • Concurrency without complexity of locks

Summary Table

FactorOnlineOfflineNear Real-time
LatencyMillisecondsMinutes/hoursSeconds
InputUser queriesBatch dataEvent stream
ThroughputVariableHighContinuous
ConcurrencyHighLowMedium
Example TechREST APIs, gRPCHadoop, SparkKafka, Storm
Best forWeb, mobile appsAnalytics, reportsMonitoring, alerts
Scaling approachHorizontal (more servers)Horizontal (data partitioning)Horizontal (consumer groups)

Key Takeaways

  1. Online systems prioritize responsiveness and concurrency
  2. Offline systems prioritize throughput and cost efficiency
  3. Near real-time systems balance both via message brokers
  4. Message brokers enable loose coupling and fault tolerance
  5. Traditional brokers suit task queues; log-based brokers suit event streaming
  6. Design patterns (supervisor, worker pool, batcher) address common distributed problems
  7. Actor model and CSP provide alternatives to shared-memory concurrency
  8. No single approach - choose based on latency, throughput, and consistency requirements