Introduction
Distributed systems can be classified into three main categories based on how data is processed and the timing of that processing.
The classification depends on:
- Latency requirements
- Input patterns - Is data event-driven or scheduled?
- Processing model - Synchronous interaction vs. asynchronous batch processing
Type 1: Online Systems
Systems designed to respond to user queries in real-time with synchronous request-response patterns.
Characteristics
- Latency: Low latency required (milliseconds to seconds)
- Throughput: Variable, depends on user load
- Concurrency: High - many simultaneous users
- Input: User-initiated requests (synchronous)
- Coupling: Tightly coupled request-response
Examples
- Web servers and APIs (REST, GraphQL)
- E-commerce platforms
- Real-time collaboration tools
- Chat applications
Key Considerations
- Responsiveness: Must handle user expectations for fast responses
- Scalability: Need to handle concurrent users (see partitioning for horizontal scaling)
- Reliability: User-facing, so failures are immediately visible
- Network resilience: See ds-problems-pad-5nov for handling network delays and failures
Common Patterns
- Request-response RPC (Remote Procedure Call)
- REST API design
- Load balancing across multiple instances
Type 2: Offline Systems
Systems that process large batches of data on a schedule or on-demand, without direct user interaction.
Characteristics
- Latency: High latency acceptable (minutes to hours)
- Throughput: High - process large volumes efficiently
- Concurrency: N/A - single or few batch jobs at a time
- Input: Scheduled data (does not depend on online user activity)
- Coupling: Loosely coupled, fire-and-forget
Examples
- Data warehouse ETL (Extract, Transform, Load)
- Analytics pipelines and reporting
- Machine learning model training
- Data aggregation and cleaning jobs
- Scheduled backups and maintenance
Key Considerations
- Throughput optimization: Maximize processing efficiency
- Resource efficiency: Use resources effectively since jobs run infrequently
- Data consistency: Can enforce strong guarantees since no concurrent access
- Data partitioning: See partitioning for distributing large datasets across nodes
Common Patterns
- MapReduce and batch processing frameworks
- Scheduled cron jobs
- Data lakes and warehouses
- Eventual consistency models
Type 3: Near Real-time Systems
Systems that process continuous streams of data with low-to-moderate latency requirements, often event-driven.
Characteristics
- Latency: Moderate (seconds to minutes)
- Throughput: Continuous, unbounded stream
- Concurrency: Multiple events processing in parallel
- Input: Continuous stream of events
- Coupling: Loosely coupled via message brokers
Examples
- IoT sensor data processing
- Video stream processing and analytics
- Real-time event aggregation (clickstreams, logs)
- Fraud detection systems
- Real-time dashboards and monitoring
Key Considerations
- Ordering guarantees: May need to preserve event causality
- Backpressure handling: What happens when events arrive faster than processing?
- Fault tolerance: Missed events can cause data loss
- Exactly-once semantics: Balance between at-least-once and at-most-once delivery
Message Brokers: The Core of Near Real-time Systems
Purpose
intermediaries that manage asynchronous communication between producers (senders) and consumers (receivers). They decouple components so they don’t need direct knowledge of each other.
Architecture Pattern: Pub/Sub
Publisher-Subscriber: Publishers emit messages; brokers route them to interested subscribers.
Benefits:
- Decoupling: Publishers don’t know about subscribers
- Scalability: Add new subscribers without changing publishers
- Flexibility: Subscribers can process messages at their own pace
Message Broker Types
1. Traditional Message Brokers (RabbitMQ)
Architecture: Queue-based
- Messages are stored in queues
- Consumer pulls (or broker pushes) messages
- Message is deleted after consumer acknowledges
How it works:
- Producer sends message to queue
- Consumer fetches message from queue
- Consumer processes and acknowledges
- Message is removed from queue
Characteristics:
- Built-in acknowledgment mechanisms
- Support for explicit message routing
- Generally lower storage overhead
- Good for short-lived, transient messages
Pros:
- Simple, well-understood model
- Good acknowledgment guarantees
- Efficient for low-latency requirements
Cons:
- Cannot replay past messages (not persisted long-term)
- Message loss if broker crashes before acknowledgment
- Less suitable for long-term event streaming
2. Log-based Message Brokers (Kafka)
Architecture: Append-only log
- Messages are written to an immutable log
- Consumers maintain offset pointers to track their position
- Messages persist indefinitely (or until retention period expires)
How it works:
- Producer appends message to log
- Consumer reads from their last offset
- Consumer processes message
- Consumer commits their offset (persists progress)
- Message remains in log for other consumers or replay
Characteristics:
- All messages permanently recorded (until deletion)
- Consumers track their own progress via offsets
- Multiple consumers can read same message independently
- Supports message replay
Pros:
- Replay messages from any point in time
- Multiple independent consumers
- Durable, fault-tolerant persistence
- Better for long-term event sourcing and analytics
Cons:
- Higher storage overhead (keeping all messages)
- More complex offset management
- Slightly higher latency due to persistent writes
Comparison: Traditional vs. Log-based
| Aspect | Traditional (RabbitMQ) | Log-based (Kafka) |
|---|---|---|
| Persistence | Transient (deleted after consumption) | Permanent (append-only log) |
| Consumption | Message deleted after ack | Multiple independent consumers |
| Replay | Not possible | Supported via offset |
| Use Case | Task queues, command delivery | Event streaming, analytics |
| Storage | Lower | Higher |
| Semantics | At-most-once to exactly-once | At-least-once to exactly-once |
Message Broker Enhancements
1. Backpressure
Problem: Producer generates messages faster than consumer can process them. Queue grows unboundedly → memory exhaustion → system crash.
Definition: Mechanism to slow down producers when consumers fall behind.
Implementation patterns:
- Reactive backpressure: Consumer signals to broker when overwhelmed
- Rate limiting: Producer throttles message production
- Queue depth monitoring: Stop accepting messages when queue exceeds threshold
Best practice: Design systems to handle backpressure gracefully rather than just dropping messages.
2. Load Balancing
Purpose: Distribute work across multiple consumer instances.
Strategies:
- Round-robin: Send messages to consumers in sequence
- Least-loaded: Send to consumer with fewest pending messages
- Sticky routing: Keep related messages going to same consumer (preserves ordering)
Example: Multiple service instances consuming from same Kafka topic with different consumer group IDs.
3. Re-delivery
Problem: Consumer crashes after processing but before acknowledging message. Message must be reprocessed.
Implementation:
- Broker doesn’t remove message until consumer acknowledges
- Consumer crashes: message automatically re-delivered to another consumer
- Idempotent operations (see ds-problems-pad-5nov) ensure safe retries
Risk: Duplicate processing if consumer fails after processing but before acknowledgment. Use idempotent operations to make this safe.
4. Dead-letter Channel
Problem: Message causes consumer to crash repeatedly (poison message). Can’t be processed, blocks other messages.
Solution: Route problematic messages to separate “dead-letter” queue.
Process:
- Message fails to process N times
- Moved to dead-letter queue
- Manual inspection and remediation later
- Prevents infinite retry loops
Message Broker Patterns
1. Supervisor Pattern
Purpose: Monitor worker processes and handle failures intelligently.
How it works:
- Supervisor watches multiple workers
- If worker crashes, supervisor can:
- Restart it (Erlang “let it crash” philosophy)
- Route work to healthy workers
- Escalate if repeated failures
Related: See PAD-intro for process and thread models.
2. Worker Pool Pattern
Purpose: Distribute work across multiple identical workers.
Architecture:
- Queue of tasks
- Fixed number of worker processes/threads
- Each worker pulls task, processes, returns result
Benefit: Bounded concurrency - never overwhelm system with unlimited threads.
3. Aggregator Pattern
Purpose: Combine results from multiple workers into single output.
Example:
- Scatter request to 10 data sources
- Each returns partial result
- Aggregator combines results
- Returns unified response
4. Batcher Pattern
Purpose: Accumulate stream of individual events into batches, then process batch.
Why useful:
- Database writes: Inserting 100,000 rows one at a time is slow. Batch insert is much faster.
- Network efficiency: Send 100 small messages vs. 1 large batch
- Processing efficiency: Process items together rather than individually
Implementation:
- Collect events until batch size reached OR timeout expires
- Whichever comes first
- Process batch as atomic unit
5. Event Log Pattern
Purpose: Keep immutable record of all significant events that occur in system.
Key characteristics:
- Append-only: Events never modified or deleted
- Ordered: Events recorded in sequence
- Comprehensive: Captures all state changes
Uses:
- Audit trails: Compliance and forensics
- Replay: Rebuild system state from events
- Analytics: Analyze event stream
Related: Log-based message brokers like Kafka are natural fit for event logging.
6. Derived Data Pattern
Purpose: Transform raw event stream into different, optimized views.
Example:
- Raw event: User clicked “buy” button
- Derived data 1: Per-user purchase count (for recommendations)
- Derived data 2: Revenue by category (for analytics)
- Derived data 3: Fraud risk score (for security)
Advantage: Single source of truth (event log) can feed multiple derived systems without duplication.
Concurrency & Safety: CSP (Communicating Sequential Processes)
1. Fan-in
Definition: Multiple input sources → Single processor.
Pattern:
Producer A ─┐
Producer B ├─→ Aggregator → Output
Producer C ─┘
Use case: Merge multiple data streams into single stream.
2. Fan-out
Definition: Single input → Multiple processors.
Pattern:
┌─→ Processor A
Input ──────────┼─→ Processor B
└─→ Processor C
Use case: Broadcast message to multiple independent consumers.
3. Actor Model
Key principle: Each actor has isolated memory - no shared state.
- Each actor is independent unit
- Actors communicate via message passing (not shared variables)
- By design, eliminates race conditions
Advantage over threads: See PAD-intro for race conditions, mutexes, semaphores.
- Traditional threads: Shared memory → race conditions → need locks → deadlock risk
- Actors: Message passing → no shared state → no races → no need for locks
Languages/frameworks:
- Erlang/Elixir: Built on actor model
- Akka (JVM): Actor library
- Go: Goroutines with channels
Why this matters for distributed systems:
- Scales naturally to distributed network (message passing works remotely too)
- Fault isolation: One actor crash doesn’t bring down others
- Concurrency without complexity of locks
Summary Table
| Factor | Online | Offline | Near Real-time |
|---|---|---|---|
| Latency | Milliseconds | Minutes/hours | Seconds |
| Input | User queries | Batch data | Event stream |
| Throughput | Variable | High | Continuous |
| Concurrency | High | Low | Medium |
| Example Tech | REST APIs, gRPC | Hadoop, Spark | Kafka, Storm |
| Best for | Web, mobile apps | Analytics, reports | Monitoring, alerts |
| Scaling approach | Horizontal (more servers) | Horizontal (data partitioning) | Horizontal (consumer groups) |
Key Takeaways
- Online systems prioritize responsiveness and concurrency
- Offline systems prioritize throughput and cost efficiency
- Near real-time systems balance both via message brokers
- Message brokers enable loose coupling and fault tolerance
- Traditional brokers suit task queues; log-based brokers suit event streaming
- Design patterns (supervisor, worker pool, batcher) address common distributed problems
- Actor model and CSP provide alternatives to shared-memory concurrency
- No single approach - choose based on latency, throughput, and consistency requirements