Intro
Partitioning: each piece of data belongs to exactly one partition, which is what makes the system scalable.
Each partition can be viewed as a small database in its own right, though the system may also support operations that touch multiple partitions. The dataset is spread across many disks and the query load across many processors.
Past a certain scale, it is cheaper and more feasible to scale horizontally by adding more machines than to grow vertically with beefier servers.
A query that operates on a single partition executes independently on that partition, so throughput scales by adding more nodes, and large, complex queries can be parallelized across partitions.
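As a concrete illustration of that fan-out, here is a minimal scatter/gather sketch; the in-memory partitions and the helper names (query_partition, scatter_gather) are assumptions for illustration, not a real database API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "partitions": each holds a slice of the dataset.
partitions = [
    {"alice": 3, "bob": 7},
    {"carol": 1, "dave": 9},
]

def query_partition(partition, predicate):
    """Run the query independently on a single partition."""
    return [v for v in partition.values() if predicate(v)]

def scatter_gather(predicate):
    """Fan the query out to all partitions in parallel, then merge."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda p: query_partition(p, predicate), partitions)
    return [v for chunk in results for v in chunk]

print(scatter_gather(lambda v: v > 2))  # -> [3, 7, 9]
```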
Types
- Replication combined with partitioning (e.g., RAID)
- Partitioning by key range (see the sketch after this list)
  - Locality for range queries
  - Data must be kept sorted by key
  - Prone to hotspots, e.g., timestamp keys send all writes to the latest range
- Partitioning by hash of key
  - Can alleviate hotspots
  - Range queries are hard
- Secondary indexes
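A hedged sketch contrasting the two schemes above, assuming three nodes and made-up range boundaries; MD5 is used only as a deterministic stand-in for whatever hash function a real system picks.

```python
import bisect
import hashlib

NODES = ["node0", "node1", "node2"]

# Key-range partitioning: each node owns a contiguous, sorted key range,
# so a range scan touches only a few nodes (good locality), but a popular
# range becomes a hotspot.
BOUNDARIES = ["c", "m"]  # node0: keys < "c", node1: keys < "m", node2: the rest

def range_partition(key: str) -> str:
    return NODES[bisect.bisect(BOUNDARIES, key)]

# Hash partitioning: a deterministic hash (not Python's salted hash())
# spreads keys evenly, smoothing hotspots, but adjacent keys scatter,
# so a range query must be sent to every node.
def hash_partition(key: str) -> str:
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

for key in ["apple", "cherry", "melon"]:
    print(key, "->", range_partition(key), "|", hash_partition(key))
```

Note how adjacent keys stay together under range partitioning but scatter under hashing, which is exactly why range queries are hard in the hashed scheme.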
Rebalancing
- The data distribution is not uniform, e.g., a particular ZIP code has so many places that they cannot fit into one database partition.
- There is too much load on a shard, e.g., too many requests hit the DB shard dedicated to user photos.
Either we have to create more DB shards or rebalance the existing ones. This means the partitioning scheme changes and all existing data moves to new locations. Doing this without incurring downtime is extremely difficult. A scheme like directory-based partitioning makes rebalancing a more palatable experience, at the cost of increasing the complexity of the system and creating a new single point of failure (i.e., the lookup service/database).
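A minimal sketch of that directory-based approach, assuming an in-process dict stands in for the lookup service; the names (directory, shards, move) are hypothetical.

```python
# Hypothetical directory (lookup) service: maps each key to its shard.
# In practice this is its own highly available store, and it is the new
# single point of failure mentioned above.
directory = {"user:1": "shard_a", "user:2": "shard_a", "user:3": "shard_b"}
shards = {"shard_a": {}, "shard_b": {}, "shard_c": {}}

def put(key, value):
    shards[directory[key]][key] = value

def move(key, new_shard):
    """Rebalance one key: copy the data, then flip the directory entry."""
    old_shard = directory[key]
    if key in shards[old_shard]:
        shards[new_shard][key] = shards[old_shard].pop(key)
    directory[key] = new_shard  # readers now go to the new location

put("user:1", {"photos": 42})
move("user:1", "shard_c")  # only this key's data moves; the rest stays put
print(directory["user:1"], shards["shard_c"])
```

Rebalancing then becomes a data copy plus a directory update rather than a scheme change, but every read and write now depends on the directory being reachable.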
Triggers
- Query volume grows
- Data volume grows
- A machine fails
Types of rebalancing
- Static load balancing, which uses fixed rules for distribution
- Dynamic load balancing, which adjusts based on real-time server conditions and traffic (both contrasted in the sketch below)
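A small sketch contrasting the two, with simulated load counters; static_route and dynamic_route are illustrative names, not a real load balancer's API.

```python
import hashlib
import random

NODES = ["node0", "node1", "node2"]
inflight = {n: 0 for n in NODES}  # simulated per-node load

def static_route(key: str) -> str:
    """Static: a fixed rule; the same key always lands on the same node,
    regardless of how busy that node currently is."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def dynamic_route() -> str:
    """Dynamic: inspect real-time load and pick the least-loaded node."""
    return min(inflight, key=inflight.get)

for _ in range(9):
    node = dynamic_route()
    inflight[node] += random.randint(1, 3)  # pretend each request costs this much
print(inflight)  # counts stay roughly even across nodes
```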