While replication is sufficient when the full dataset fits on each of your machines, larger datasets require partitioning: breaking the data into pieces called partitions
- Note: there are lots of synonyms for partition: shard (MongoDB/Elasticsearch/SolrCloud), region (HBase), tablet (Bigtable), vnode (Cassandra/Riak)
Why do we want to do this? Mostly for scalability
- You can distribute a large dataset across many disks by putting different partitions on different nodes in a shared-nothing cluster
- If you need to query a single partition, that node can independently execute the query
Partitioning and Replication
It’s important to note that partitioning is almost never employed alone; it is usually combined with replication.
What does this mean?
- Records will be stored on several nodes for fault tolerance
- Additionally, nodes can store several partitions
This leads to interesting architectures: for example, in a leader-follower replication model, a single node will often hold the leader for some partitions and followers for other partitions!
Partitioning Key-Value Data
How does one decide how to partition data?
Remember, our goal is to spread the query load and data evenly across the nodes
The naive approach would be to randomly assign records to nodes. The data would certainly be evenly distributed, but can you think of any issues with this approach?
- You wouldn’t know which node the item is on! You’d have to query all the nodes!
Let’s explore some better methods.
Partitioning by Key Range
Method of partitioning: assign each partition a separate, continuous range of keys (like a library shelving books alphabetically)
How to choose the boundaries? Either
- Administrator
- Or automatically by the database (see Dynamic partitioning under Rebalancing, below)
Cons
- Can lead to hot spots: the boundaries don’t necessarily align with query volume (e.g., timestamp keys concentrate all of today’s writes on one partition)
Pros
- Keys can be kept in sorted order within each partition, which is great for range queries (see the sketch below)
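A minimal sketch of key-range partitioning, assuming boundaries chosen by hand (the BOUNDARIES values and key names here are made up): looking up a key only needs the one partition whose range contains it, and a range query only touches the partitions that overlap the requested range.

```python
import bisect

# Hypothetical partition boundaries, e.g. chosen by an administrator.
# Partition 0 holds keys below "g", partition 1 holds ["g", "n"), and so on.
BOUNDARIES = ["g", "n", "t"]   # 4 partitions in total

def partition_for_key(key: str) -> int:
    """Index of the partition whose key range contains `key`."""
    return bisect.bisect_right(BOUNDARIES, key)

def partitions_for_range(start: str, end: str) -> range:
    """A range query only has to visit the partitions overlapping [start, end]."""
    return range(partition_for_key(start), partition_for_key(end) + 1)

print(partition_for_key("alice"))              # 0
print(partition_for_key("mallory"))            # 1
print(list(partitions_for_range("h", "p")))    # [1, 2]: the scan stays on two partitions
```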
Partitioning by Hash of Key
Method of partitioning: partition based on separate continuous ranges of hashes of keys
How to choose a hash function?
- Goal of hash function: take skewed data and uniformly distribute it
- Doesn’t need to be cryptographically strong
- Real-world hash functions: MD5 (Cassandra, MongoDB), Fowler-Noll-Vo (Voldemort)
Pros
- Good at distributing keys fairly among partitions
Cons
- Terrible for range queries: keys that were adjacent are now scattered across partitions (see the sketch below)
Note: consistent hashing is what we call hash partitioning when the partition boundaries are chosen pseudorandomly. It works poorly for databases and is rarely used in practice, so it’s clearer to just say hash partitioning.
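A minimal sketch of hash partitioning in the same spirit: each partition owns an equal, contiguous slice of the hash space. The partition count is illustrative, and MD5 is used only because the notes mention it; any reasonably uniform hash works.

```python
import hashlib

NUM_PARTITIONS = 8   # hypothetical partition count

def key_hash(key: str) -> int:
    # Doesn't need to be cryptographically strong; MD5 is used here only because the notes mention it.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

def partition_for_key(key: str) -> int:
    # Equivalent to floor(hash * NUM_PARTITIONS / 2**64): equal-sized, contiguous hash ranges.
    return key_hash(key) * NUM_PARTITIONS >> 64

for key in ["user:1", "user:2", "user:3", "user:1"]:
    print(key, "->", partition_for_key(key))
# The same key always maps to the same partition, but adjacent keys scatter,
# which is exactly why range queries over the original keys no longer stay local.
```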
Relieving Hot Spots
So hashing a key helps reduce hot spots, but it can’t eliminate them: if all activity targets a single key, every request still lands on the same partition.
What should you do if you end up with a highly skewed workload? (e.g. a celebrity joins your burgeoning social media platform)
Sadly, most modern databases can’t automatically compensate for this kind of heavy skew, so you’ll have to handle it in your application, for example by splitting a hot key across several sub-keys (sketched below)
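One common application-level workaround (not something the database does for you) is to split a known hot key into several sub-keys by appending a small random suffix on writes, then merging the sub-keys on reads. The hot-key set, suffix format, and fan-out below are all hypothetical.

```python
import random

HOT_KEYS = {"celebrity_user_42"}   # hypothetical: hot keys tracked by the application itself
FANOUT = 16                        # writes to a hot key are spread over 16 sub-keys

def write_key(key: str) -> str:
    """Physical key to write to: hot keys get a random suffix so writes spread across partitions."""
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(FANOUT)}"
    return key

def read_keys(key: str) -> list[str]:
    """Reads of a hot key must fetch every sub-key and merge the results (extra work)."""
    if key in HOT_KEYS:
        return [f"{key}#{i}" for i in range(FANOUT)]
    return [key]

print(write_key("celebrity_user_42"))       # e.g. celebrity_user_42#7
print(len(read_keys("celebrity_user_42")))  # 16: the read side pays the cost
```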
Partitioning and Secondary Indexes
Remember: a secondary index lets you look up records by a value other than the primary key
But how do we partition data when we have secondary indexes, which don’t usually identify records uniquely?
2 main approaches to partitioning a database with secondary indexes
- Document-based
- Term-based
Partitioning Secondary Indexes by Document
Example scenario:
- Website for selling used cars
- You’ve partitioned the database by a unique document ID
- You want to let users filter by color and make, so you need a secondary index on color and make
The document-based approach
- Each partition maintains its own secondary index (a local index)
- The secondary index maps a field value (e.g., color) to the document IDs in that partition
Con
- You’ll have to query all partitions if you want to search for a specific color or make (unless you’ve done something special in the setup)
Note: this approach to querying a partitioned database is sometimes referred to as scatter/gather, and it can make secondary-index reads expensive (sketched below)
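A minimal sketch of document-partitioned (local) secondary indexes for the used-car example; all class and field names are illustrative. Each partition indexes only its own documents, so a query by color has to scatter to every partition and gather the results.

```python
from collections import defaultdict

class Partition:
    """One partition: primary storage keyed by document ID plus a local secondary index."""
    def __init__(self):
        self.docs = {}                      # doc_id -> document
        self.by_color = defaultdict(set)    # local index: color -> doc_ids in THIS partition only

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc
        self.by_color[doc["color"]].add(doc_id)   # the index update stays local

partitions = [Partition() for _ in range(4)]

def put(doc_id, doc):
    partitions[doc_id % len(partitions)].put(doc_id, doc)   # partitioned by document ID

def find_by_color(color):
    # Scatter/gather: the index is local to each partition, so every partition must be asked.
    return [p.docs[i] for p in partitions for i in p.by_color.get(color, ())]

put(101, {"color": "red", "make": "Honda"})
put(202, {"color": "red", "make": "Ford"})
print(find_by_color("red"))   # touches all 4 partitions even though only 2 hold matches
```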
Partitioning Secondary Indexes by Term
The term-based approach
- A global index which covers data in all partitions
- This index must be partitioned
- The index can be partitioned by the term itself (good for range scans over terms) or by a hash of the term (spreads load more evenly)
Pros
- More efficient reads (no need for scatter/gather)
Cons
- Writes are slower and more complicated: a single document write may touch several partitions of the index, so global index updates are often asynchronous (see the sketch below)
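For contrast, a minimal sketch of a term-partitioned (global) index under the same made-up scenario: each term (such as color:red) lives on exactly one index partition, so a read hits a single partition, while one document write may have to update several index partitions.

```python
from collections import defaultdict
import hashlib

NUM_DATA_PARTITIONS = 4
NUM_INDEX_PARTITIONS = 4

data = [dict() for _ in range(NUM_DATA_PARTITIONS)]                     # doc_id -> document
global_index = [defaultdict(set) for _ in range(NUM_INDEX_PARTITIONS)]  # term -> doc_ids

def index_partition(term: str) -> int:
    # The global index is itself partitioned, here by a hash of the term.
    return int(hashlib.md5(term.encode()).hexdigest(), 16) % NUM_INDEX_PARTITIONS

def put(doc_id: int, doc: dict):
    data[doc_id % NUM_DATA_PARTITIONS][doc_id] = doc
    # A single write may touch several index partitions (one per indexed term),
    # which is why global-index writes are slower and often applied asynchronously.
    for term in (f"color:{doc['color']}", f"make:{doc['make']}"):
        global_index[index_partition(term)][term].add(doc_id)

def find(term: str) -> set:
    # The read goes to exactly one index partition: no scatter/gather.
    return global_index[index_partition(term)][term]

put(101, {"color": "red", "make": "Honda"})
put(202, {"color": "red", "make": "Ford"})
print(find("color:red"))   # {101, 202}, served from a single index partition
```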
Rebalancing Partitions
As we all know, nothing good in life lasts. The same is true in databases. Here are some changes you’ll need to account for:
- Query throughput increases → you’ll want to add more CPUs
- Dataset size increases → you’ll want to add more disks and RAM
- Machine fails → other machine needs to take over
But how can you do this when your data is spread across distributed nodes? Through a process called rebalancing: moving load from one node in the cluster to another.
What are the goals for rebalancing?
- (i) Load (data, reads, and writes) should be shared fairly across nodes after rebalancing
- (ii) No downtime: the database should keep accepting reads and writes while rebalancing
- (iii) Move no more data between nodes than necessary
Strategies for Rebalancing
Hash mod N (don’t do it this way!)
The approach: Assign records to partitions based on their hash mod N
Why is this terrible?
- As N (the number of nodes) changes, most keys map to a different node, so nearly all of the data has to be moved (see the sketch below)!
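A quick sketch of why mod-N assignment rebalances so badly: counting how many keys land on a different node when the node count goes from 10 to 11 (the key names are arbitrary).

```python
import hashlib

def node_for(key: str, n: int) -> int:
    # Assign each record to a node by hash mod N (the approach being criticized).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(node_for(k, 10) != node_for(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys change nodes going from 10 to 11 nodes")
# Roughly 90% of keys move: almost the entire dataset gets shuffled over the network.
```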
Fixed number of partitions
The approach:
- Create many more partitions than there are nodes, and fix that number up front
- Assign several partitions to each node
- When a new node joins, have it steal some partitions from the existing nodes until partitions are spread evenly again (sketched after this list)
Real-world use: Riak, Elasticsearch, Couchbase, Voldemort
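A minimal sketch of the fixed-partitions scheme; the partition count, node names, and stealing policy are all made up. The key point is that the key-to-partition mapping never changes; only whole partitions get reassigned to the new node.

```python
from collections import defaultdict

NUM_PARTITIONS = 256   # fixed at setup time, far more partitions than nodes

# Keys map to partitions once and for all; only the partition -> node assignment changes.
owner = {p: f"node{p % 3}" for p in range(NUM_PARTITIONS)}   # 3 initial nodes

def add_node(owner: dict, new_node: str) -> list:
    """The joining node steals whole partitions from the busiest nodes until load is even."""
    fair_share = NUM_PARTITIONS // (len(set(owner.values())) + 1)
    load = defaultdict(list)
    for p, n in owner.items():
        load[n].append(p)
    stolen = []
    while len(stolen) < fair_share:
        busiest = max(load, key=lambda n: len(load[n]))
        p = load[busiest].pop()
        owner[p] = new_node        # only ownership changes; no keys are re-hashed
        stolen.append(p)
    return stolen

moved = add_node(owner, "node3")
print(f"only {len(moved)} of {NUM_PARTITIONS} partitions move to the new node")
```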
Dynamic partitioning
The approach: divide a partition into two when it exceeds a configured size. The converse holds as well: when a partition shrinks below some threshold, merge it with an adjacent partition
Pro: number of partitions adapts to data volume
Caveat: an empty database must start with a single partition (there’s no prior information about where to draw the boundaries), so initially all writes are handled by a single node (sketched below)
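A minimal sketch of dynamic partitioning with a toy size threshold (real systems split at hundreds of megabytes): a partition splits at its median key once it grows too large, so the partition count tracks the data volume, and the sketch indeed starts from a single partition.

```python
SPLIT_THRESHOLD = 4   # tiny for illustration; real systems split around a configured size

class Partition:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi   # this partition covers keys in [lo, hi)
        self.rows = {}

partitions = [Partition("", "\uffff")]   # an empty database starts as ONE partition

def split(p):
    # Split at the median key so each half gets roughly half the data.
    keys = sorted(p.rows)
    mid = keys[len(keys) // 2]
    right = Partition(mid, p.hi)
    right.rows = {k: v for k, v in p.rows.items() if k >= mid}
    p.rows = {k: v for k, v in p.rows.items() if k < mid}
    p.hi = mid
    partitions.append(right)   # the new half could now be handed to another node

def put(key, value):
    p = next(p for p in partitions if p.lo <= key < p.hi)
    p.rows[key] = value
    if len(p.rows) > SPLIT_THRESHOLD:
        split(p)

for i in range(20):
    put(f"key{i:02d}", i)
print(len(partitions), "partitions after 20 writes")   # grew from 1 as data volume grew
```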
Partitioning proportionally to nodes
The approach: have a fixed number of partitions per node; the size of each partition then grows proportionally to the dataset size while the node count stays the same, and adding nodes makes partitions smaller again (sketched below).
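A rough sketch of the node-proportional idea in a Cassandra-like style, heavily simplified and with made-up numbers: every node owns a fixed number of ranges on a hash ring, so adding nodes adds partitions and keeps the average partition size tied to the data per node.

```python
import random

PARTITIONS_PER_NODE = 8   # small for illustration; Cassandra defaults to 256 vnodes per node

# Hash ring: each (token, node) pair marks the end of a hash range owned by that node.
ring = sorted((random.randrange(2**64), "node0") for _ in range(PARTITIONS_PER_NODE))

def add_node(ring, new_node):
    """A joining node picks a fixed number of random tokens, splitting existing ranges
    and taking over roughly its fair share of the data."""
    for _ in range(PARTITIONS_PER_NODE):
        ring.append((random.randrange(2**64), new_node))
    ring.sort()

add_node(ring, "node1")
add_node(ring, "node2")
print(len(ring), "partitions across 3 nodes")   # partition count scales with node count
```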
Operations: Is Rebalancing Automatic?
Does rebalancing happen automatically or manually?
Generally, it is good to have a human in the loop for rebalancing
- Fully automated rebalancing is convenient (less operational work), but it can be unpredictable and kick in at a bad time
- Rebalancing is expensive: it reroutes requests and moves large amounts of data between nodes, so it’s important to be careful
Request Routing
We have now partitioned our dataset across nodes on multiple machines
But how does a client know which node to connect to? The assignment of partitions to nodes keeps changing
There are a few approaches we can take
- Let the client contact any node (the node acts as an intermediary)
- If that node owns the partition, it handles the request
- Otherwise, it forwards the request to the appropriate node and relays the reply back
- Send requests to a routing tier first (a partition-aware load balancer)
- The routing tier determines which node should handle each request and forwards it there (sketched after this list)
- Make the client aware of the partitioning, so it can connect to the right node directly
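A minimal sketch of the routing-tier option; the partition count, node addresses, and method names are hypothetical. The router owns a partition-to-node map (how it learns about changes is the coordination question below) and simply forwards each request to the owning node.

```python
import hashlib

class RoutingTier:
    """Clients send every request here; the router forwards it to the node owning the key."""
    def __init__(self, num_partitions: int, assignment: dict):
        self.num_partitions = num_partitions
        self.assignment = assignment    # partition -> node address; must be kept up to date

    def node_for(self, key: str) -> str:
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.assignment[h % self.num_partitions]

    def on_rebalance(self, new_assignment: dict):
        # Called when the coordination service (e.g. ZooKeeper) reports a new mapping.
        self.assignment = new_assignment

router = RoutingTier(4, {0: "10.0.0.1", 1: "10.0.0.2", 2: "10.0.0.1", 3: "10.0.0.3"})
print(router.node_for("user:42"))   # the client never needs to know the partition layout
```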
But still: how do the components that make the routing decision actually learn about the changing assignments of partitions to nodes?
- Separate coordination system (e.g. Zookeeper)
- Nodes register themselves in ZooKeeper, and ZooKeeper maintains the mapping of partitions to nodes
- The routing tier subscribes to this information in ZooKeeper and is notified when the mapping changes
- Gossip protocol among the nodes themselves (e.g., Cassandra and Riak): any node can accept a request and forward it to the right node
In sum
Partitioning
- Goal: spread data + query load evenly across machines, avoiding hot spots
- Necessary when: storing + processing data on a single machine isn’t feasible
2 main approaches
- Key Range Partitioning
- Hash Partitioning
Secondary indexes also need to be partitioned. Two approaches:
- Document-partitioned indexes
- Term-partitioned indexes
Query routing