Recall, replication = keeping a copy of the same data on multiple machines connected via a network
- Keeps data close to your users
- Increases availability
- Increases read throughput
Note: for the purposes of this section on replication, we’ll assume your dataset is small enough for each machine to hold a copy of the entire dataset
Where might issues arise with keeping multiple copies of the same data in a distributed fashion? Well, the hard part, of course, is handling changes to the replicated data. There are three common algorithms for handling this between nodes, which we’ll look into now:
- Single-leader
- Multi-leader
- Leaderless
Single Leader
The general process for a single leader architecture is as follows:
- For writing from a client
- Writes must be sent to the leader
- Only this single leader is allowed to write to the database.
- After writing, the leader sends the change to all its followers (via something called a replication log). Each replica updates its local data
- For reading from a client
- Clients can read from the database by querying any of the followers or the leader
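To make the flow concrete, here’s a minimal sketch of that write path in Python (all class and method names are illustrative, not any real database’s API):

```python
# Minimal sketch of single-leader replication: the leader applies a write
# locally, appends it to a replication log, and pushes it to followers;
# reads can be served by the leader or any follower.

class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, change):
        key, value = change
        self.data[key] = value


class Leader:
    def __init__(self, followers):
        self.data = {}
        self.replication_log = []       # ordered list of changes
        self.followers = followers

    def write(self, key, value):
        change = (key, value)
        self.data[key] = value          # apply locally first
        self.replication_log.append(change)
        for f in self.followers:        # propagate via the replication log
            f.apply(change)

    def read(self, key):
        return self.data.get(key)


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", "Ada")
print(followers[0].data["user:1"])      # reads can hit any replica
```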
Synchronous vs Asynchronous replication
Asynchronous replication: the leader sends the change but doesn’t wait for a response from the replica. Synchronous replication: the leader waits for confirmation from the replica before reporting success.
It would clearly be quite the feat to have a fully synchronous system at a large scale - you’d have to wait for each and every node to report back success, so a single unavailable replica would block every write.
Thus, semi-synchronous replication is more common in practice. Here you make one follower synchronous, and if that node fails or slows down, you make a different follower synchronous in its place.
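A rough sketch of what semi-synchronous replication looks like, assuming a single synchronous follower and fire-and-forget delivery to the rest (all names are made up for illustration):

```python
class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, change):
        key, value = change
        self.data[key] = value
        return True                                 # acknowledgement back to the leader


class SemiSyncLeader:
    def __init__(self, sync_follower, async_followers):
        self.data = {}
        self.sync_follower = sync_follower
        self.async_followers = async_followers

    def write(self, key, value):
        change = (key, value)
        self.data[key] = value
        acked = self.sync_follower.apply(change)    # block until the sync follower confirms
        for f in self.async_followers:              # don't wait for these
            f.apply(change)
        return acked                                # success once leader + one follower have the write

    def promote_new_sync_follower(self):
        # If the synchronous follower fails or becomes slow, one of the
        # asynchronous followers is made synchronous in its place.
        self.sync_follower = self.async_followers.pop(0)
```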
Setting up new followers
How do you add new followers to an existing data system?
This isn’t as simple as it may seem - remember that the leader is constantly propagating write changes, so you can’t simply make a direct copy of it; the data is always in flux. And ideally, you could add the follower without downtime from locking the database.
The process typically looks like the following
- Take a consistent snapshot at some specific point in time
- Copy this snapshot to new follower node
- Request all data since the point in time the snapshot was taken
- Sync the node by going through the data backlog
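Roughly, in code (leaning on the toy Leader/Follower classes from the earlier sketch; the log-position bookkeeping is the important part):

```python
def bootstrap_follower(leader, follower):
    # 1. Take a consistent snapshot and remember its position in the replication log.
    snapshot_position = len(leader.replication_log)
    snapshot = dict(leader.data)

    # 2. Copy the snapshot onto the new follower node.
    follower.data = snapshot

    # 3 + 4. Request every change since the snapshot position and replay the backlog.
    #        (In a real system, writes keep arriving while steps 1-3 run, so this
    #        backlog is usually non-empty.)
    for change in leader.replication_log[snapshot_position:]:
        follower.apply(change)

    # From here on, the follower just consumes the live replication stream.
```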
Handling Node Outages
How does one achieve high availability with the leader-based replication system we’ve outlined? Let’s approach it in terms of the two kinds of node failure we might see.
Follower failure
In the case of a follower node failing or disconnecting from the network, it can perform what’s known as catch-up recovery. It should have a local disk where it has logged all of its data changes, so it can simply catch up on the changes transmitted by the leader since its outage.
Leader failure
When the leader node fails, we must (i) promote a new leader and (ii) switch the follower nodes to begin reading from that new leader. This process is called failover.
Typical automatic failover process
- Detect failure: commonly, consider the leader node dead if it doesn’t respond for some predetermined stretch of time.
- Choose new leader: either chosen through an election process or by a controller node that was previously appointed to make the decision.
- Reconfiguration: ensure that clients send their writes to the new leader and that the follower nodes recognize the new leader.
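A toy version of the detection and election steps, just to pin down the moving parts (the 30-second timeout and the “longest log wins” rule are assumptions for illustration):

```python
import time

HEARTBEAT_TIMEOUT_S = 30.0                 # illustrative; tuning this is the hard part

def leader_looks_dead(last_heartbeat_ts: float) -> bool:
    # Detect failure: no heartbeat within the timeout => assume the leader is dead.
    return time.time() - last_heartbeat_ts > HEARTBEAT_TIMEOUT_S

def elect_new_leader(followers: list) -> dict:
    # Choose new leader: prefer the follower with the most complete log,
    # to minimise the number of lost writes.
    return max(followers, key=lambda f: f["log_index"])

followers = [{"name": "follower-1", "log_index": 120},
             {"name": "follower-2", "log_index": 117}]

if leader_looks_dead(last_heartbeat_ts=time.time() - 60):
    new_leader = elect_new_leader(followers)
    # Reconfiguration: clients and followers are now pointed at new_leader.
    print(new_leader["name"])
```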
There are many things that can go wrong here. To name just a few-
- What if the former leader had writes which hadn’t yet been replicated to the new leader?
- Split brain: in some situations, you may end up with two leaders accepting writes.
- What is the right timeout before declaring leader dead?
Replication Logs Implementation
Until now, we’ve glossed over the actual implementation of leader-based replication logs. Let’s look into the most common methods.
Statement-Based Replication
The most basic replication method. In it, the leader logs every write request (statement) it executes and forwards that statement log to its followers (e.g. it sends the INSERT, UPDATE, and DELETE statements themselves).
Issues
- Non-deterministic functions (e.g. NOW(); illustrated below)
- Statements which must be executed in the same order
- Statements with non-deterministic side effects
Real-world usage: VoltDB (ensures safety by requiring deterministic transactions)
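To see why non-determinism is a problem, here’s a toy illustration (not real SQL execution): if the statement itself is replicated, each replica evaluates NOW() against its own clock and can end up storing a different value.

```python
import datetime

statement = "INSERT INTO events (created_at) VALUES (NOW())"

def execute_on(node_name: str, stmt: str) -> str:
    # Each node substitutes its own clock reading for NOW() at execution time.
    resolved = stmt.replace("NOW()", datetime.datetime.now().isoformat())
    return f"{node_name}: {resolved}"

print(execute_on("leader", statement))
print(execute_on("follower", statement))   # may differ from the leader's value
```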
Write-ahead log shipping
The leader appends every write to a log (an append-only sequence of bytes describing all writes to the database) before applying it, and ships that same log to its followers.
- These log entries are much lower-level than statements, which couples the replication closely to the storage engine
Real-world usage: PostgreSQL and Oracle use it.
Logical (row-based) log replication
Use different log formats for replication (the logical log) and for the storage engine internals. This decoupling makes it easier to keep the replication log backward compatible.
Logical log: a sequence of records describing writes at the granularity of a row. For each write, the log contains…
- Inserted row: new values of all columns
- Deleted row: enough information to uniquely identify the row
- Updated row: enough information to uniquely identify the row + the new values of its columns
Real-world usage: MySQL binlog uses this approach
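As a concrete picture, the three record kinds above might look something like this (field names are made up for illustration, not MySQL’s actual binlog format):

```python
insert_record = {
    "op": "insert",
    "table": "users",
    "row": {"id": 42, "name": "Ada", "email": "ada@example.com"},  # all column values
}

delete_record = {
    "op": "delete",
    "table": "users",
    "key": {"id": 42},                      # enough to uniquely identify the row
}

update_record = {
    "op": "update",
    "table": "users",
    "key": {"id": 42},                      # identify the row...
    "changes": {"email": "ada@new.com"},    # ...plus the new values of changed columns
}
```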
Trigger-based Replication
What happens if you want to do things like only replicate a subset of the data or replicate between database types?
You may need to move replication into the application layer! But how? With triggers!
Triggers are custom application code that you’ve registered to execute automatically when a data change occurs in the database system.
Real-world usage: Databus for Oracle and Bucardo for Postgres
Problems with Replication Lag
Replication is essential for tolerating node failures, improving scalability, and reducing latency
Leader-based replication requires all writes to go through a single node, while reads can be served by any replica (a read-scaling architecture)
- Very attractive for read-heavy workloads
- Only realistic for asynchronous replication (can’t synchronously replicate to all followers)
The delay between a write on the leader and it appearing on the replicas is known as replication lag
- Can become a problem at a large enough lag and scale
Reading your own writes
Issue: many apps let users submit data and then view what they submitted
- What issues might be caused by replication lag?
- Well, you might read from a stale replica and not see the data you just wrote
Fix?
- Read-after-write consistency: a guarantee that users will see any updates they themselves submitted
- How? A few techniques
- If reading something the user may have modified, read from the leader (e.g. read the user’s own profile from the leader)
- Use other criteria to decide whether to read from the leader - e.g. track replication lag, or how recent the user’s last submission was
- The client can remember the timestamp of its most recent write and ensure the replica it reads from has received writes up to that point in time
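A minimal sketch of that last technique, the client remembering the timestamp of its most recent write (the Leader/Replica interfaces here are assumptions):

```python
import time

class ReadYourWritesClient:
    def __init__(self, leader, replicas):
        self.leader = leader
        self.replicas = replicas
        self.last_write_ts = 0.0

    def write(self, key, value):
        self.leader.write(key, value)
        self.last_write_ts = time.time()       # remember when we last wrote

    def read(self, key):
        replica = self.replicas[0]             # whatever routing would normally pick
        if replica.replicated_up_to() >= self.last_write_ts:
            return replica.read(key)           # replica has caught up past our write
        return self.leader.read(key)           # otherwise fall back to the leader
```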
But what about multiple devices? That’s harder
- No guarantee connections will route to same datacenter
- How do you share the last-write timestamp across devices?
Monotonic Reads
Issue: a user sees things moving backwards in time
- When you read from a replica that is further ahead, and then afterwards from a replica that still hasn’t received the updates
How to fix? monotonic reads
- Basically, a guarantee that a sequence of reads won’t go backwards in time
- Sample implementation: make each user always read from the same replica
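For example, one simple routing rule, assuming users are identified by a stable ID:

```python
import hashlib

def replica_for(user_id: str, replicas: list) -> str:
    # Hash the user ID so the same user always reads from the same replica,
    # which is enough to guarantee monotonic reads for that user.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

replicas = ["replica-0", "replica-1", "replica-2"]
print(replica_for("user:123", replicas))   # deterministic choice for this user
```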
Consistent prefix Reads
Issue: causality broken
- e.g. messages are replicated with different lag, so an answer appears before its question
Sample solution: make sure writes that are causally related to each other are written to the same partition
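A tiny sketch of that idea, assuming messages carry a conversation ID we can partition on:

```python
from collections import defaultdict

NUM_PARTITIONS = 8
partitions = defaultdict(list)     # partition id -> ordered sequence of messages

def post(conversation_id: str, text: str):
    pid = hash(conversation_id) % NUM_PARTITIONS
    partitions[pid].append((conversation_id, text))   # question and answer share a partition

post("conv:42", "Did the deploy go out?")
post("conv:42", "Yes, it's live.")
# Every replica of that one partition replays the list in order, so no reader
# can see the answer without the question.
```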
Solutions for Replication Lag
If you’re working with an eventually consistent system, ask “how will the application behave if the replication lag increases?”
Multi-Leader Replication
Big con with leader-based replication: there’s only one leader and all writes need to go through it.
Multi-leader: more than one node can accept writes. Also known as:
- Master-master replication
- Active-active replication
Use cases
Not very useful within a single datacenter, but here are some situations where it makes sense
- Multi-Datacenter Operation
- Leader in each datacenter
- Higher performance, higher tolerance of outages/network problems
- Clients with Offline Operation
- Local database which acts as leader
- Collaborative Editing
- See ‘automatic conflict resolution’ in the book TODO
Handling Write Conflicts
But now we need conflict resolution mechanisms
Conflict avoidance
The simplest (and quite common) strategy for dealing with conflicts is to avoid them
But how? Well you could make all writes for a specific record go through the same leader.
Converging towards a consistent database
We know that in single leader databases, the writes are applied in a sequential order
But we don’t have a defined ordering of writes in a multi-leader configuration
- What should the final value be?
We can’t have each replica apply writes in the order it sees them - the conflicts must be resolved in a convergent way
A couple of methods of convergent conflict resolution
- Each write is given a unique ID - pick the write with the highest ID as the winner (last write wins, LWW; sketched after this list)
- Prone to data loss
- Each write is given a unique ID, and writes which originate at a higher-numbered replica take precedence
- Merge the values somehow
- Record the conflict explicitly and have application code resolve the conflict later on
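The first option (LWW) in a few lines, with a timestamp-plus-random-suffix standing in for the unique ID (an illustrative choice):

```python
import time, uuid

def make_write(value):
    # Unique, roughly time-ordered ID attached to every write.
    return {"id": (time.time(), uuid.uuid4().hex), "value": value}

def resolve_lww(conflicting_writes):
    # Every replica picks the write with the highest ID, so they all converge -
    # but the "losing" concurrent writes are silently thrown away.
    return max(conflicting_writes, key=lambda w: w["id"])

a = make_write("written at leader A")
b = make_write("written at leader B")
print(resolve_lww([a, b])["value"])
```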
Custom conflict resolution logic
Resolving a conflict may be application-dependent
Your code can be executed either on write or on read
- On write: as soon as the database detects a conflict in the log of replicated changes, it calls the conflict handler
- On read: When a conflict is detected, all the conflicting writes are stored. The next time the data is read, these multiple versions of the data are returned to the application- which deals with it.
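For example, an application-side merge of the conflicting versions returned on read (the shopping-cart union rule is just one possible merge function):

```python
def merge_carts(siblings):
    # The database returned every conflicting version; the application
    # merges them (here, by taking the union of the cart contents).
    merged = set()
    for version in siblings:
        merged |= version["cart"]
    return {"cart": merged}

siblings = [{"cart": {"milk", "bread"}},
            {"cart": {"milk", "eggs"}}]
print(merge_carts(siblings))   # {'cart': {'milk', 'bread', 'eggs'}}
```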
What is a conflict? The obvious case: two writes concurrently modifying the same record
Also subtler cases - e.g. a meeting room booking system where multiple bookings are made for the same room at the same time
Multi-leader Replication Topologies
Replication topology: the communication paths along which writes propagate from one node to another
- Most general: all-to-all
- Each leader sends writes to all other leaders
- Circular topology
- Node receives writes from one node and forwards those writes to another node
- Star Topology
- One designated root node forwards writes to all the other nodes
Leaderless Replication
Replication so far has been based on the idea that a client sends a write request to one node and the database copies that write to the other replicas
But what about leaderless replication, where replicas accept writes directly from clients?
A key distinction between different leaderless replication methods
- Does the client write directly to several replicas, or does a coordinator node do this on its behalf?
Writing to database when node is down
Failover doesn’t exist when you’re using leaderless replication
- The client sends the write request to several nodes in parallel
- But how does a node catch up after failing and coming back? 2 mechanisms
- Read repair: if a client detects stale data (i.e. 3/4 nodes say something different than the remaining 1/4), it writes the correct data back to the stale replica
- Anti-entropy process: a background process which looks for differences in data between replicas
- Quorums for reading and writing: the number of nodes that must confirm a write/read for it to be considered successful
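The key arithmetic behind quorums: with n replicas, w write acknowledgements, and r read responses, requiring w + r > n guarantees that every read overlaps with at least one node that saw the latest write.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    # If w + r > n, the set of nodes written to and the set of nodes read from
    # must share at least one node holding the latest value.
    return w + r > n

print(quorum_overlaps(n=3, w=2, r=2))   # True  - a common configuration
print(quorum_overlaps(n=3, w=1, r=1))   # False - a read may miss the latest write
```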
Limitations of Quorum Consistency
Even with conservative parameters, there are likely edge cases where stale values are returned
- These are pretty easy to think of - some unfortunate failure cases in timing and partial failures
In leaderless replication, how do you monitor for staleness?
Sloppy Quorums and Hinted Handoffs
Databases with well-chosen quorums can tolerate the failure of individual nodes without needing failover
But quorums are not as fault-tolerant as they could be
- A network interruption could cut off a client from a large number of the database nodes
How to address this? A sloppy quorum is the practice of still accepting writes and reads even when they land on nodes other than a value’s designated “home” nodes
- Hinted handoff: once the network comes back online, writes which were accepted temporarily by non-home nodes are sent to the appropriate home nodes
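A rough sketch of the mechanics, with made-up data structures:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []                     # (home_node, key, value) held for others

    def store(self, key, value):
        self.data[key] = value

def sloppy_write(key, value, home: "Node", fallback: "Node"):
    if home.up:
        home.store(key, value)
    else:
        fallback.store(key, value)                  # sloppy quorum: accept the write anyway
        fallback.hints.append((home, key, value))   # remember where it really belongs

def hinted_handoff(fallback: "Node"):
    # Once the home nodes are reachable again, forward the writes held for them.
    remaining = []
    for home, key, value in fallback.hints:
        if home.up:
            home.store(key, value)
        else:
            remaining.append((home, key, value))
    fallback.hints = remaining
```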
Detecting Concurrent Writes
Even with strict quorums, some databases allow several clients to write to the same key concurrently - so conflicts will still occur
In order to become eventually consistent, replicas should converge to the same value.
- Databases aren’t great at this currently
Important to know all the failure cases and how to handle this on the application side!
Lots of potential solutions here.
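One technique in this space (not the only one) is a version vector: each key carries a counter per replica, and comparing two vectors tells you whether one write supersedes the other or whether they are genuinely concurrent and need conflict resolution.

```python
def compare(vv_a: dict, vv_b: dict) -> str:
    # Compare two version vectors, e.g. {"node1": 2, "node2": 1}.
    nodes = set(vv_a) | set(vv_b)
    a_ge = all(vv_a.get(n, 0) >= vv_b.get(n, 0) for n in nodes)
    b_ge = all(vv_b.get(n, 0) >= vv_a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a happened after b"
    if b_ge:
        return "b happened after a"
    return "concurrent"   # neither dominates: a genuine conflict to resolve

print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1}))  # a happened after b
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 3}))  # concurrent
```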