Understanding DRBD Replication Modes: Ensuring Data Integrity and Performance
Distributed Replicated Block Device (DRBD) is a powerful tool used to mirror data across multiple servers, enhancing data reliability and availability. One of the key features of DRBD is its support for three distinct replication protocols, each offering different levels of data synchronicity and protection. Understanding these replication modes is crucial for configuring DRBD to meet your specific needs for data consistency, performance, and disaster recovery.
Documentation: User Guides and Product Documentation
Here is how DRBD resource files are typically laid out:
- DRBD resource files are typically named after the resource they define, such as drbdpool.res or similar.
- These files are located in /etc/drbd.d/
- A typical resource file will look like this:
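resource mqdrbd {
  protocol C;                      # replication protocol: A, B, or C
  on node1 {
    device /dev/drbdpool;          # DRBD block device used by applications
    disk /dev/sda;                 # backing block device on this node
    address 192.168.1.1:7789;      # IP address and port for replication traffic
    meta-disk internal;            # DRBD metadata is kept on the backing device
  }
  on node2 {
    device /dev/drbdpool;
    disk /dev/sda;
    address 192.168.1.2:7789;
    meta-disk internal;
  }
}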
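If you keep several resource files around, a quick way to see which protocol each one declares is to scan the text for the protocol statement. The following is a minimal Python sketch, not a DRBD tool: the function name declared_protocols and the regular expressions are illustrative assumptions, and it assumes the protocol is written as "protocol X;" with "#" starting a comment.

```python
import re

# Matches a statement such as "protocol C;" anywhere in the text.
PROTOCOL_RE = re.compile(r"\bprotocol\s+([ABC])\s*;")


def declared_protocols(resource_text):
    """Return every protocol letter declared in the given resource text."""
    # Drop "#" comments so a commented-out statement is not reported.
    without_comments = re.sub(r"#.*", "", resource_text)
    return PROTOCOL_RE.findall(without_comments)


# Hypothetical sample text modeled on the resource file above.
sample = """
resource mqdrbd {
  protocol C;
  on node1 {
    device /dev/drbdpool;
    disk /dev/sda;
  }
}
"""

print(declared_protocols(sample))
```

For a real file, you would read its contents first (for example with open(path).read()) and pass the text to the function.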
Let’s look at each of these protocols in turn.
Protocol A: Asynchronous Replication
Official definition: Asynchronous replication protocol. Local write operations on the primary node are considered completed as soon as the local disk write has finished, and the replication packet has been placed in the local TCP send buffer. In the event of forced fail-over, data loss may occur. The data on the standby node is consistent after fail-over, however, the most recent updates performed prior to the crash could be lost. Protocol A is most often used in long distance replication scenarios. When used in combination with DRBD Proxy it makes an effective disaster recovery solution.
Overview: Protocol A represents the asynchronous replication mode. In this mode, local write operations on the primary node are considered complete once the local disk write has finished, and the replication packet has been placed in the local TCP send buffer. The data is then asynchronously replicated to the secondary node.
Implications:
- Data Consistency: In the event of a forced fail-over, there is a risk of data loss because the most recent updates performed on the primary node might not have been fully replicated to the secondary node. However, the data on the standby node will be consistent as of the last successful replication.
- Use Cases: Protocol A is particularly effective for long-distance replication scenarios where network latency is a concern. When combined with DRBD Proxy, it can provide a robust disaster recovery solution by mitigating the impact of network latency on replication performance.
Example Scenario: Protocol A is commonly used when the primary and secondary nodes are geographically dispersed, and network latency is unavoidable. It is also beneficial in scenarios where the focus is on disaster recovery rather than real-time data synchronization.
In short, the protocol is asynchronous: there can be a delay between a write on the primary node and its replication to the standby node.
- How It Works: When you write data on the primary server, that write operation is considered complete as soon as the data is written to the local disk and the replication packet is placed in the local TCP send buffer. The primary server doesn’t wait for the secondary server to receive or confirm the data before considering the operation complete.
- Pros: This method is fast because the primary server doesn’t have to wait for the secondary server.
- Cons: If the primary server fails before the data is replicated to the secondary server, some data can be lost.
- Best Use: Long-distance replication where speed is crucial, and some data loss is acceptable in rare cases.
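To make the data-loss window concrete, here is a toy Python model of a Protocol A primary. It is a sketch under simplifying assumptions, not DRBD code: the class AsyncPrimary, its methods, and the block names are all hypothetical, and the network is reduced to an in-order queue.

```python
from collections import deque


class AsyncPrimary:
    """Toy model of a Protocol A primary; not DRBD code."""

    def __init__(self):
        self.local_disk = []        # writes completed on the primary's disk
        self.send_buffer = deque()  # replication packets queued, not yet delivered
        self.peer_disk = []         # writes the secondary has applied

    def write(self, block):
        # The write is reported complete after these two steps,
        # without waiting for the peer.
        self.local_disk.append(block)
        self.send_buffer.append(block)

    def deliver(self, count):
        # The network drains the buffer in order, some time later.
        for _ in range(min(count, len(self.send_buffer))):
            self.peer_disk.append(self.send_buffer.popleft())

    def crash(self):
        # Packets still in the send buffer never reach the peer.
        lost = list(self.send_buffer)
        self.send_buffer.clear()
        return lost


primary = AsyncPrimary()
for block in ["w1", "w2", "w3", "w4", "w5"]:
    primary.write(block)

primary.deliver(3)        # only three packets arrive before the failure
lost = primary.crash()

print(primary.peer_disk)  # ['w1', 'w2', 'w3']
print(lost)               # ['w4', 'w5']
```

In this model the secondary ends up with the first three writes in order, which is the sense in which its data is consistent but not up to date: the two newest writes were acknowledged to the application and are missing after fail-over.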
Protocol B: Memory Synchronous (Semi-Synchronous) Replication
Official definition: Memory synchronous (semi-synchronous) replication protocol. Local write operations on the primary node are considered completed as soon as the local disk write has occurred, and the replication packet has reached the peer node. Normally, no writes are lost in case of forced fail-over. However, in the event of simultaneous power failure on both nodes and concurrent, irreversible destruction of the primary’s data store, the most recent writes completed on the primary may be lost.
Overview: Protocol B, or memory synchronous replication, provides a balance between performance and data safety. In this mode, local write operations are considered complete once the local disk write has occurred and the replication packet has reached the peer node. This approach ensures that the secondary node has received the replication data, thus minimizing the risk of data loss during a fail-over.
Implications:
- Data Consistency: Normally, no writes are lost in case of a forced fail-over, as the data has been acknowledged by the peer node. However, if both nodes lose power simultaneously and the primary’s data store is irreversibly destroyed at the same time, the most recent writes on the primary node might be lost.
- Use Cases: Protocol B is suitable for environments where data integrity is important but a small risk of losing the most recent writes in such a double failure is acceptable. It is often used where nodes are located in close proximity, so waiting for the peer’s acknowledgment adds little latency.
Example Scenario: Protocol B is ideal for deployments where ensuring data integrity is critical, but the slight possibility of data loss is acceptable in rare catastrophic situations. It is often used in high-availability setups where nodes are on the same local network or in close geographical proximity.
- How It Works: Here, the primary server waits until the data is written to its own disk and the secondary server’s memory (but not necessarily its disk) before considering the write operation complete. This ensures the secondary server has received the data.
- Pros: Typically, no data is lost if you have to switch to the secondary server. The remaining risk is narrow: both servers lose power at the same time and the primary’s storage is destroyed.
- Cons: It’s a bit slower than Protocol A because the primary server has to wait for the secondary server to acknowledge the data.
- Best Use: When you need better data safety than Protocol A but still want relatively good performance.
Protocol C: Synchronous Replication
Official definition: Synchronous replication protocol. Local write operations on the primary node are considered completed only after both the local and the remote disk write have been confirmed. As a result, loss of a single node is guaranteed not to lead to any data loss. Data loss is, of course, inevitable even with this replication protocol if both nodes (or their storage subsystems) are irreversibly destroyed at the same time.
Overview: Protocol C represents the synchronous replication mode, offering the highest level of data protection. In this mode, local write operations on the primary node are considered complete only after both the local and remote disk writes have been confirmed. This ensures that both nodes hold every acknowledged write, eliminating the risk of data loss when a single node fails.
Implications:
- Data Consistency: Protocol C guarantees that there is no data loss if a single node fails, as the data has been written to both nodes. However, if both nodes or their storage subsystems are destroyed simultaneously, data loss is inevitable.
- Use Cases: Protocol C is the most commonly used replication protocol in DRBD setups due to its strong consistency guarantees. It is particularly valuable in environments where data integrity is of utmost importance and the cost of data loss is high.
Example Scenario: Protocol C is preferred in environments where real-time data consistency and high availability are critical. It is commonly used in mission-critical applications and systems where the impact of data loss would be severe.
- How It Works: This is the most cautious approach. The primary server only considers a write operation complete after the data has been written both to its own disk and the secondary server’s disk.
- Pros: This method guarantees no data loss if one server fails, making it very reliable.
- Cons: It’s the slowest of the three protocols because the primary server must wait for the secondary server to confirm the data is safely stored.
- Best Use: High-availability systems where data loss is unacceptable, and you can afford the performance hit.
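The guarantees of the three protocols can be compared by asking where an acknowledged write is certain to exist at the moment the primary reports it complete. The Python sketch below encodes the protocol descriptions above as a small table. The names GUARANTEED_COPIES and survival_guaranteed and the location labels are illustrative assumptions, and the model only covers what is guaranteed: under Protocol A a write may well have reached the peer, but nothing promises it.

```python
# Where an acknowledged write is guaranteed to exist when the primary
# reports it complete, following the protocol descriptions above.
GUARANTEED_COPIES = {
    "A": {"primary_disk"},
    "B": {"primary_disk", "peer_memory"},
    "C": {"primary_disk", "peer_disk"},
}


def survival_guaranteed(protocol, lost_locations):
    """True if at least one guaranteed copy lies outside the lost locations."""
    return bool(GUARANTEED_COPIES[protocol] - set(lost_locations))


# Forced fail-over: the primary's copy is not available to the new primary.
failover = {"primary_disk"}
# Both nodes lose power and the primary's data store is destroyed.
double_failure = {"primary_disk", "peer_memory"}

for protocol in "ABC":
    print(
        protocol,
        survival_guaranteed(protocol, failover),
        survival_guaranteed(protocol, double_failure),
    )
# A False False
# B True False
# C True True
```

The table makes the step between the protocols visible: B adds a copy in the peer's memory, and C replaces it with a copy on the peer's disk, which is why only C rides out the double failure.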
Choosing the Right Protocol
The choice of replication protocol in DRBD affects two main factors:
- Protection: Higher protocol levels (B and C) provide better protection against data loss, with Protocol C offering the highest level of data consistency.
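- Latency: Protocols B and C add latency to each write because the primary waits for the peer. Protocol B waits until the replication packet has reached the peer node, and Protocol C waits until the remote disk write has been confirmed.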
Throughput, however, is largely independent of the replication protocol chosen and is primarily determined by the underlying hardware and network performance.
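A rough way to reason about the latency difference is a back-of-the-envelope model of how long the primary waits before reporting a write complete. The Python sketch below is an assumption-laden simplification, not a measurement: it assumes the local disk write and the replication proceed in parallel, treats queuing in the send buffer as free, and uses made-up timings in microseconds. The function name write_latency_us is hypothetical.

```python
def write_latency_us(protocol, local_disk_us, one_way_us, remote_disk_us):
    """Estimated time until the primary reports a write complete."""
    if protocol == "A":
        # Only the local disk write is waited for.
        return local_disk_us
    if protocol == "B":
        # The packet must reach the peer and its acknowledgment must return.
        return max(local_disk_us, 2 * one_way_us)
    if protocol == "C":
        # As for B, plus the peer's disk write before it acknowledges.
        return max(local_disk_us, 2 * one_way_us + remote_disk_us)
    raise ValueError("unknown protocol: " + protocol)


# Illustrative numbers only: 500 us disk writes on both nodes.
for label, one_way_us in [("LAN", 100), ("WAN", 20000)]:
    estimates = {p: write_latency_us(p, 500, one_way_us, 500) for p in "ABC"}
    print(label, estimates)
# LAN {'A': 500, 'B': 500, 'C': 700}
# WAN {'A': 500, 'B': 40000, 'C': 40500}
```

With these assumed numbers, the protocols are close together on a local network, while over a long-distance link the round trip dominates B and C, which matches the usual advice to use Protocol A for long-distance replication.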
You can read more about these topics in the official DRBD documentation.