Replicated Data Queue Manager
HA RDQM Ingredients:
- IBM MQ (https://www.ibm.com/docs/en/ibm-mq)
- Corosync (http://corosync.github.io/corosync/)
- Pacemaker (https://clusterlabs.org/pacemaker/)
- DRBD (https://linbit.com/drbd/)
What each of these three components does:
- Corosync ensures reliable communication and state synchronization between cluster nodes.
- Pacemaker manages the failover and availability of the IBM MQ queue managers.
- DRBD provides real-time data replication between nodes to ensure data consistency and availability.
RDQM (replicated data queue manager) is a high availability solution available on Linux® platforms.
An RDQM configuration consists of three servers configured in a high availability (HA) group, each with an instance of the queue manager. One instance is the running queue manager, which synchronously replicates its data to the other two instances. If the server running this queue manager fails, another instance of the queue manager starts and has current data to operate with. The three instances of the queue manager share a floating IP address, so clients only need to be configured with a single IP address.
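To make the failover behaviour concrete, here is a minimal Python sketch of what this paragraph describes. It is illustrative only, not IBM MQ code; every name in it (`HAGroup`, `put`, `FLOATING_IP`, and so on) is invented for the example. The point it shows: because every write is replicated synchronously to all live instances, any surviving secondary can take over with current data behind the same floating IP.

```python
# Minimal sketch of synchronous replication with failover behind a
# floating IP. Illustrative only -- not IBM MQ code; all names invented.

class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = []                      # replicated queue manager data

class HAGroup:
    FLOATING_IP = "10.0.0.100"              # clients only know this address

    def __init__(self):
        self.nodes = [Node("node1"), Node("node2"), Node("node3")]
        self.primary = self.nodes[0]

    def put(self, message):
        # Synchronous replication: the write completes only once every
        # live node has it, so every secondary always holds current data.
        for node in self.nodes:
            if node.alive:
                node.data.append(message)

    def fail(self, node):
        node.alive = False
        if node is self.primary:
            # Failover: any surviving secondary already has identical data.
            self.primary = next(n for n in self.nodes if n.alive)
            print(f"{self.primary.name} now serves {self.FLOATING_IP}")

group = HAGroup()
group.put("order-1")
group.fail(group.nodes[0])                  # primary server dies
print(group.primary.data)                   # ['order-1'] -- nothing lost
```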
Only one instance of the queue manager can run at any one time, even if the HA group becomes partitioned due to network problems. The server running the queue manager is known as the ‘primary’; the other two servers are known as ‘secondaries’.
Three nodes are used to greatly reduce the possibility of a split-brain situation arising. In a two-node high availability system, split-brain can occur when the connectivity between the two nodes is broken. With no connectivity, both nodes could run the queue manager at the same time, accumulating different data. When connectivity is restored, there are two different versions of the data (a ‘split-brain’), and manual intervention is required to decide which data set to keep and which to discard.
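The following short Python sketch illustrates that divergence (plain illustrative code, nothing RDQM-specific): while the link is down, each node promotes itself, both accept writes, and two incompatible histories exist once connectivity returns.

```python
# Sketch of a two-node split-brain. Illustrative only.

node_a = {"role": "primary",   "data": ["msg-1"]}
node_b = {"role": "secondary", "data": ["msg-1"]}

link_up = False                    # the link between the nodes breaks

if not link_up:
    # Neither node can tell a dead peer from a cut cable, so each
    # assumes the other has failed and promotes itself.
    node_a["role"] = "primary"
    node_b["role"] = "primary"

node_a["data"].append("msg-2a")    # clients reaching A write here
node_b["data"].append("msg-2b")    # clients reaching B write there

link_up = True                     # connectivity is restored
if node_a["data"] != node_b["data"]:
    # Two divergent data sets: an operator must pick one and discard
    # the other. This is exactly what quorum is designed to prevent.
    print("split-brain:", node_a["data"], "vs", node_b["data"])
```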
RDQM uses a three-node system with quorum to avoid the split-brain situation. Nodes that can communicate with at least one of the other nodes form a quorum. Queue managers can only run on a node that has quorum. The queue manager cannot run on a node that is not connected to at least one other node, so it can never run on two nodes at the same time (see the sketch after this list):
- If a single node fails, the queue manager can run on one of the other two nodes. If two nodes fail, the queue manager cannot run on the remaining node, because that node does not have quorum (it cannot tell whether the other two nodes have failed or whether they are still running and it has merely lost connectivity).
- If a single node loses connectivity, the queue manager cannot run on this node because the node does not have quorum. The queue manager can run on one of the remaining two nodes, which do have quorum. If all nodes lose connectivity, the queue manager is unable to run on any of the nodes, because none of the nodes have quorum.
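Here is a minimal sketch of the quorum rule above in Python. It is illustrative only; in a real RDQM deployment the decision is made by Pacemaker and Corosync, not by application code like this. A node has quorum only if it can reach at least one other node, and the queue manager may run only on a node with quorum (Pacemaker still runs it on exactly one of the eligible nodes).

```python
# Sketch of the RDQM quorum rule. Illustrative only -- the real decision
# is made by Pacemaker/Corosync.

NODES = ["node1", "node2", "node3"]

def has_quorum(node, links):
    """A node has quorum if it can reach at least one other node.
    `links` is a set of frozensets naming the working connections."""
    return any(frozenset({node, other}) in links
               for other in NODES if other != node)

def runnable_on(links):
    """Nodes where the queue manager is *allowed* to run; the cluster
    manager still picks exactly one of them to actually run it."""
    return [n for n in NODES if has_quorum(n, links)]

all_links = {frozenset(p) for p in
             [("node1", "node2"), ("node1", "node3"), ("node2", "node3")]}

# node1 loses all connectivity: it has no quorum, so the queue manager
# cannot run there; node2 and node3 still see each other and do.
partitioned = {frozenset({"node2", "node3"})}
print(runnable_on(all_links))    # ['node1', 'node2', 'node3']
print(runnable_on(partitioned))  # ['node2', 'node3']
print(runnable_on(set()))        # [] -- total loss: nowhere to run
```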
You also need to know the replication protocol used by DRBD; for background, read Understanding DRBD Replication Modes: Ensuring Data Integrity and Performance.
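For orientation, here is a hedged sketch of DRBD's three replication protocols in Python (illustrative pseudologic with invented names, not DRBD source code). They differ only in how far a write must travel before it is acknowledged to the application; Protocol C, which waits for the peer's disk, is the fully synchronous mode that RDQM's synchronous replication relies on.

```python
# Sketch of DRBD's three replication protocols. They differ in how far
# a write must get before the application sees an acknowledgement.

class Peer:
    def __init__(self):
        self.memory = []   # replica's buffer cache
        self.disk = []     # replica's persistent storage

def write(block, protocol, local_disk, peer):
    local_disk.append(block)        # the write always lands locally first
    if protocol == "A":             # asynchronous: ack as soon as the
        return "ack"                # block is queued for sending; the
                                    # peer may lag behind on failure
    peer.memory.append(block)
    if protocol == "B":             # memory-synchronous: ack once the
        return "ack"                # block reached the peer's memory
    peer.disk.append(block)
    if protocol == "C":             # synchronous: ack only after the peer
        return "ack"                # has persisted the block -- the mode
                                    # RDQM relies on for zero data loss

local_disk, peer = [], Peer()
print(write("b1", "C", local_disk, peer), peer.disk)  # ack ['b1']
```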