Split-Brain Problem in zen-metastoredb on OCP 4.x
Disclaimer: All salt and password hash values shown here are dummy values.
In this article, I want to share an experience that was quite perplexing at first, because there were many components to check before I could pin down the one causing the problem.
I encountered an issue with inconsistent user Access Control data, where deleted users were still appearing in the web console’s display.
We expected the data to look like this.
From this condition, it can be seen that the data displayed on the web console is not consistent with the data that should be shown.
My team and I checked various OCP components, including iam-onboarding, zenmetastore-onboarding, the PVC level, and many others.
Finally, we found the component that was the actual root cause: zen-metastoredb.
Let's start with the pods list.
oc project cp4i
oc get pods | grep metastoredb
zen-metastoredb-0 1/1 Running 0 24h
zen-metastoredb-1 1/1 Running 0 24h
zen-metastoredb-2 1/1 Running 0 24h
Open a remote shell (rsh) in the pod you want to inspect.
oc rsh zen-metastoredb-0
oc rsh zen-metastoredb-1
oc rsh zen-metastoredb-2
To query the Cockroach database, follow these steps.
Ensure you have the necessary credentials and access to the Cockroach database.
Use the cockroach sql command to start a SQL shell, providing connection details such as the host and the certificates directory. Before that, copy the certificates to a writable location and restrict their permissions:
cp -r /certs/ /tmp/certs-new/
cd /tmp/ && chmod -R 0700 certs-new/
chmod 0600 /tmp/certs-new/*
cd /cockroach/
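The copy-and-chmod steps above can be wrapped in a small function so that they are easy to repeat after a pod restart. This is only a sketch: prepare_certs is a hypothetical helper name, and the reason given for tightening the permissions (the cockroach client generally refuses key files that are readable by group or others) is my understanding rather than something verified on this cluster.

```shell
# Copy a certificate directory and restrict its permissions.
# Usage: prepare_certs <source_dir> <target_dir>
# prepare_certs is a hypothetical helper, not part of the product.
prepare_certs() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    # Copy the directory contents (not the directory itself) so that
    # running the function twice does not nest a second directory inside.
    cp -r "$src/." "$dst/" || return 1
    # Directories: owner-only access, as in the manual steps.
    chmod -R 0700 "$dst" || return 1
    # Regular files: owner read/write only.
    find "$dst" -type f -exec chmod 0600 {} +
}
```

Inside a pod, the call matching the manual steps would be prepare_certs /certs /tmp/certs-new.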
Run the command that matches the pod you are in.
./cockroach sql --certs-dir=/tmp/certs-new/ --host=zen-metastoredb-0.zen-metastoredb
./cockroach sql --certs-dir=/tmp/certs-new/ --host=zen-metastoredb-1.zen-metastoredb
./cockroach sql --certs-dir=/tmp/certs-new/ --host=zen-metastoredb-2.zen-metastoredb
Once you are connected, you can start executing SQL queries in the shell to interact with the database.
use zen;
show tables;
select * from platform_users;
root@zen-metastoredb-0.zen-metastoredb:26257/zen> select * from platform_users;
uid | username | displayName | email | salt | password_hash | apikey_hash | apikey_salt | approval_status | permissions | user_roles | current_account_status | internal_user | deletable | authenticator | created_timestamp | last_modified_timestamp | misc
-------------+-----------+-------------+-----------+------------------+----------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+------------------------+---------------+-----------+---------------+-------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------
1000330999 | admin | admin | -- | dc96a625f2347a6d | 102c94e4b04aadd8f9a2cgh6b00c77ee84e15dd7a58fd6e83a31ec80dfb03c1dbc0c5ad3700c3cf6bfd48b7cb23341ed0b215c5bc7cf3521e9b7abf0e3b583c8 | NULL | NULL | approved | {administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances,kubernetes_administrator} | {zen_administrator_role,iaf-automation-admin,iaf-automation-admin} | enabled | false | false | default | 1687760004025 | 1698939370514 | {"dark_mode": false, "last_session_ended_timestamp": 1698903053406, "last_session_start_timestamp": 1698903041}
1000331001 | kubeadmin | kubeadmin | kubeadmin | undefined | undefined | NULL | NULL | approved | {administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances,kubernetes_administrator} | {zen_administrator_role,iaf-automation-admin,iaf-automation-admin} | enabled | false | true | external | 1687761381292 | 1698939370898 | {"dark_mode": false, "last_session_ended_timestamp": 1698804571653, "last_session_start_timestamp": 1698804562, "sessionInfo": {}}
(2 rows)
root@zen-metastoredb-1.zen-metastoredb:26257/zen> select * from platform_users;
uid | username | displayName | email | salt | password_hash | apikey_hash | apikey_salt | approval_status | permissions | user_roles | current_account_status | internal_user | deletable | authenticator | created_timestamp | last_modified_timestamp | misc
-------------+-----------+-------------+-----------+------------------+----------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+------------------------+---------------+-----------+---------------+-------------------+-------------------------+------------------------------------------------------------------------------------------------------------------------------------------
1000330999 | admin | admin | -- | a7181dkle4a90a91 | 102c94e4b04aadd8f9a2cf06b00c77ee84e15dd7a58fd6e59031ec80dfb03c1dbc0c5ad3700c3cf6bfd48b7cb23341ed0b215c5bc7cf3521e9b7abf0e3b583c8 | NULL | NULL | approved | {administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances,kubernetes_administrator} | {zen_administrator_role,iaf-automation-admin,iaf-automation-admin} | enabled | false | false | default | 1698811637890 | 1698938555703 | {"dark_mode": false, "last_session_ended_timestamp": 1698925638452, "last_session_start_timestamp": 1698925007}
1000331001 | kubeadmin | | kubeadmin | undefined | undefined | NULL | NULL | approved | {administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances,kubernetes_administrator} | {zen_administrator_role,iaf-automation-admin,iaf-automation-admin} | enabled | false | true | external | 1698847763787 | 1698938556075 | {"dark_mode": false, "last_session_ended_timestamp": "", "last_session_start_timestamp": "", "realm_name": "OCP"}
1000331002 | prd02 | prd02 | | undefined | undefined | NULL | NULL | approved | {kubernetes_administrator,kubernetes_viewer,kubernetes_operator,kubernetes_editor,administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances} | {iaf-automation-admin,iaf-automation-analyst,iaf-automation-developer,iaf-automation-operator,zen_administrator_role} | enabled | false | true | external | 1698854515482 | 1698904676922 | {"dark_mode": false, "extAttributes": {}, "last_session_ended_timestamp": "", "last_session_start_timestamp": "", "realm_name": "ROKS"}
1000331003 | qas01 | qas01 | | undefined | undefined | NULL | NULL | approved | {kubernetes_administrator,kubernetes_viewer,kubernetes_operator,kubernetes_editor,administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances} | {iaf-automation-admin,iaf-automation-analyst,iaf-automation-developer,iaf-automation-operator,zen_administrator_role} | enabled | false | true | external | 1698854843112 | 1698854843112 | {"extAttributes": {}, "last_session_ended_timestamp": "", "last_session_start_timestamp": "", "realm_name": "ROKS"}
1000331004 | prd01 | prd01 | | undefined | undefined | NULL | NULL | approved | {kubernetes_administrator,kubernetes_viewer,kubernetes_operator,kubernetes_editor,administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances} | {iaf-automation-admin,iaf-automation-analyst,iaf-automation-developer,iaf-automation-operator,zen_administrator_role} | enabled | false | true | external | 1698903833284 | 1698903833284 | {"extAttributes": {}, "last_session_ended_timestamp": 1698904661231, "last_session_start_timestamp": 1698904562, "realm_name": "ROKS"}
(5 rows)
root@zen-metastoredb-2.zen-metastoredb:26257/zen> select * from platform_users;
uid | username | displayName | email | salt | password_hash | apikey_hash | apikey_salt | approval_status | permissions | user_roles | current_account_status | internal_user | deletable | authenticator | created_timestamp | last_modified_timestamp | misc
-------------+-----------+-------------+-----------+------------------+----------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+------------------------+---------------+-----------+---------------+-------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------
1000330999 | admin | admin | -- | dc96a6mkf2897a6d | 102c94e4b04aadd8f9a2cf06b00c77ee84e15dd7a58fd6e83a31ec80dfb03c1dbc0c5ad3700c3cf6bfd48b7cb23341ed0b215c5bc7cf3521e9b7abf0e3b583c8 | NULL | NULL | approved | {administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances,kubernetes_administrator} | {zen_administrator_role,iaf-automation-admin,iaf-automation-admin} | enabled | false | false | default | 1687760004025 | 1698939122922 | {"dark_mode": false, "last_session_ended_timestamp": 1698903053406, "last_session_start_timestamp": 1698903041}
1000331001 | kubeadmin | kubeadmin | kubeadmin | undefined | undefined | NULL | NULL | approved | {administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances,kubernetes_administrator} | {zen_administrator_role,iaf-automation-admin,iaf-automation-admin} | enabled | false | true | external | 1687761381292 | 1698939123284 | {"dark_mode": false, "last_session_ended_timestamp": 1698804571653, "last_session_start_timestamp": 1698804562, "sessionInfo": {}}
(2 rows)
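The same comparison can also be made from outside the pods, without opening three interactive shells. The following is a minimal sketch and not part of the original procedure: count_users_on_pod is a hypothetical helper name, and it assumes that the certificates were already copied to /tmp/certs-new/ in each pod and that the cockroach binary is in /cockroach/ as shown above.

```shell
# Print "<pod> <row count>" for the platform_users table on one pod.
# Assumption: certificates already exist in /tmp/certs-new/ inside the pod.
count_users_on_pod() {
    pod=$1
    oc exec "$pod" -- /cockroach/cockroach sql \
        --certs-dir=/tmp/certs-new/ \
        --host="$pod.zen-metastoredb" \
        --database=zen \
        --format=csv \
        --execute='SELECT count(*) FROM platform_users;' |
        # The CSV output is a header line followed by the number;
        # keep the last purely numeric line.
        awk -v pod="$pod" '/^[0-9]+$/ { n = $0 } END { print pod, n }'
}
```

A loop such as for i in 0 1 2; do count_users_on_pod "zen-metastoredb-$i"; done then prints one line per pod.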
In the initial condition, we observed a discrepancy in the data stored within the pods. The same query returned different results on the 3 pods:
- zen-metastoredb-0 (2 Users)
- zen-metastoredb-1 (5 Users)
- zen-metastoredb-2 (2 Users)
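Spotting the odd pod out in a list like this can be sketched in a few lines of shell. find_outliers is a hypothetical helper that reads "<pod> <count>" lines and prints the pods whose count differs from the most common one. Treat the result as a hint only: the majority is not automatically the correct copy of the data, and with no clear majority the output is ambiguous.

```shell
# Read "<pod> <count>" lines on stdin and print every pod whose count
# differs from the most frequent count.
find_outliers() {
    awk '
        { pod[NR] = $1; cnt[NR] = $2; seen[$2]++ }
        END {
            best = ""; max = 0
            # Find the count reported by the largest number of pods.
            for (c in seen) if (seen[c] > max) { max = seen[c]; best = c }
            # Report the pods that disagree with it.
            for (i = 1; i <= NR; i++) if (cnt[i] != best) print pod[i]
        }'
}
```

Fed with the three counts listed above, the function prints zen-metastoredb-1.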
At this point, we identified the fix: scale down the zen-metastoredb StatefulSet, delete the PVC of the problematic pod, and scale the StatefulSet back up.
Before proceeding, we need to determine the Persistent Volume Claims (PVCs) held by the zen-metastoredb pods.
oc get pvc | grep metastoredb
datadir-zen-metastoredb-0 Bound pvc-100560d2-40d3-496c-8736-e646940bddf9 10Gi RWO ocs-storagecluster-cephfs 105d
datadir-zen-metastoredb-1 Bound pvc-01128c21-aa10-4210-ae66-5ed0777c3e44 10Gi RWO ocs-storagecluster-cephfs 105d
datadir-zen-metastoredb-2 Bound pvc-27a742cf-95ea-498e-ba77-591b72926e1b 10Gi RWO ocs-storagecluster-cephfs 105d
In this case, the problematic pod is zen-metastoredb-1, which has a PVC named datadir-zen-metastoredb-1 bound to the volume pvc-01128c21-aa10-4210-ae66-5ed0777c3e44. Before deleting the PVC, scale down the StatefulSet to 0 replicas.
oc get statefulsets | grep metastoredb
oc scale sts zen-metastoredb --replicas=0
Wait until all the pods have passed through the Terminating state and OpenShift has completed the scale-down.
oc project cp4i
oc get pods | grep metastoredb
zen-metastoredb-0 1/1 Terminating 0 24h
zen-metastoredb-1 1/1 Terminating 0 24h
zen-metastoredb-2 1/1 Terminating 0 24h
oc project cp4i
oc get pods | grep metastoredb
(no output expected)
After running oc get pods, if no metastoredb pods are listed anymore, you can proceed to delete the PVC.
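Instead of re-running oc get pods by hand, the wait can be scripted. The sketch below is an illustration only: wait_until_gone is a hypothetical helper, and the default of 60 attempts with a 5 second pause is an arbitrary choice, not a recommended value.

```shell
# Poll until no pod whose name matches <pattern> is listed any more.
# Usage: wait_until_gone <pattern> [max_attempts] [sleep_seconds]
# Returns 0 when nothing matches, 1 when the pods are still present after
# max_attempts polls, 2 when `oc get pods` itself fails.
wait_until_gone() {
    pattern=$1
    attempts=${2:-60}
    delay=${3:-5}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Fail loudly instead of mistaking an oc error for "no pods left".
        pods=$(oc get pods --no-headers) || return 2
        if ! printf '%s\n' "$pods" | grep -q -- "$pattern"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For this case the call would be wait_until_gone metastoredb, run in the cp4i project.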
Make sure to delete only the PVC associated with the problematic pod; do not delete the wrong PVC.
oc delete pvc datadir-zen-metastoredb-1
persistentvolumeclaim "datadir-zen-metastoredb-1" deleted
oc get pvc | grep metastoredb
datadir-zen-metastoredb-0 Bound pvc-100560d2-40d3-496c-8736-e646940bddf9 10Gi RWO ocs-storagecluster-cephfs 105d
datadir-zen-metastoredb-2 Bound pvc-27a742cf-95ea-498e-ba77-591b72926e1b 10Gi RWO ocs-storagecluster-cephfs 105d
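Do not touch the Persistent Volume (PV) itself.
After successfully deleting the PVC, scale the StatefulSet back up. When the pod is recreated, the StatefulSet creates a new, empty PVC for it from its volume claim template.
oc scale sts zen-metastoredb --replicas=3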
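To reduce the risk of deleting the wrong PVC, the name can be derived from the pod name and checked against the cluster before anything is deleted. This is a sketch under assumptions: pvc_for_pod and pvc_exists are hypothetical helpers, and the "datadir" prefix is taken from the PVC list shown earlier (StatefulSet PVCs are named after the volume claim template followed by the pod name).

```shell
# Derive the PVC name that belongs to a StatefulSet pod.
# "datadir" is the claim template name seen in the PVC list.
pvc_for_pod() {
    printf 'datadir-%s\n' "$1"
}

# Succeed only if a PVC with exactly this name exists in the current project.
pvc_exists() {
    oc get pvc -o name | grep -qxF "persistentvolumeclaim/$1"
}
```

A cautious sequence would be pvc=$(pvc_for_pod zen-metastoredb-1), then pvc_exists "$pvc", and only after that the oc delete pvc command with the printed name.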
oc project cp4i
oc get pods | grep metastoredb
zen-metastoredb-0 1/1 Running 0 24h
zen-metastoredb-1 1/1 Running 0 24h
zen-metastoredb-2 1/1 Running 0 24h
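The check on this listing (three pods, all ready and Running) can be expressed as a small filter. all_running is a hypothetical helper that reads oc get pods lines on stdin; it is a sketch for this specific pod naming, not a general readiness check.

```shell
# Read `oc get pods` lines on stdin. Succeed only if exactly <expected>
# zen-metastoredb pods are listed and each is fully ready and Running.
# Usage: oc get pods --no-headers | all_running 3
all_running() {
    awk -v expected="$1" '
        $1 ~ /^zen-metastoredb-[0-9]+$/ {
            total++
            # READY column looks like "1/1": ready containers / all containers.
            split($2, ready, "/")
            if (ready[1] == ready[2] && $3 == "Running") ok++
        }
        END { exit !(total == expected && ok == expected) }'
}
```

The function prints nothing and reports the result through its exit status, so it can be used directly in an if statement or a retry loop.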
Once the metastoredb is running again, check inside the pods. You can run a shell inside the pods.
oc rsh zen-metastoredb-0
oc rsh zen-metastoredb-1
oc rsh zen-metastoredb-2
Copy the certificates again, since the pods were recreated by the scale-down.
cp -r /certs/ /tmp/certs-new/
cd /tmp/ && chmod -R 0700 certs-new/
chmod 0600 /tmp/certs-new/*
cd /cockroach/
./cockroach node status --certs-dir=/tmp/certs-new/ --host=zen-metastoredb-0.zen-metastoredb
./cockroach node status --certs-dir=/tmp/certs-new/ --host=zen-metastoredb-1.zen-metastoredb
./cockroach node status --certs-dir=/tmp/certs-new/ --host=zen-metastoredb-2.zen-metastoredb
./cockroach node status --certs-dir=/tmp/certs-new/ --host=zen-metastoredb-0.zen-metastoredb
id | address | sql_address | build | started_at | updated_at | locality | is_available | is_live
1 | zen-metastoredb-0.zen-metastoredb:26257 | zen-metastoredb-0.zen-metastoredb:26257 | v22.1.5 | 2023-11-02 16:05:39.244434 | 2023-11-02 16:08:57.272683 | | true | true
2 | zen-metastoredb-1.zen-metastoredb:26257 | zen-metastoredb-1.zen-metastoredb:26257 | v22.1.5 | 2023-11-02 16:05:41.789664 | 2023-11-02 16:08:59.82766 | | true | true
3 | zen-metastoredb-2.zen-metastoredb:26257 | zen-metastoredb-2.zen-metastoredb:26257 | v22.1.5 | 2023-11-02 16:05:39.219721 | 2023-11-02 16:08:57.634207 | | true | true
(3 rows)
./cockroach node status --certs-dir=/tmp/certs-new/ --host=zen-metastoredb-1.zen-metastoredb
id | address | sql_address | build | started_at | updated_at | locality | is_available | is_live
1 | zen-metastoredb-0.zen-metastoredb:26257 | zen-metastoredb-0.zen-metastoredb:26257 | v22.1.5 | 2023-11-02 16:05:39.244434 | 2023-11-02 16:08:57.272683 | | true | true
2 | zen-metastoredb-1.zen-metastoredb:26257 | zen-metastoredb-1.zen-metastoredb:26257 | v22.1.5 | 2023-11-02 16:05:41.789664 | 2023-11-02 16:08:59.82766 | | true | true
3 | zen-metastoredb-2.zen-metastoredb:26257 | zen-metastoredb-2.zen-metastoredb:26257 | v22.1.5 | 2023-11-02 16:05:39.219721 | 2023-11-02 16:08:57.634207 | | true | true
(3 rows)
./cockroach node status --certs-dir=/tmp/certs-new/ --host=zen-metastoredb-2.zen-metastoredb
id | address | sql_address | build | started_at | updated_at | locality | is_available | is_live
1 | zen-metastoredb-0.zen-metastoredb:26257 | zen-metastoredb-0.zen-metastoredb:26257 | v22.1.5 | 2023-11-02 16:05:39.244434 | 2023-11-02 16:08:57.272683 | | true | true
2 | zen-metastoredb-1.zen-metastoredb:26257 | zen-metastoredb-1.zen-metastoredb:26257 | v22.1.5 | 2023-11-02 16:05:41.789664 | 2023-11-02 16:08:59.82766 | | true | true
3 | zen-metastoredb-2.zen-metastoredb:26257 | zen-metastoredb-2.zen-metastoredb:26257 | v22.1.5 | 2023-11-02 16:05:39.219721 | 2023-11-02 16:08:57.634207 | | true | true
(3 rows)
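Reading three tables by eye works, but the liveness check can also be scripted against the CSV form of the same output. The sketch below assumes that node status is run with --format=csv and that is_live is the last column, as in the listings above; all_nodes_live is a hypothetical helper name.

```shell
# Read `cockroach node status --format=csv` output on stdin. Succeed only
# if exactly <expected> nodes are listed and every one has is_live=true.
all_nodes_live() {
    awk -F, -v expected="$1" '
        NR == 1 { next }                 # skip the CSV header row
        NF > 1 {
            nodes++
            if ($NF == "true") live++    # is_live is the last column
        }
        END { exit !(nodes == expected && live == expected) }'
}
```

Inside a pod, this could be combined as ./cockroach node status --certs-dir=/tmp/certs-new/ --host=zen-metastoredb-0.zen-metastoredb --format=csv | all_nodes_live 3, repeated for each host.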
Make sure that the node status is the same from every pod and that the is_live column is true for all three nodes. After that, check whether the data stored in the Cockroach database is now identical across all pods.
use zen;
show tables;
select * from platform_users;
root@zen-metastoredb-0.zen-metastoredb:26257/zen> select * from platform_users;
uid | username | displayName | email | salt | password_hash | apikey_hash | apikey_salt | approval_status | permissions | user_roles | current_account_status | internal_user | deletable | authenticator | created_timestamp | last_modified_timestamp | misc
-------------+-----------+-------------+-----------+------------------+----------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+------------------------+---------------+-----------+---------------+-------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------
1000330999 | admin | admin | -- | dc96a62554897a6d | 102c94e4b04aadd8f9a2cf06b00c77ee84d159d7a58fd6e83a31ec80dfb03c1dbc0c5ad3700x3cf6bfd48b7cb23341ed0b215c5bc7cf3521e9b7abf0e3b583c8 | NULL | NULL | approved | {administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances,kubernetes_administrator} | {zen_administrator_role,iaf-automation-admin,iaf-automation-admin} | enabled | false | false | default | 1687760004025 | 1698939370514 | {"dark_mode": false, "last_session_ended_timestamp": 1698903053406, "last_session_start_timestamp": 1698903041}
1000331001 | kubeadmin | kubeadmin | kubeadmin | undefined | undefined | NULL | NULL | approved | {administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances,kubernetes_administrator} | {zen_administrator_role,iaf-automation-admin,iaf-automation-admin} | enabled | false | true | external | 1687761381292 | 1698939370898 | {"dark_mode": false, "last_session_ended_timestamp": 1698804571653, "last_session_start_timestamp": 1698804562, "sessionInfo": {}}
(2 rows)
root@zen-metastoredb-1.zen-metastoredb:26257/zen> select * from platform_users;
uid | username | displayName | email | salt | password_hash | apikey_hash | apikey_salt | approval_status | permissions | user_roles | current_account_status | internal_user | deletable | authenticator | created_timestamp | last_modified_timestamp | misc
-------------+-----------+-------------+-----------+------------------+----------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+------------------------+---------------+-----------+---------------+-------------------+-------------------------+------------------------------------------------------------------------------------------------------------------------------------------
1000330999 | admin | admin | -- | a7181d1ea4a56a91 | 102c94e4b04aadd8f9a2cf06b00c77ee84e14dd9a58fd6e83a31ec80dfb03c1dbc0c5ad3700c3cf6bfd48b7cb23341ed0b215c5bc7cf3521e9b7abf0e3b583c8 | NULL | NULL | approved | {administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances,kubernetes_administrator} | {zen_administrator_role,iaf-automation-admin,iaf-automation-admin} | enabled | false | false | default | 1698811637890 | 1698938555703 | {"dark_mode": false, "last_session_ended_timestamp": 1698925638452, "last_session_start_timestamp": 1698925007}
1000331001 | kubeadmin | kubeadmin | kubeadmin | undefined | undefined | NULL | NULL | approved | {administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances,kubernetes_administrator} | {zen_administrator_role,iaf-automation-admin,iaf-automation-admin} | enabled | false | true | external | 1698847763787 | 1698938556075 | {"dark_mode": false, "last_session_ended_timestamp": "", "last_session_start_timestamp": "", "realm_name": "OCP"}
(2 rows)
root@zen-metastoredb-2.zen-metastoredb:26257/zen> select * from platform_users;
uid | username | displayName | email | salt | password_hash | apikey_hash | apikey_salt | approval_status | permissions | user_roles | current_account_status | internal_user | deletable | authenticator | created_timestamp | last_modified_timestamp | misc
-------------+-----------+-------------+-----------+------------------+----------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+------------------------+---------------+-----------+---------------+-------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------
1000330999 | admin | admin | -- | dc968905f2897a6d | 102c94e4b04a2dd9f9a2cf06b00c77ee84e15dd7a58fd6e83a31ec80dfb03c1dbc9c5ad3700c3cf6bfd48b7cb23341ed0b215c5bc7cf3521e9b7abf0e3b713c8 | NULL | NULL | approved | {administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances,kubernetes_administrator} | {zen_administrator_role,iaf-automation-admin,iaf-automation-admin} | enabled | false | false | default | 1687760004025 | 1698939122922 | {"dark_mode": false, "last_session_ended_timestamp": 1698903053406, "last_session_start_timestamp": 1698903041}
1000331001 | kubeadmin | kubeadmin | kubeadmin | undefined | undefined | NULL | NULL | approved | {administrator,can_provision,monitor_platform,configure_platform,view_platform_health,configure_auth,manage_users,manage_groups,manage_service_instances,kubernetes_administrator} | {zen_administrator_role,iaf-automation-admin,iaf-automation-admin} | enabled | false | true | external | 1687761381292 | 1698939123284 | {"dark_mode": false, "last_session_ended_timestamp": 1698804571653, "last_session_start_timestamp": 1698804562, "sessionInfo": {}}
(2 rows)
All three pods now return the same two users, so their data is synchronized again.