Repairing a Degraded Node on OCP 4.x

7 min read · Jun 26, 2023

This article is a personal experience that I want to share. On OCP, one or two nodes can end up reporting a different version on their own.

This is the condition I ran into.

[root@bastion ~]# oc get no
NAME                                     STATUS                     ROLES          AGE    VERSION
csworker-1.devqa-kcln.ocp.hq.example.co.id   Ready                      worker         172d   v1.23.12+a57ef08
csworker-2.devqa-kcln.ocp.hq.example.co.id   Ready                      worker         172d   v1.23.12+a57ef08
csworker-3.devqa-kcln.ocp.hq.example.co.id   Ready                      worker         172d   v1.23.12+a57ef08
infra-1.devqa-kcln.ocp.hq.example.co.id      Ready                      infra,worker   172d   v1.23.12+a57ef08
infra-2.devqa-kcln.ocp.hq.example.co.id      Ready                      infra,worker   172d   v1.23.12+a57ef08
infra-3.devqa-kcln.ocp.hq.example.co.id      Ready                      infra,worker   172d   v1.23.12+a57ef08
master-1.devqa-kcln.ocp.hq.example.co.id     Ready                      master         173d   v1.23.12+a57ef08
master-2.devqa-kcln.ocp.hq.example.co.id     Ready                      master         173d   v1.23.12+a57ef08
master-3.devqa-kcln.ocp.hq.example.co.id     Ready                      master         173d   v1.23.12+a57ef08
odf-1.devqa-kcln.ocp.hq.example.co.id        Ready                      infra,worker   53d    v1.23.12+a57ef08
odf-2.devqa-kcln.ocp.hq.example.co.id        Ready                      infra,worker   172d   v1.23.12+a57ef08
odf-3.devqa-kcln.ocp.hq.example.co.id        Ready,SchedulingDisabled   infra,worker   172d   v1.23.5+8471591
worker-1.devqa-kcln.ocp.hq.example.co.id     Ready                      worker         172d   v1.23.12+a57ef08
worker-2.devqa-kcln.ocp.hq.example.co.id     Ready                      worker         172d   v1.23.12+a57ef08

[root@bastion ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.55   True        False         15d     Error while reconciling 4.10.55: the cluster operator machine-config is degraded

Every node should be reporting v1.23.12+a57ef08, but one node, odf-3, reports v1.23.5+8471591. I still cannot say for certain why one or more nodes change version on their own.
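On a larger cluster the odd node is harder to spot by eye. Below is a minimal sketch, assuming the default table layout of `oc get nodes` where the last column is VERSION. The function name `odd_version_nodes` is my own, not part of oc, and on a tie between versions the "majority" it picks is arbitrary.

```shell
# Print every node whose VERSION differs from the most common version.
# Reads the table printed by `oc get nodes` on stdin.
odd_version_nodes() {
  awk 'BEGIN { best = 0 }
       NR > 1 {
         version[$1] = $NF      # last column is the kubelet version
         count[$NF]++
       }
       END {
         for (v in count) if (count[v] > best) { best = count[v]; majority = v }
         for (n in version) if (version[n] != majority) print n, version[n]
       }'
}

# Usage (assumes a logged-in oc session):
#   oc get nodes | odd_version_nodes
```

With the node list shown above, this should print only the odf-3 line together with its version.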

In my experience, the version change happens right after a node is cordoned. I do not yet know why the system picks a version other than the expected one, and I am still looking into it.

Besides the wrong version, the affected node is Ready but SchedulingDisabled. I cordoned and uncordoned it several times to bring it back to its previous state, but cordon and uncordon alone were not enough. As a result, the cluster version check reports an error in its status.

After searching for a fix and reading several solutions online, I decided to experiment and found a pattern that can be applied, based on the solutions below.

https://access.redhat.com/solutions/6801291
https://access.redhat.com/solutions/3562201
https://access.redhat.com/solutions/5598401

The idea behind the fix is to change the configuration at the node level: I reset /usr/local/bin/configure-ovs.sh and /etc/machine-config-daemon/currentconfig to their initial state (as they were right after installation).

Read more about Open vSwitch.

First, create a debug pod on the specific node to be repaired. This is needed to change configuration at the lower, node level.

Here we delete configure-ovs.sh and currentconfig on the broken node. Open two terminals: the first runs the debug session on the node, and the second follows the logs from the node and the debug session.

Before starting, find out which machine-config-daemon pod serves the broken node by listing the pods in the openshift-machine-config-operator namespace.

[root@bastion ~]# oc get pods -n openshift-machine-config-operator -o wide
NAME                                         READY   STATUS    RESTARTS   AGE   IP               NODE                                     NOMINATED NODE   READINESS GATES
machine-config-controller-78dd7465d5-k2gmb   1/1     Running   0          18d   10.128.0.36      master-1.devqa-kcln.ocp.hq.example.co.id     <none>           <none>
machine-config-daemon-2w5wq                  2/2     Running   2          18d   192.168.178.9    csworker-2.devqa-kcln.ocp.hq.example.co.id   <none>           <none>
machine-config-daemon-5d9m9                  2/2     Running   2          18d   192.168.178.5    infra-1.devqa-kcln.ocp.hq.example.co.id      <none>           <none>
machine-config-daemon-6qj5b                  2/2     Running   2          18d   192.168.178.2    master-2.devqa-kcln.ocp.hq.example.co.id     <none>           <none>
machine-config-daemon-6xtnc                  2/2     Running   2          18d   192.168.178.3    master-3.devqa-kcln.ocp.hq.example.co.id     <none>           <none>
machine-config-daemon-hpgqz                  2/2     Running   2          18d   192.168.178.12   worker-2.devqa-kcln.ocp.hq.example.co.id     <none>           <none>
machine-config-daemon-j9g9k                  2/2     Running   2          18d   192.168.178.1    master-1.devqa-kcln.ocp.hq.example.co.id     <none>           <none>
machine-config-daemon-ktx99                  2/2     Running   2          18d   192.168.178.7    infra-3.devqa-kcln.ocp.hq.example.co.id      <none>           <none>
machine-config-daemon-mbf7t                  2/2     Running   4          18d   192.168.178.13   odf-1.devqa-kcln.ocp.hq.example.co.id        <none>           <none>
machine-config-daemon-mjgsg                  2/2     Running   2          18d   192.168.178.10   csworker-3.devqa-kcln.ocp.hq.example.co.id   <none>           <none>
machine-config-daemon-mpf5g                  2/2     Running   4          18d   192.168.178.15   odf-3.devqa-kcln.ocp.hq.example.co.id        <none>           <none>
machine-config-daemon-mw49m                  2/2     Running   2          18d   192.168.178.6    infra-2.devqa-kcln.ocp.hq.example.co.id      <none>           <none>
machine-config-daemon-p6khd                  2/2     Running   4          18d   192.168.178.14   odf-2.devqa-kcln.ocp.hq.example.co.id        <none>           <none>
machine-config-daemon-r2587                  2/2     Running   2          18d   192.168.178.11   worker-1.devqa-kcln.ocp.hq.example.co.id     <none>           <none>
machine-config-daemon-s9thr                  2/2     Running   2          18d   192.168.178.8    csworker-1.devqa-kcln.ocp.hq.example.co.id   <none>           <none>
machine-config-operator-5ffd6f8f8-2jrnm      1/1     Running   0          18d   10.130.0.33      master-3.devqa-kcln.ocp.hq.example.co.id     <none>           <none>
machine-config-server-t9gbt                  1/1     Running   1          18d   192.168.178.2    master-2.devqa-kcln.ocp.hq.example.co.id     <none>           <none>
machine-config-server-vdxpt                  1/1     Running   1          18d   192.168.178.3    master-3.devqa-kcln.ocp.hq.example.co.id     <none>           <none>
machine-config-server-zzr5s                  1/1     Running   1          18d   192.168.178.1    master-1.devqa-kcln.ocp.hq.example.co.id     <none>           <none>

From the output above, the machine-config-daemon pod serving odf-3 is machine-config-daemon-mpf5g.
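Reading the pod name off the table works, but it can also be picked out with a small filter. This is a sketch only: `mcd_pod_for_node` is a name I made up, and it assumes the `-o wide` output format shown above. As far as I know, `oc get pods -n openshift-machine-config-operator --field-selector spec.nodeName=<node>` narrows the list on the server side as well.

```shell
# Print the machine-config-daemon pod scheduled on the given node.
# $1 = node name. Reads `oc get pods ... -o wide` output on stdin.
mcd_pod_for_node() {
  awk -v node="$1" '
    $1 ~ /^machine-config-daemon-/ {
      # Scan every column so extra RESTARTS text cannot shift the match.
      for (i = 2; i <= NF; i++) if ($i == node) { print $1; break }
    }'
}

# Usage (assumes a logged-in oc session):
#   oc get pods -n openshift-machine-config-operator -o wide \
#     | mcd_pod_for_node odf-3.devqa-kcln.ocp.hq.example.co.id
```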

Now that we know which daemon to repair, create the debug pod for odf-3.

Terminal 1

[root@bastion ~]# oc debug node/odf-3.devqa-kcln.ocp.hq.example.co.id
Starting pod/odf-3devqa-kclnocphqexamplecoid-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.178.15
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# rm /usr/local/bin/configure-ovs.sh
removed '/usr/local/bin/configure-ovs.sh'
sh-4.4# rm /etc/machine-config-daemon/currentconfig
removed '/etc/machine-config-daemon/currentconfig'
sh-4.4# touch /run/machine-config-daemon-force
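I deleted both files outright. A more cautious variant, which I did not use in the session above, is to move them aside first so they can be restored if the daemon does not recover. The sketch below assumes that approach. The `set_aside` helper and the /var/tmp/mcd-backup path are hypothetical.

```shell
# Move a file into a backup directory (with a timestamp suffix)
# instead of deleting it.
# $1 = file to set aside, $2 = backup directory (created if missing).
set_aside() {
  src=$1
  dest_dir=$2
  [ -e "$src" ] || { echo "skip: $src not found" >&2; return 1; }
  mkdir -p "$dest_dir" || return 1
  mv -- "$src" "$dest_dir/$(basename "$src").$(date +%Y%m%d%H%M%S)"
}

# Usage inside the debug pod after `chroot /host` (backup path is hypothetical):
#   set_aside /usr/local/bin/configure-ovs.sh /var/tmp/mcd-backup
#   set_aside /etc/machine-config-daemon/currentconfig /var/tmp/mcd-backup
#   touch /run/machine-config-daemon-force
```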

Right after running touch /run/machine-config-daemon-force, open the second terminal and check the logs of the machine-config-daemon-mpf5g pod.

Terminal 2

[root@bastion ~]# oc logs machine-config-daemon-mpf5g -n openshift-machine-config-operator -c machine-config-daemon | tail -3
I0626 12:26:38.125507    2161 update.go:464] Setting post config change action to postConfigChangeActionReboot; /run/machine-config-daemon-force present
I0626 12:26:38.125570    2161 drain.go:44] Initiating cordon on node (currently schedulable: true)
I0626 12:26:38.139815    2161 drain.go:62] RunCordonOrUncordon() succeeded but node is still not in cordon state, retrying

Wait a moment, then check the logs again in terminal 2.

Terminal 2

[root@bastion ~]# oc logs machine-config-daemon-mpf5g -n openshift-machine-config-operator -c machine-config-daemon | tail -3
I0626 12:26:49.026327    2161 daemon.go:335] evicting pod openshift-storage/rook-ceph-osd-2-6497fb546-6scpn
I0626 12:26:49.026397    2161 daemon.go:335] evicting pod openshift-storage/rook-ceph-mon-d-56bfd54585-95dtm
E0626 12:26:49.039924    2161 daemon.go:335] error when evicting pods/"rook-ceph-osd-2-6497fb546-6scpn" -n "openshift-storage" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

Whenever the message “error when evicting pods” appears, force a drain of the affected node. The drain empties the node: before maintenance, every pod scheduled on the broken node has to be moved to another node in the same cluster.
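Before forcing the drain, it helps to know exactly which pods are blocked, since `--disable-eviction` deletes pods directly and bypasses the PodDisruptionBudget check. The filter below is a sketch that assumes the log wording shown above. `blocked_evictions` is a name of my own choosing. The budgets themselves can be inspected with `oc get pdb -n openshift-storage`.

```shell
# Print the unique namespace/pod pairs whose eviction was rejected.
# Reads machine-config-daemon log lines on stdin.
blocked_evictions() {
  sed -n 's/.*error when evicting pods\/"\([^"]*\)" -n "\([^"]*\)".*/\2\/\1/p' | sort -u
}

# Usage (assumes a logged-in oc session):
#   oc logs machine-config-daemon-mpf5g -n openshift-machine-config-operator \
#     -c machine-config-daemon | blocked_evictions
```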

Terminal 1

[root@bastion ~]# oc adm drain odf-3.devqa-kcln.ocp.hq.example.co.id  --ignore-daemonsets=true --force=true --grace-period=0 --disable-eviction=true --delete-emptydir-data
node/odf-3.devqa-kcln.ocp.hq.example.co.id already cordoned
WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-8c7cp, openshift-dns/node-resolver-lsqbz, openshift-image-registry/node-ca-5fgdb, openshift-local-storage/diskmaker-discovery-wc2zf, openshift-local-storage/diskmaker-manager-25lmm, openshift-machine-config-operator/machine-config-daemon-mpf5g, openshift-monitoring/node-exporter-5p2zr, openshift-multus/multus-additional-cni-plugins-d44nm, openshift-multus/multus-sms49, openshift-multus/network-metrics-daemon-hc5qh, openshift-network-diagnostics/network-check-target-mqln4, openshift-sdn/sdn-85ngh, openshift-storage/csi-cephfsplugin-2kr6w, openshift-storage/csi-rbdplugin-5nd6p
pod/rook-ceph-osd-2-6497fb546-6scpn deleted
node/odf-3.devqa-kcln.ocp.hq.example.co.id drained

After the drain, check the machine-config-daemon logs for odf-3.

Terminal 2

[root@bastion ~]# oc logs machine-config-daemon-mpf5g -n openshift-machine-config-operator -c machine-config-daemon | tail -3
I0626 12:28:20.704848    2095 update.go:1965] Update completed for config rendered-worker-e7d04c3556eefa4d851aebcc332c93e8 and node has been successfully uncordoned
I0626 12:28:20.737672    2095 daemon.go:1278] In desired config rendered-worker-e7d04c3556eefa4d851aebcc332c93e8
I0626 12:28:20.757042    2095 config_drift_monitor.go:240] Config Drift Monitor started
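Instead of re-running the log command by hand, the check can be looped until the daemon reports completion. This is a rough sketch that assumes the "Update completed for config" wording seen in the log above. `completed_config` is a hypothetical helper, and the 10-second interval is arbitrary.

```shell
# Print the rendered config name from the last "Update completed" log line.
# Returns non-zero when the daemon has not reported completion yet.
completed_config() {
  sed -n 's/.*Update completed for config \([^ ]*\) and node has been successfully uncordoned.*/\1/p' \
    | tail -n 1 | grep .
}

# Usage (assumes a logged-in oc session):
#   until oc logs machine-config-daemon-mpf5g -n openshift-machine-config-operator \
#           -c machine-config-daemon | completed_config; do
#     sleep 10
#   done
```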

Wait a few minutes, then check the state of the cluster.

Terminal 1

[root@bastion ~]# oc get no
NAME                                     STATUS   ROLES          AGE    VERSION
csworker-1.devqa-kcln.ocp.hq.example.co.id   Ready    worker         173d   v1.23.12+a57ef08
csworker-2.devqa-kcln.ocp.hq.example.co.id   Ready    worker         173d   v1.23.12+a57ef08
csworker-3.devqa-kcln.ocp.hq.example.co.id   Ready    worker         173d   v1.23.12+a57ef08
infra-1.devqa-kcln.ocp.hq.example.co.id      Ready    infra,worker   172d   v1.23.12+a57ef08
infra-2.devqa-kcln.ocp.hq.example.co.id      Ready    infra,worker   172d   v1.23.12+a57ef08
infra-3.devqa-kcln.ocp.hq.example.co.id      Ready    infra,worker   172d   v1.23.12+a57ef08
master-1.devqa-kcln.ocp.hq.example.co.id     Ready    master         173d   v1.23.12+a57ef08
master-2.devqa-kcln.ocp.hq.example.co.id     Ready    master         173d   v1.23.12+a57ef08
master-3.devqa-kcln.ocp.hq.example.co.id     Ready    master         173d   v1.23.12+a57ef08
odf-1.devqa-kcln.ocp.hq.example.co.id        Ready    infra,worker   53d    v1.23.12+a57ef08
odf-2.devqa-kcln.ocp.hq.example.co.id        Ready    infra,worker   172d   v1.23.12+a57ef08
odf-3.devqa-kcln.ocp.hq.example.co.id        Ready    infra,worker   172d   v1.23.12+a57ef08
worker-1.devqa-kcln.ocp.hq.example.co.id     Ready    worker         173d   v1.23.12+a57ef08
worker-2.devqa-kcln.ocp.hq.example.co.id     Ready    worker         173d   v1.23.12+a57ef08
[root@bastion ~]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.55   True        False         False      36h
baremetal                                  4.10.55   True        False         False      173d
cloud-controller-manager                   4.10.55   True        False         False      173d
cloud-credential                           4.10.55   True        False         False      173d
cluster-autoscaler                         4.10.55   True        False         False      173d
config-operator                            4.10.55   True        False         False      173d
console                                    4.10.55   True        False         False      18d
csi-snapshot-controller                    4.10.55   True        False         False      173d
dns                                        4.10.55   True        False         False      173d
etcd                                       4.10.55   True        False         False      173d
image-registry                             4.10.55   True        False         False      173d
ingress                                    4.10.55   True        False         False      18d
insights                                   4.10.55   True        False         False      173d
kube-apiserver                             4.10.55   True        False         False      173d
kube-controller-manager                    4.10.55   True        False         False      173d
kube-scheduler                             4.10.55   True        False         False      173d
kube-storage-version-migrator              4.10.55   True        False         False      18d
machine-api                                4.10.55   True        False         False      173d
machine-approver                           4.10.55   True        False         False      173d
machine-config                             4.10.55   True        False         False      16d
marketplace                                4.10.55   True        False         False      173d
monitoring                                 4.10.55   True        False         False      18d
network                                    4.10.55   True        False         False      173d
node-tuning                                4.10.55   True        False         False      89m
openshift-apiserver                        4.10.55   True        False         False      18d
openshift-controller-manager               4.10.55   True        False         False      2d10h
openshift-samples                          4.10.55   True        False         False      18d
operator-lifecycle-manager                 4.10.55   True        False         False      173d
operator-lifecycle-manager-catalog         4.10.55   True        False         False      173d
operator-lifecycle-manager-packageserver   4.10.55   True        False         False      172d
service-ca                                 4.10.55   True        False         False      173d
storage                                    4.10.55   True        False         False      18d
[root@bastion ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.55   True        False         18d     Cluster version is 4.10.55

We can see the cluster is back to normal. The solution looks simple, but it does not feel that simple when you are the one facing the problem.
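Scanning more than thirty operator rows by eye is error-prone, so the final check can be reduced to a filter. The sketch below assumes the column order NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED printed by `oc get co`, with every operator reporting a version. `unhealthy_operators` is a name I chose for illustration.

```shell
# Print operators that are not Available=True, Progressing=False, Degraded=False.
# Reads the table printed by `oc get co` on stdin.
unhealthy_operators() {
  awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}

# Usage (assumes a logged-in oc session):
#   oc get co | unhealthy_operators
```

Empty output means every operator is in the expected state.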

We still need to gather logs and debug the activity inside the cluster to find whatever deviates from the expected behaviour.


Written by Danang Priabada
