OCP 4 Node not ready after cluster upgrade or node restart
Issue
- Node is in the NotReady state after the cluster was upgraded
- Node is not becoming ready after node reboot
- Container runtime (crio) on the node is not working properly
- Unable to get a debug shell on the node using oc debug node/<node-name> because the container runtime (crio) is not working
- Cannot generate an sosreport from the node because the container runtime (crio) is not working
Resolution
- The container runtime needs to be manually cleaned up and restarted.
(Note: The following steps delete the container runtime's ephemeral storage, including all container images and container state stored on the node.)
1. Cordon the node (to prevent workloads from being scheduled on it once it becomes ready again) and then drain it.
# oc adm cordon node1.example.com
# oc adm drain node1.example.com \
--force=true --ignore-daemonsets --delete-emptydir-data --timeout=60s
Note: Older versions use --delete-local-data instead of --delete-emptydir-data.
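To confirm the node is now unschedulable, check it from a host with cluster access; a cordoned node shows SchedulingDisabled in its STATUS column (the node name follows the example above):
# oc get node node1.example.com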
2. Reboot the node and wait for it to come back. Observe the node status again.
# ssh core@node1.example.com
# sudo -i
# systemctl reboot
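While waiting, the node status can be watched from a workstation with cluster access (the -w flag streams updates; the node name follows the example above):
# oc get node node1.example.com -w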
3. SSH into the node and become the root user
- SSH using the key provided during the install:
# ssh core@node1.example.com
# sudo -i
- Alternatively, if a password was set for the core user, log in from the console on bare metal or from the vSphere admin console.
4. Stop the kubelet service
# systemctl stop kubelet
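Optionally, confirm that kubelet is no longer running; systemctl should report the unit as inactive:
# systemctl is-active kubelet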
5. Try manually stopping and removing any running containers/pods using the following commands:
# crictl stopp `crictl pods -q` ## "stopp" with two "p" for stopping pods
# crictl stop `crictl ps -aq`
# crictl rmp `crictl pods -q`
# crictl rmp --force `crictl pods -q`
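If the commands succeeded, listing pods and containers should return no output:
# crictl pods -q
# crictl ps -aq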
6. Stop the crio service:
# systemctl stop crio
Clear the container runtime's ephemeral storage:
# rm -rf /var/lib/containers/*
# crio wipe -f
Start the crio and kubelet services:
# systemctl start crio
# systemctl start kubelet
- If the cleanup worked as expected and the crio and kubelet services started successfully, the node should become ready.
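To verify, check the node from a host with cluster access; because the node is still cordoned, its STATUS should show Ready,SchedulingDisabled (the node name follows the earlier example):
# oc get node node1.example.com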
- Before marking the node schedulable, collect an sosreport from the node to investigate the root cause.
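Still as root on the node, one common approach is to run sos from the toolbox container; the plugin options shown here are typical for gathering crio data and may differ between versions:
# toolbox
# sos report -k crio.all=on -k crio.logs=on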
- Mark the node schedulable
# oc adm uncordon <node1>
7. Make sure the node is back to normal:
$ oc adm top nodes
NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1     297m         29%    4263Mi          55%
node-0     55m          5%     1201Mi          15%
infra-1    85m          8%     1319Mi          17%
infra-0    182m         18%    2524Mi          32%
master-0   178m         8%     2584Mi          16%
If the node is healthy, it should report CPU and memory activity as shown above.
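The node conditions can also be reviewed to confirm the kubelet is reporting Ready (the node name is an example):
$ oc describe node node1.example.com | grep -A 8 -i conditions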