OCP 4 Node not ready after cluster upgrade or node restart

Danang Priabada
2 min read · May 22, 2024


Issue

  • Node is in the NotReady state after the cluster was upgraded
  • Node is not becoming ready after node reboot
  • Container runtime (crio) on the node is not working properly
  • Unable to get a debug shell on the node using oc debug node/<node-name> because container runtime (crio) is not working
  • Cannot generate sosreport from the node because container runtime (crio) is not working

Resolution

  • The container runtime needs to be manually cleaned up and restarted.
    (Note: The following steps will delete all ephemeral container runtime storage, including cached container images.)
  1. Cordon the node (to prevent workloads from being scheduled on it once it becomes ready) and then drain it.
# oc adm cordon node1.example.com

# oc adm drain node1.example.com \
--force=true --ignore-daemonsets --delete-emptydir-data --timeout=60s

Note: older versions of oc use --delete-local-data instead of --delete-emptydir-data.

2. Reboot the node and wait for it to come back, then check the node status again (for example with oc get nodes). If it is still NotReady, continue with the remaining steps.

# ssh core@node1.example.com
# sudo -i
# systemctl reboot
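While waiting for the node to come back, readiness can be polled from a workstation with cluster access. A minimal sketch, assuming oc is logged in with sufficient privileges; the wait_ready helper name is hypothetical, not an oc subcommand:

```shell
# Hypothetical helper: poll a node's Ready condition via `oc`.
wait_ready() {
  local node=$1 tries=${2:-30} i status
  for ((i = 0; i < tries; i++)); do
    # The jsonpath expression pulls the status of the node's Ready
    # condition, which is "True" when the node is healthy.
    status=$(oc get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    [ "$status" = "True" ] && return 0
    sleep 10   # wait before polling again
  done
  return 1     # node never became Ready within the given attempts
}

# Usage:
# wait_ready node1.example.com
```

The built-in equivalent is oc wait --for=condition=Ready node/node1.example.com --timeout=300s.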

3. SSH into the node and become the root user

  • SSH with the key pair provided during the install:
# ssh core@node1.example.com
# sudo -i
  • Alternatively, log in from the console (on bare metal or via the vSphere admin console) if a password for the core user was set

4. Stop the kubelet service

# systemctl stop kubelet

5. Try to manually stop and remove any running pods and containers using the following commands:

# crictl stopp `crictl pods -q`        ##  "stopp" with two "p" stops pods
# crictl stop `crictl ps -aq`          ##  stop any remaining containers
# crictl rmp `crictl pods -q`          ##  remove the stopped pods
# crictl rmp --force `crictl pods -q`  ##  force-remove any pods that are left
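If pods refuse to stop on the first pass, it can help to retry a few times before resorting to --force. A small sketch of such a loop as a generic shell function; the retry name is hypothetical, not part of crictl:

```shell
# Hypothetical retry helper: run a command up to N times,
# pausing briefly between attempts.
retry() {
  local tries=$1 i
  shift
  for ((i = 1; i <= tries; i++)); do
    "$@" && return 0   # success: stop retrying
    sleep 1            # brief pause before the next attempt
  done
  return 1             # all attempts failed
}

# Example (assumes crictl is on PATH and crio is still responding):
# retry 3 sh -c 'crictl stopp $(crictl pods -q)'
```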

6. Stop the crio service:

# systemctl stop crio

Clear the container runtime ephemeral storage:

# rm -rf /var/lib/containers/*
# crio wipe -f

Start the crio and kubelet services:

# systemctl start crio
# systemctl start kubelet
  • If the cleanup worked as expected and the crio and kubelet services started successfully, the node should become Ready.
  • Before marking the node schedulable, collect a sosreport from the node to investigate the root cause.
  • Mark the node schedulable:
# oc adm uncordon node1.example.com

7. Make sure the node is back to normal:

$ oc adm top nodes
NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1     297m         29%    4263Mi          55%
node-0     55m          5%     1201Mi          15%
infra-1    85m          8%     1319Mi          17%
infra-0    182m         18%    2524Mi          32%
master-0   178m         8%     2584Mi          16%

If the node is healthy, it should report CPU and memory activity.
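The Ready check can also be scripted. A minimal sketch that scans the STATUS column of oc get nodes --no-headers output; the all_nodes_ready function name is hypothetical:

```shell
# Hypothetical check: read `oc get nodes --no-headers` output on stdin and
# fail if any node's STATUS column (field 2) is not exactly "Ready".
# Note: a cordoned node shows "Ready,SchedulingDisabled" and would also fail.
all_nodes_ready() {
  awk '$2 != "Ready" { bad = 1 } END { exit bad }'
}

# Usage:
# oc get nodes --no-headers | all_nodes_ready && echo "all nodes Ready"
```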

Resources:

OCP 4 Node not ready after cluster upgrade or node restart — Red Hat Customer Portal

Verifying node health — Troubleshooting | Support | OpenShift Container Platform 4.15

Troubleshooting CRI-O container runtime issues — Troubleshooting | Support | OpenShift Container Platform 4.15

Written by Danang Priabada

Red Hat and IBM Product Specialist | JPN : プリアバダ ダナン | CHN : 逹男 | linktr.ee/danangpriabada
