OCP 4 Node not ready after cluster upgrade or node restart
Issue
- Node is in the NotReady state after the cluster was upgraded
- Node is not becoming ready after node reboot
- Container runtime (crio) on the node is not working properly
- Unable to get a debug shell on the node using oc debug node/<node-name> because the container runtime (crio) is not working
- Cannot generate an sosreport from the node because the container runtime (crio) is not working
Resolution
- The container runtime needs to be manually cleaned up and restarted.
(Note: The following steps delete the container runtime's ephemeral storage, including all container images and container state stored on the node.)
1. Cordon the node (to prevent workloads from being scheduled on it once it becomes ready again) and then drain it.
# oc adm cordon node1.example.com
# oc adm drain node1.example.com \
--force=true --ignore-daemonsets --delete-emptydir-data --timeout=60s
Note: Older versions use --delete-local-data instead of --delete-emptydir-data.
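To confirm the node is now unschedulable, check it from a host with cluster access; a cordoned node shows SchedulingDisabled in its STATUS column (the node name follows the example above):
# oc get node node1.example.com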
2. Reboot the node and wait for it to come back. Observe the node status again.
# ssh core@node1.example.com
# sudo -i
# systemctl reboot
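While waiting, the node status can be watched from a workstation with cluster access (the -w flag streams updates; the node name follows the example above):
# oc get node node1.example.com -w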
3. SSH into the node and become the root user
- SSH using the key provided during the install:
# ssh core@node1.example.com
# sudo -i
- Alternatively, if a password was set for the core user, log in from the console on bare metal or from the vSphere admin console.
4. Stop the kubelet service
# systemctl stop kubelet
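Optionally, confirm that kubelet is no longer running; systemctl should report the unit as inactive:
# systemctl is-active kubelet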
5. Try manually stopping and removing any running containers/pods using the following commands:
# crictl stopp `crictl pods -q` ## "stopp" with two "p" for stopping pods
# crictl stop `crictl ps -aq`
# crictl rmp `crictl pods -q`
# crictl rmp --force `crictl pods -q`
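If the commands succeeded, listing pods and containers should return no output:
# crictl pods -q
# crictl ps -aq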
6. Stop the crio service:
# systemctl stop crio
Clear the container runtime's ephemeral storage:
# rm -rf /var/lib/containers/*
# crio wipe -f
Start the crio and kubelet services:
# systemctl start crio
# systemctl start kubelet
- If the cleanup worked as expected and the crio and kubelet services started successfully, the node should become ready.
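To verify, check the node from a host with cluster access; because the node is still cordoned, its STATUS should show Ready,SchedulingDisabled (the node name follows the earlier example):
# oc get node node1.example.com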
- Before marking the node schedulable, collect an sosreport from the node to investigate the root cause.
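Still as root on the node, one common approach is to run sos from the toolbox container; the plugin options shown here are typical for gathering crio data and may differ between versions:
# toolbox
# sos report -k crio.all=on -k crio.logs=on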
- Mark the node schedulable
# oc adm uncordon <node1>
7. Make sure the node is back to normal:
$ oc adm top nodes
NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1     297m         29%    4263Mi          55%
node-0     55m          5%     1201Mi          15%
infra-1    85m          8%     1319Mi          17%
infra-0    182m         18%    2524Mi          32%
master-0   178m         8%     2584Mi          16%
If the node is healthy, it should report CPU and memory activity as shown above.
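The node conditions can also be reviewed to confirm the kubelet is reporting Ready (the node name is an example):
$ oc describe node node1.example.com | grep -A 8 -i conditions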