IBM API Connect: Crash Loop on the OS CronJob (oscron)
In this article, I want to share an error related to the oscron CronJob that impacts indexing in the analytics subsystem (Elasticsearch).
CronJob is meant for performing regular scheduled actions such as backups, report generation, and so on. One CronJob object is like one line of a crontab (cron table) file on a Unix system. It runs a Job periodically on a given schedule, written in Cron format.
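As a quick way to see which CronJobs exist in the API Connect namespace and how often they run, you can list them with oc or kubectl. This is only a sketch; the namespace and CronJob name below are the ones from this environment, so adjust them to match yours.

# list the CronJobs and their schedules (cron format) in the namespace
[root@bastion ~]# oc get cronjobs -n cp4i
# inspect the full spec of the oscron CronJob
[root@bastion ~]# oc get cronjob apiconnect-9ecaa918-oscron -n cp4i -o yaml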
This is the error that we found:
back-off 2m40s restarting failed container=oscron pod=apiconnect-9ecaa918-oscron-28686405-wtvd8_cp4i(599b02d7-5c94-497f-a7ae-5bdb421c55c7)
CrashLoopBackOff indicates that the application within the container is failing to start properly.
We found one oscron pod stuck in CrashLoopBackOff. At first we simply deleted the failing pod and its related Jobs, but the oscron Job kept being re-created again and again.
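This behaviour is expected: the failing pods belong to Jobs that are spawned by the oscron CronJob, so deleting a pod or Job only clears the way for the next scheduled run. If you want to stop new runs temporarily while investigating, the CronJob itself can be suspended. The sketch below is only a suggestion: the CronJob name is derived from the Job names above, and the API Connect operator may reconcile the CronJob and revert the change.

# temporarily stop the CronJob from creating new Jobs
[root@bastion ~]# kubectl patch cronjob apiconnect-9ecaa918-oscron -n cp4i -p '{"spec":{"suspend":true}}'
# resume scheduling once the investigation is done
[root@bastion ~]# kubectl patch cronjob apiconnect-9ecaa918-oscron -n cp4i -p '{"spec":{"suspend":false}}'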
Then we checked the failing pod, which was in Error state with repeated restarts, and looked at its logs:

# oc get po -o wide | grep oscron
apiconnect-9ecaa918-oscron-28688010-vzcp9 0/1 Error 4 (71s ago) 3m1s 12.123.1.123 csworker-2.dev.ocp.bankabc.co.id <none> <none>

# oc logs -f apiconnect-9ecaa918-oscron-28688010-vzcp9
Fetching current index
Making request to: https://apiconnect-9ecaa918-storage:9200/apic-api-w
file:///app/summary-management.js:175
const indexWithWriteAlias = filteredIndices.find((obj) => obj.aliases[OS_WRITE_INDEX].is_write_index);
^
TypeError: Cannot read properties of undefined (reading 'is_write_index')
at file:///app/summary-management.js:175:89
at Array.find (<anonymous>)
at getCurrentWriteIndex (file:///app/summary-management.js:175:47)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async processTransformJobs (file:///app/summary-management.js:428:24)
at async main (file:///app/summary-management.js:512:3)
Node.js v18.19.0
The log indicates that the pod cannot read the current write alias. From the indices output below we can see why: what should be the write alias, apic-api-w, has instead been created as a regular index:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open all-content-data 76o234u0SQeM-G3fi006cA 1 1 131 0 55.7kb 55.7kb
yellow open reports clozT4pQRMaIF68lZqmVtQ 1 1 0 0 208b 208b
green open .plugins-ml-config VG2wrqadTHKUQu8CzdKSUA 1 0 1 0 3.9kb 3.9kb
yellow open all-content-count A35Vk6dIQ-OL6RToONBKEA 1 1 1876 0 252.9kb 252.9kb
yellow open apic-api-w wFc_3JVvTdWMB_nS0en2iw 1 1 1644 0 4.3mb 4.3mb
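To double-check that apic-api-w really exists as a concrete index rather than as a write alias, you can query the alias catalog from inside the storage pod. This is a minimal check; the pod name and the client certificate paths are the same ones used later in Step 2 and may differ in your environment.

[root@bastion ~]# kubectl exec -it apiconnect-9ecaa918-storage-0 -n cp4i -- bash
bash-4.4$ export CURL_CMD="curl -sk --key /etc/velox/certs/client/tls.key --cert /etc/velox/certs/client/tls.crt https://localhost:9200"
bash-4.4$ # in a healthy deployment this grep should find apic-api-w listed as an alias
bash-4.4$ $CURL_CMD/_cat/aliases | grep apic-api-w
bash-4.4$ exit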
Action Plan
To recover from the issue, try the following steps in order:
Step 1: Scale ingestion down to 0 replicas
[root@bastion ~]# oc get sts -n cp4i
NAME READY AGE
apiconnect-9ecaa918-ingestion 1/1 132d
apiconnect-9ecaa918-storage 1/1 132d
apiconnect-cd20b410-cd20b410-db 1/1 132d
apiconnect-cd20b410-cd20b410-www 1/1 132d
apiconnect-cd20b410-nginx 1/1 132d
apiconnect-cefeb06f-natscluster 1/1 132d
apiconnect-development-gw 1/1 132d
[root@bastion ~]#
kubectl scale sts <ingestion-sts-name> -n cp4i --replicas=0
kubectl scale sts apiconnect-9ecaa918-ingestion -n cp4i --replicas=0
Wait until the StatefulSet shows 0/0:
[root@bastion ~]# oc get sts -n cp4i | grep ingestion
NAME READY AGE
apiconnect-9ecaa918-ingestion 0/0 132d
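If you prefer not to re-run oc get sts by hand, a small sketch like the one below polls until the ingestion StatefulSet reports zero replicas. The StatefulSet name is the one from this environment.

# poll until the ingestion StatefulSet has scaled down to zero replicas
[root@bastion ~]# while [ "$(oc get sts apiconnect-9ecaa918-ingestion -n cp4i -o jsonpath='{.status.replicas}')" != "0" ]; do sleep 5; done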
Step 2: Delete index apic-api-w
Warning: you are going to lose all of the analytics data held under the wrongly created index, i.e. apic-api-w.
[root@bastion ~]# oc get po -n cp4i | grep storage
apiconnect-9ecaa918-storage-0 1/1 Running 0 23d
[root@bastion ~]# kubectl exec -it apiconnect-9ecaa918-storage-0 -n cp4i -- bash
bash-4.4$ export CURL_CMD="curl -sk --key /etc/velox/certs/client/tls.key --cert /etc/velox/certs/client/tls.crt https://localhost:9200"
bash-4.4$ $CURL_CMD/apic-api-w -X DELETE
{"acknowledged":true}
bash-4.4$ exit
exit
Step 3: Re-run osinit
[root@bastion ~]# kubectl get jobs -n cp4i | grep osinit
apiconnect-9ecaa918-osinit 1/1 38s 132d
[root@bastion ~]# kubectl delete job apiconnect-9ecaa918-osinit -n cp4i
job.batch "apiconnect-9ecaa918-osinit" deleted
Step 4: Scale ingestion up to original replica count
[root@bastion ~]# kubectl get sts -n cp4i
NAME READY AGE
apiconnect-9ecaa918-ingestion 0/0 132d
apiconnect-9ecaa918-storage 1/1 132d
apiconnect-cd20b410-cd20b410-db 1/1 132d
apiconnect-cd20b410-cd20b410-www 1/1 132d
apiconnect-cd20b410-nginx 1/1 132d
apiconnect-cefeb06f-natscluster 1/1 132d
apiconnect-development-gw 1/1 132d
[root@bastion ~]# kubectl scale sts apiconnect-9ecaa918-ingestion -n cp4i --replicas=1
statefulset.apps/apiconnect-9ecaa918-ingestion scaled
After this, wait for 15 to 18 minutes and then check the status of your analytics.
The analytics subsystem should now be running properly again, but with empty data: in Step 2 we deleted the wrongly created index, so all of the data held under it is lost.
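To confirm that the write alias has been re-created correctly this time, exec into the storage pod again as in Step 2 and check the alias and index catalogs. The expectation, based on the error above, is that apic-api-w now appears in _cat/aliases with is_write_index set to true instead of showing up in _cat/indices as a standalone index:

bash-4.4$ # apic-api-w should now be an alias on a backing index, not an index itself
bash-4.4$ $CURL_CMD/_cat/aliases | grep apic-api-w
bash-4.4$ $CURL_CMD/_cat/indices | grep apic-api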
If the oscron pod is still not in a Completed state, try deleting the failed oscron Jobs (see the example after the listing below) and wait another 15 to 18 minutes.
[root@bastion ~]# oc get jobs
NAME COMPLETIONS DURATION AGE
apiconnect-36e322ff-configurator 1/1 10m 132d
apiconnect-9ecaa918-oscron-28693515 0/1 78m 78m
apiconnect-9ecaa918-oscron-28693530 0/1 63m 63m
apiconnect-9ecaa918-oscron-28693545 0/1 48m 48m
apiconnect-9ecaa918-oscron-28693560 0/1 33m 33m
apiconnect-9ecaa918-oscron-28693575 0/1 18m 18m
apiconnect-9ecaa918-oscron-28693590 1/1 23s 3m50s
apiconnect-9ecaa918-osinit 1/1 86s 6m3s
apiconnect-cefeb06f-analytics-push-28565415 0/1 89d 89d
apiconnect-cefeb06f-analytics-push-28693575 1/1 107s 18m
apiconnect-cefeb06f-up-apim-data-populate-0-to-605-f767074c 1/1 2m49s 132d
apiconnect-cefeb06f-up-apim-schema-0-to-605-f767074c 1/1 2m10s 132d
apiconnect-cefeb06f-up-lur-data-populate-0-to-103-f767074c 1/1 103s 132d
apiconnect-cefeb06f-up-lur-schema-0-to-103-f767074c 1/1 21s 132d
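As the listing shows, the oscron Jobs created while the write alias was broken remain failed (0/1), while the most recent run completes in seconds. If the latest oscron run still does not complete, the stuck Jobs can be deleted so that the CronJob schedules a fresh one; the Job name below is just one example from the listing above.

# delete a failed oscron Job (repeat for each stuck one); the CronJob will create a new run
[root@bastion ~]# oc delete job apiconnect-9ecaa918-oscron-28693515 -n cp4i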