How to solve the ceph-mgr error “1 daemons have recently crashed”

Danang Priabada
4 min read · Jun 26, 2024


The error message "1 daemons have recently crashed" usually indicates that one of the background processes (or daemons) related to a software application, cluster, or system service has unexpectedly stopped working. This can happen for various reasons depending on the context.

Sometimes a network glitch interrupts cluster connectivity, and the impact reaches the cluster itself. In this case, the storage cluster reported a crash after more than 15 minutes of disruption.

The OCP cluster then sends a warning message notifying us to check the Ceph status.

As a first step, we need to check the Ceph status; from the Ceph console we can find detailed information about the warning message.

Run the following from the Bastion server:

oc rsh -n openshift-storage $(oc get pods -n openshift-storage -o name -l app=rook-ceph-operator)

From the shell, export the OpenShift storage configuration:

sh-5.1$ export CEPH_ARGS='-c /var/lib/rook/openshift-storage/openshift-storage.config'

Then execute the ceph status command:

sh-5.1$ ceph -s
  cluster:
    id:     85062931-0052-402c-a2d3-515ccd876ad4
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum b,d,e (age 9h)
    mgr: a(active, since 4d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 4w), 3 in (since 3M)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 201 pgs
    objects: 7.69k objects, 26 GiB
    usage:   75 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     201 active+clean

  io:
    client: 853 B/s rd, 186 KiB/s wr, 1 op/s rd, 13 op/s wr

sh-5.1$

As the status shows, there is a warning message: 1 daemons have recently crashed. This means one or more Ceph daemons crashed recently, and the crash has not yet been archived (acknowledged) by the administrator. This may indicate a software bug, a hardware problem (e.g., a failing disk), or some other issue.

New crashes can be listed with:

sh-5.1$ ceph crash ls-new
ID ENTITY NEW
2024-06-21T17:07:42.188344Z_e733e959-7cf6-4958-8b10-8393c972929c client.ceph-exporter *
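The crash ID is long and easy to mistype, so it helps to capture it in a variable for the follow-up commands. A minimal sketch, assuming only the newest crash is of interest (the CRASH_ID variable name is just an illustration):

```shell
# Grab the first crash ID from `ceph crash ls-new`, skipping the header row.
# CRASH_ID is a hypothetical helper for the `ceph crash info`/`archive` steps below.
CRASH_ID=$(ceph crash ls-new | awk 'NR>1 {print $1; exit}')
echo "$CRASH_ID"
```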

Information about a specific crash can be examined with:

sh-5.1$ ceph crash info <crash-id>
sh-5.1$ ceph crash info 2024-06-21T17:07:42.188344Z_e733e959-7cf6-4958-8b10-8393c972929c
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7faeab75bdb0]",
        "/lib64/libc.so.6(+0xa154c) [0x7faeab7a854c]",
        "raise()",
        "abort()",
        "/lib64/libstdc++.so.6(+0xa1a01) [0x7faeab9cca01]",
        "/lib64/libstdc++.so.6(+0xad37c) [0x7faeab9d837c]",
        "/lib64/libstdc++.so.6(+0xad3e7) [0x7faeab9d83e7]",
        "/lib64/libstdc++.so.6(+0xad649) [0x7faeab9d8649]",
        "ceph-exporter(+0x2976d) [0x55b28344c76d]",
        "(boost::json::detail::throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x55b283460297]",
        "ceph-exporter(+0x65a37) [0x55b283488a37]",
        "(DaemonMetricCollector::dump_asok_metrics()+0x1ea3) [0x55b283468ad3]",
        "ceph-exporter(+0x45ef0) [0x55b283468ef0]",
        "ceph-exporter(+0x5cbfd) [0x55b28347fbfd]",
        "ceph-exporter(+0xab7ff) [0x55b2834ce7ff]",
        "(DaemonMetricCollector::main()+0x212) [0x55b283452c22]",
        "main()",
        "/lib64/libc.so.6(+0x3feb0) [0x7faeab746eb0]",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "17.2.6-209.el9cp",
    "crash_id": "2024-06-21T17:07:42.188344Z_e733e959-7cf6-4958-8b10-8393c972929c",
    "entity_name": "client.ceph-exporter",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-exporter",
    "stack_sig": "03972c98be910d1ce25645fdd11917d43497d8e45963b63cf072b005e7daee44",
    "timestamp": "2024-06-21T17:07:42.188344Z",
    "utsname_hostname": "rook-ceph-exporter-odf-2.drc.ocp.bankabc.co.id-68495947df-fqbr7",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.55.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024"
}
sh-5.1$
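The crash report is plain JSON, so individual fields can be pulled out with standard tools when triaging. A sketch using sed (jq would work just as well if it is available in the pod):

```shell
# Extract a single field (here process_name) from the `ceph crash info` JSON
# of the newest crash, to see at a glance which daemon failed.
ceph crash info "$(ceph crash ls-new | awk 'NR>1 {print $1; exit}')" \
  | sed -n 's/.*"process_name": "\([^"]*\)".*/\1/p'
```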

This warning can be silenced by “archiving” the crash (perhaps after being examined by an administrator) so that it does not generate this warning:

sh-5.1$ ceph crash archive <crash-id>
sh-5.1$ ceph crash archive 2024-06-21T17:07:42.188344Z_e733e959-7cf6-4958-8b10-8393c972929c
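If several daemons crashed, the inspect-then-archive step can be looped over every crash still listed as new, instead of copying IDs one by one. A sketch, assuming each crash is reviewed before being acknowledged:

```shell
# Review and archive every crash still reported by `ceph crash ls-new`.
for id in $(ceph crash ls-new | awk 'NR>1 {print $1}'); do
  ceph crash info "$id"      # inspect the report before acknowledging
  ceph crash archive "$id"   # remove it from the HEALTH_WARN count
done
```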

Check the current status; the cluster is back to HEALTH_OK.

sh-5.1$ ceph -s
  cluster:
    id:     85062931-0052-402c-a2d3-515ccd876ad4
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,d,e (age 9h)
    mgr: a(active, since 4d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 4w), 3 in (since 3M)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 201 pgs
    objects: 7.69k objects, 26 GiB
    usage:   75 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     201 active+clean

  io:
    client: 1.1 KiB/s rd, 118 KiB/s wr, 2 op/s rd, 11 op/s wr

sh-5.1$

Similarly, all new crashes can be archived with:

ceph crash archive-all
  • Archived crashes will still be visible via ceph crash ls but not ceph crash ls-new.
  • The time period for what “recent” means is controlled by the option mgr/crash/warn_recent_interval (default: two weeks).
  • These warnings can be disabled entirely with:
ceph config set mgr mgr/crash/warn_recent_interval 0
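Rather than disabling the warning outright, the window can be inspected and tuned; ceph config get reads a setting the same way ceph config set writes it. A sketch (the one-day value of 86400 seconds is only an illustration):

```shell
# Show the current "recent" window in seconds (default: two weeks)
ceph config get mgr mgr/crash/warn_recent_interval

# Shorten the window to one day instead of disabling the warning entirely
ceph config set mgr mgr/crash/warn_recent_interval 86400
```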

Now you can check the storage cluster status in the console; the storage is back to Normal.

Reference: https://access.redhat.com/solutions/5506031 (this article adds more detailed steps)
