How to solve the ceph-mgr error “1 daemons have recently crashed”

Danang Priabada
4 min read · Jun 26, 2024


The error message "1 daemons have recently crashed" usually indicates that one of the background processes (or daemons) related to a software application, cluster, or system service has unexpectedly stopped working. This can happen for various reasons depending on the context.

Sometimes a network glitch interrupts cluster connectivity, and the impact reaches the cluster itself. In this case, the storage cluster reported a crash after more than 15 minutes of disruption.

The OCP cluster then sends a warning message notifying us to check the Ceph status.

As a first step, we need to check the Ceph status; from the Ceph console we can find detailed information about the warning message.

Run the following from the Bastion server:

oc rsh -n openshift-storage $(oc get pods -n openshift-storage -o name -l app=rook-ceph-operator)

From the shell, export the OpenShift storage configuration:

sh-5.1$ export CEPH_ARGS='-c /var/lib/rook/openshift-storage/openshift-storage.config'

Then execute the ceph status command:

sh-5.1$ ceph -s
  cluster:
    id:     85062931-0052-402c-a2d3-515ccd876ad4
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum b,d,e (age 9h)
    mgr: a(active, since 4d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 4w), 3 in (since 3M)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 201 pgs
    objects: 7.69k objects, 26 GiB
    usage:   75 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     201 active+clean

  io:
    client: 853 B/s rd, 186 KiB/s wr, 1 op/s rd, 13 op/s wr

sh-5.1$

As the status shows, there is a warning message: 1 daemons have recently crashed. This means one or more Ceph daemons crashed recently, and the crash has not yet been archived (acknowledged) by the administrator. This may indicate a software bug, a hardware problem (e.g., a failing disk), or some other issue.

New crashes can be listed with:

sh-5.1$ ceph crash ls-new
ID ENTITY NEW
2024-06-21T17:07:42.188344Z_e733e959-7cf6-4958-8b10-8393c972929c client.ceph-exporter *
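The crash ID is long and easy to mistype, so it helps to capture it in a variable for the follow-up commands. A minimal sketch, assuming only the newest crash is of interest (the CRASH_ID variable name is just an illustration):

```shell
# Grab the first crash ID from `ceph crash ls-new`, skipping the header row.
# CRASH_ID is a hypothetical helper for the `ceph crash info`/`archive` steps below.
CRASH_ID=$(ceph crash ls-new | awk 'NR>1 {print $1; exit}')
echo "$CRASH_ID"
```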

Information about a specific crash can be examined with:

sh-5.1$ ceph crash info <crash-id>
sh-5.1$ ceph crash info 2024-06-21T17:07:42.188344Z_e733e959-7cf6-4958-8b10-8393c972929c
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7faeab75bdb0]",
        "/lib64/libc.so.6(+0xa154c) [0x7faeab7a854c]",
        "raise()",
        "abort()",
        "/lib64/libstdc++.so.6(+0xa1a01) [0x7faeab9cca01]",
        "/lib64/libstdc++.so.6(+0xad37c) [0x7faeab9d837c]",
        "/lib64/libstdc++.so.6(+0xad3e7) [0x7faeab9d83e7]",
        "/lib64/libstdc++.so.6(+0xad649) [0x7faeab9d8649]",
        "ceph-exporter(+0x2976d) [0x55b28344c76d]",
        "(boost::json::detail::throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x55b283460297]",
        "ceph-exporter(+0x65a37) [0x55b283488a37]",
        "(DaemonMetricCollector::dump_asok_metrics()+0x1ea3) [0x55b283468ad3]",
        "ceph-exporter(+0x45ef0) [0x55b283468ef0]",
        "ceph-exporter(+0x5cbfd) [0x55b28347fbfd]",
        "ceph-exporter(+0xab7ff) [0x55b2834ce7ff]",
        "(DaemonMetricCollector::main()+0x212) [0x55b283452c22]",
        "main()",
        "/lib64/libc.so.6(+0x3feb0) [0x7faeab746eb0]",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "17.2.6-209.el9cp",
    "crash_id": "2024-06-21T17:07:42.188344Z_e733e959-7cf6-4958-8b10-8393c972929c",
    "entity_name": "client.ceph-exporter",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-exporter",
    "stack_sig": "03972c98be910d1ce25645fdd11917d43497d8e45963b63cf072b005e7daee44",
    "timestamp": "2024-06-21T17:07:42.188344Z",
    "utsname_hostname": "rook-ceph-exporter-odf-2.drc.ocp.bankabc.co.id-68495947df-fqbr7",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.55.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024"
}
sh-5.1$
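The crash report is plain JSON, so individual fields can be pulled out with standard tools when triaging. A sketch using sed (jq would work just as well if it is available in the pod):

```shell
# Extract a single field (here process_name) from the `ceph crash info` JSON
# of the newest crash, to see at a glance which daemon failed.
ceph crash info "$(ceph crash ls-new | awk 'NR>1 {print $1; exit}')" \
  | sed -n 's/.*"process_name": "\([^"]*\)".*/\1/p'
```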

This warning can be silenced by “archiving” the crash (perhaps after being examined by an administrator) so that it does not generate this warning:

sh-5.1$ ceph crash archive <crash-id>
sh-5.1$ ceph crash archive 2024-06-21T17:07:42.188344Z_e733e959-7cf6-4958-8b10-8393c972929c
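If several daemons crashed, the inspect-then-archive step can be looped over every crash still listed as new, instead of copying IDs one by one. A sketch, assuming each crash is reviewed before being acknowledged:

```shell
# Review and archive every crash still reported by `ceph crash ls-new`.
for id in $(ceph crash ls-new | awk 'NR>1 {print $1}'); do
  ceph crash info "$id"      # inspect the report before acknowledging
  ceph crash archive "$id"   # remove it from the HEALTH_WARN count
done
```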

Check the current status; the cluster is back to HEALTH_OK.

sh-5.1$ ceph -s
  cluster:
    id:     85062931-0052-402c-a2d3-515ccd876ad4
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,d,e (age 9h)
    mgr: a(active, since 4d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 4w), 3 in (since 3M)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 201 pgs
    objects: 7.69k objects, 26 GiB
    usage:   75 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     201 active+clean

  io:
    client: 1.1 KiB/s rd, 118 KiB/s wr, 2 op/s rd, 11 op/s wr

sh-5.1$

Similarly, all new crashes can be archived with:

ceph crash archive-all
  • Archived crashes will still be visible via ceph crash ls but not ceph crash ls-new.
  • The time period for what “recent” means is controlled by the option mgr/crash/warn_recent_interval (default: two weeks).
  • These warnings can be disabled entirely with:
ceph config set mgr mgr/crash/warn_recent_interval 0
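Rather than disabling the warning outright, the window can be inspected and tuned; ceph config get reads a setting the same way ceph config set writes it. A sketch (the one-day value of 86400 seconds is only an illustration):

```shell
# Show the current "recent" window in seconds (default: two weeks)
ceph config get mgr mgr/crash/warn_recent_interval

# Shorten the window to one day instead of disabling the warning entirely
ceph config set mgr mgr/crash/warn_recent_interval 86400
```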

Now you can check the storage cluster status in the console; the storage is back to Normal.

Reference: https://access.redhat.com/solutions/5506031 (this article adds more detailed steps)
