snap.microceph.osd fails to start when ONE OSD is missing

Open · slapcat opened this issue 9 months ago · 1 comment

Issue

While doing some maintenance recently, I removed a disk from a microceph node without purging it from the cluster first. After rebooting the node, snap.microceph.osd.service failed to start, so none of the OSDs on that node came up. The service seems to choke and go into an endless loop if even one OSD's block device is missing.

Workaround

If you do not plan for the OSD to come back online, you can delete the data directory for that OSD:

sudo rm -r /var/snap/microceph/common/data/osd/ceph-<osd.id>

Then restart snap.microceph.osd.service. The other OSDs will then come online, and you can purge the missing OSD from the Ceph cluster.
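If it is not obvious which OSD directory belongs to the missing disk, something like the following might help. It is only a rough sketch: `find_orphaned_osds` is a made-up helper name, and it assumes each OSD data directory is named `ceph-<id>` and contains a `block` symlink pointing at the backing device, which I have not verified for every microceph version.

```shell
#!/bin/sh
# Hypothetical helper (not part of microceph): print the ID of every OSD
# whose data directory exists but whose "block" entry is missing or is a
# symlink that no longer resolves to a device.
# Assumption: OSD directories are named ceph-<id> and hold a "block" symlink.
find_orphaned_osds() {
    osd_root="${1:-/var/snap/microceph/common/data/osd}"
    for dir in "$osd_root"/ceph-*; do
        # Skip the unexpanded glob when there are no OSD directories.
        [ -d "$dir" ] || continue
        # -e follows symlinks, so a dangling "block" link fails this test.
        if [ ! -e "$dir/block" ]; then
            echo "${dir##*/ceph-}"
        fi
    done
}

# Example usage (review the printed IDs before deleting anything):
#   find_orphaned_osds
#   sudo rm -r /var/snap/microceph/common/data/osd/ceph-<osd.id>
#   sudo systemctl restart snap.microceph.osd.service
```

The function only prints IDs and deletes nothing, so it is safe to run before deciding which directory to remove. Once the service is back up, one way to purge the OSD would be `sudo microceph.ceph osd purge <osd.id> --yes-i-really-mean-it`, assuming the OSD really is gone for good.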

Root cause

My colleague @MggMuggins believes the issue lies in this function: https://github.com/canonical/microceph/blob/main/snapcraft/commands/osd.start#L38

Mar 07 '25 18:03 slapcat

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/CEPH-1193.

This message was autogenerated

Mar 07 '25 18:03 syncronize-issues-to-jira[bot]