snap.microceph.osd fails to start when ONE OSD is missing
Issue
While doing some maintenance recently, I removed a disk from a microceph node without first purging its OSD from the cluster. After rebooting the node, snap.microceph.osd.service failed to start, so none of the OSDs on that node came up. The service seems to choke and loop endlessly if one OSD's block device is missing.
Workaround
If you do not plan for the OSD to come back online, you can delete the data directory for that OSD:
sudo rm -r /var/snap/microceph/common/data/osd/ceph-<osd.id>
Then restart snap.microceph.osd.service. The other OSDs will then come online, and you can purge the missing OSD from the Ceph cluster.
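If it is not obvious which OSD id lost its disk, something like the following might help find it before deleting anything. This is only a rough sketch: the function name find_missing_osds is made up for illustration, and it assumes each OSD data directory is named ceph-<id> and contains a "block" symlink pointing at the backing device, which I have not verified for every microceph version.

```shell
#!/bin/sh
# Hypothetical helper: print the id of every OSD whose "block" symlink
# points at a device that no longer exists.
# Assumption: OSD data directories are named ceph-<id> and each holds a
# "block" symlink to its backing device.
find_missing_osds() {
    osd_root="${1:-/var/snap/microceph/common/data/osd}"
    for dir in "$osd_root"/ceph-*; do
        # Skip the unexpanded glob when no OSD directories exist.
        [ -d "$dir" ] || continue
        # -L: "block" is a symlink; ! -e: its target cannot be resolved.
        if [ -L "$dir/block" ] && [ ! -e "$dir/block" ]; then
            # Strip everything up to and including "ceph-" to leave the id.
            echo "${dir##*/ceph-}"
        fi
    done
}
```

Called with no arguments it checks the default data path from the workaround above; run it with sudo if the directory is not readable by your user. Each printed id is a candidate for the <osd.id> placeholder, but please double-check before running rm on anything.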
Root cause
My colleague @MggMuggins believes the issue lies in this function: https://github.com/canonical/microceph/blob/main/snapcraft/commands/osd.start#L38
Thank you for reporting your feedback to us!
The internal ticket has been created: https://warthogs.atlassian.net/browse/CEPH-1193.
This message was autogenerated