Maintenance mode enter pre-flight check does not account for other nodes in maintenance mode
The pre-flight checks for microceph cluster maintenance enter do not consider whether other nodes are already in maintenance mode. This means you can currently run the following successfully:
microceph cluster maintenance enter node-1
microceph cluster maintenance enter node-2
... all the nodes
There are already checks that try to ensure enough nodes remain active to retain quorum, but they do not account for other nodes being in maintenance mode.
This can be misleading because, in practice, it is risky to have a majority of nodes in maintenance mode.
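To illustrate the behaviour I would expect, here is a minimal sketch of a pre-flight check that takes existing maintenance state into account. It is not MicroCeph's actual code: the function name `can_enter_maintenance` and its parameters are hypothetical, and it treats "enough nodes remain active" as a strict majority of all nodes.

```python
def can_enter_maintenance(all_nodes, in_maintenance, target):
    """Return True if `target` may enter maintenance mode.

    Hypothetical helper, not part of MicroCeph. `all_nodes` is every node
    in the cluster and `in_maintenance` is the set of nodes already in
    maintenance mode.
    """
    all_nodes = set(all_nodes)
    in_maintenance = set(in_maintenance)

    # Re-entering maintenance on the same node stays a no-op (idempotent).
    if target in in_maintenance:
        return True

    # Nodes that would still be active after `target` enters maintenance.
    remaining = all_nodes - in_maintenance - {target}

    # Require a strict majority of the cluster to stay active.
    return len(remaining) > len(all_nodes) // 2
```

With three nodes, the first enter would pass and the second would be rejected, because only one of three nodes would remain active.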
Thank you for reporting your feedback to us!
The internal ticket has been created: https://warthogs.atlassian.net/browse/CEPH-1307.
This message was autogenerated
Good catch. For future travellers: tools like ceph report could be used to get a unified view of active Ceph services. @samuelallan72 what do you think?
@UtkarshBhatthere I could be wrong, but doesn't the new maintenance mode by default simply mark the node as in maintenance without stopping any services? In that case ceph report would still show the nodes as active. Perhaps we need something specific to microceph to keep track of the maintenance mode in/out state?
The above case is ~~impossible~~ for microceph without --force currently (see test case), since the number of mons correlates with the number of nodes (up to a maximum of 3 mons unless you manually add more). Validating NonOsdSvcs is therefore the same as validating the number of nodes.
(Edit: it is possible if you have more than 3 nodes to begin with. We should definitely improve that part.)
In addition, the behaviour described in the title ("Maintenance mode enter pre-flight check does not account for other nodes in maintenance mode") is a design decision made to keep the operation idempotent, but we can improve that.
What we could do, and have thought of doing, is add a GET method to the /ops/maintenance/{node} API endpoint to retrieve (or re-validate) the maintenance status of a particular node.
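As a rough sketch of the idea, the snippet below models a per-node maintenance record that a GET on /ops/maintenance/{node} could read from. Everything here is an assumption for illustration: the `MaintenanceRegistry` class, its method names, and the shape of the returned dictionary are hypothetical and do not reflect an existing MicroCeph API.

```python
from dataclasses import dataclass, field


@dataclass
class MaintenanceRegistry:
    """Hypothetical in-memory record of which nodes are in maintenance."""

    in_maintenance: set = field(default_factory=set)

    def enter(self, node: str) -> None:
        # Adding to a set is idempotent: entering twice changes nothing.
        self.in_maintenance.add(node)

    def exit(self, node: str) -> None:
        # discard() does not raise if the node was never in maintenance.
        self.in_maintenance.discard(node)

    def get(self, node: str) -> dict:
        # Models what a GET on /ops/maintenance/{node} might return
        # (the response shape is an assumption).
        return {"node": node, "maintenance": node in self.in_maintenance}
```

A pre-flight check could then query this state for every node before allowing another node to enter maintenance mode.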