
HDDS-10462. Fail Datanode Decommission Early


What changes were proposed in this pull request?

Many users try out Ozone on small clusters of 15 Datanodes or fewer. If a 15-DN cluster has some EC 10-4 containers, for example, then it is not possible to decommission more than 1 Datanode, because EC 10-4 requires at least 14 Datanodes. If someone tries to decommission 2 Datanodes, the decommissioning process is designed to keep looping, checking every 30 seconds whether the two DNs can be taken offline. It will never fail, even though taking the Datanodes offline is clearly not possible. In this PR, the decommission fails early if sufficient Datanodes are not available, based on the maximum replication factor of the containers present in the cluster, and a corresponding DatanodeAdminError is returned. The detailed design doc can be found in the EPIC HDDS-10461.
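
As a rough illustration of the check (a minimal sketch with hypothetical names, not the actual NodeDecommissionManager code), the idea is to compare the number of in-service Datanodes that would remain after decommissioning against the largest replication factor of any container in the cluster:

// Minimal sketch of the early-fail check; class and method names are
// hypothetical and do not correspond to the real NodeDecommissionManager API.
public class DecommissionCapacityCheckSketch {

  // Returns an error string if decommissioning the requested number of nodes
  // would leave fewer in-service nodes than the largest replication factor of
  // any container in the cluster; returns null when the request is safe.
  static String checkEnoughNodesRemain(int inServiceNodes,
      int nodesToDecommission, int maxContainerReplication) {
    int remaining = inServiceNodes - nodesToDecommission;
    if (remaining < maxContainerReplication) {
      return "AllHosts: Sufficient nodes are not available.";
    }
    return null;
  }

  public static void main(String[] args) {
    // 15-node cluster with EC 10-4 containers (14 replicas per container):
    // decommissioning 2 nodes leaves 13 < 14, so the request fails early.
    System.out.println(checkEnoughNodesRemain(15, 2, 14));
    // 5-node cluster with only RATIS/THREE containers: decommissioning 2
    // nodes leaves 3 >= 3, so the request is allowed (prints null).
    System.out.println(checkEnoughNodesRemain(5, 2, 3));
  }
}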

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10462

How was this patch tested?

Added unit tests in TestNodeDecommissionManager. Tested manually in a docker cluster. In the following cluster, RATIS-THREE is the maximum replication factor and there are 5 DNs, so at most 2 of them can be decommissioned without the --force flag:

bash-4.2$ ozone admin datanode decommission ozone-datanode-1 ozone-datanode-2 ozone-datanode-3
Started decommissioning datanode(s):
ozone-datanode-1
ozone-datanode-2
ozone-datanode-3
Error: AllHosts: Sufficient nodes are not available.
Some nodes could not enter the decommission workflow

bash-4.2$ ozone admin datanode decommission ozone-datanode-1                                  
Started decommissioning datanode(s):
ozone-datanode-1

bash-4.2$ ozone admin datanode decommission ozone-datanode-1 ozone-datanode-2                 
Started decommissioning datanode(s):
ozone-datanode-1
ozone-datanode-2

bash-4.2$ ozone admin datanode decommission ozone-datanode-3
Started decommissioning datanode(s):
ozone-datanode-3
Error: AllHosts: Sufficient nodes are not available.
Some nodes could not enter the decommission workflow

bash-4.2$ ozone admin datanode decommission ozone-datanode-3 --force
Started decommissioning datanode(s):
ozone-datanode-3

Tejaskriya, Mar 12 '24 06:03

> ...but a user should be able to toggle this off using the decommission command.

Yea, I think we need a --force flag or something to tell it to ignore the check. The reason is that a cluster could have a handful of EC 10-4 containers, but some nodes need to be taken out anyway. The alternative to decommissioning is to just stop them, and that has to be done one node at a time, waiting for replication to complete, so it's slow.

If we have the --force flag, then it would gracefully decommission as much as it could, i.e. all the Ratis containers and the smaller EC replication ones. Then it would get stuck, and the admin could make a decision to just stop the nodes, knowing the remaining EC containers will not be able to be replicated.

sodonnel, Mar 21 '24 14:03
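
For illustration, a minimal sketch of how this gating could look (hypothetical names; the actual patch performs the check in NodeDecommissionManager and reports a DatanodeAdminError, as described in the PR description):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch of gating the capacity check behind a force flag; names are
// hypothetical and do not correspond to the real NodeDecommissionManager code.
public class ForceFlagSketch {

  static List<String> startDecommission(List<String> hosts, boolean force,
      int inServiceNodes, int maxContainerReplication) {
    List<String> errors = new ArrayList<>();
    if (!force
        && inServiceNodes - hosts.size() < maxContainerReplication) {
      // Fail early: none of the hosts enter the decommission workflow.
      errors.add("AllHosts: Sufficient nodes are not available.");
      return errors;
    }
    // With --force, or when the check passes, the hosts proceed into the
    // normal decommission workflow as before (omitted here).
    return errors;
  }

  public static void main(String[] args) {
    // Mirrors the manual test above: only 3 in-service DNs remain in the
    // RATIS-THREE cluster, so decommissioning one more node fails without
    // --force but is allowed with it.
    System.out.println(startDecommission(
        Arrays.asList("ozone-datanode-3"), false, 3, 3));
    System.out.println(startDecommission(
        Arrays.asList("ozone-datanode-3"), true, 3, 3));
  }
}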

@sodonnel @siddhantsangwan Thank you for the reviews! I have addressed the comments. The check is now performed only if the --force flag is not set. Could you please take a look at the current changes?

Tejaskriya, Mar 26 '24 07:03

@Tejaskriya Is this ready for review?

siddhantsangwan, Mar 28 '24 05:03

> @Tejaskriya Is this ready for review?

@siddhantsangwan It is ready for review now. I have added a few more checks in the same test cases with the force flag set to true, and the behaviour is as expected. I have also tested it manually in a docker set-up and updated the results in the PR description under "How was this patch tested".

Tejaskriya, Mar 28 '24 06:03

Thanks everyone. Merged to master.

siddhantsangwan, Apr 02 '24 08:04