HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container
What changes were proposed in this pull request?
Background: The async write path is still not robust enough; when the cluster load is too high, some containers can become unrecoverable (no healthy replicas).
Currently, such an unrecoverable Ratis container goes through the following process:
- The DN marks the container as unhealthy and reports it to the SCM.
- The SCM then tries to close the container, and the container state becomes CLOSING.
- The DN will not close an unhealthy replica.
- The SCM ReplicationManager (RM) will not send a close command to those unhealthy containers.
Hence, the unrecoverable container gets stuck in the CLOSING state.
After the admin recovers whatever data is still readable from such containers, or simply abandons them, these containers should be closed on purpose.
Under such circumstances, we should provide a configurable way to clean up these closed containers. After the unhealthy container has been closed, an unrecoverable container with only unhealthy replicas can be deleted.
This solution can clean up a closed container with any number of replicas (1, 2, or 3). If all three replicas are unhealthy, RM deletes one replica to attempt recovery, so the replica count drops to 2. If one or two replicas are unhealthy, RM goes through this PR's logic and can delete the container.
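To make the proposed flow concrete, here is a minimal, self-contained Java sketch of the kind of check the ReplicationManager could apply. The class names, the config key, and the simplified replica types below are illustrative assumptions for this sketch, not the actual Ozone code.

```java
// Illustrative sketch only -- not the actual Ozone ReplicationManager code.
// The config key name and the simplified types below are assumptions.
import java.util.List;

public class UnhealthyContainerCleanupSketch {

  // Hypothetical config key an admin would have to enable explicitly.
  static final String CLEANUP_ENABLED_KEY =
      "ozone.scm.unhealthy.container.cleanup.enabled";

  enum LifeCycleState { OPEN, CLOSING, CLOSED, DELETING }
  enum ReplicaState { OPEN, CLOSING, CLOSED, UNHEALTHY }

  record Replica(String datanode, ReplicaState state) {}

  /**
   * A container is a cleanup candidate only when the admin has opted in,
   * the SCM has already closed it on purpose, and every remaining replica
   * is unhealthy, so there is nothing left worth re-replicating.
   */
  static boolean isCleanupCandidate(boolean cleanupEnabled,
                                    LifeCycleState containerState,
                                    List<Replica> replicas) {
    if (!cleanupEnabled || containerState != LifeCycleState.CLOSED) {
      return false;
    }
    return !replicas.isEmpty()
        && replicas.stream().allMatch(r -> r.state() == ReplicaState.UNHEALTHY);
  }

  public static void main(String[] args) {
    List<Replica> replicas = List.of(
        new Replica("dn1", ReplicaState.UNHEALTHY),
        new Replica("dn2", ReplicaState.UNHEALTHY));
    System.out.println(
        isCleanupCandidate(true, LifeCycleState.CLOSED, replicas)); // true
  }
}
```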
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7099
How was this patch tested?
Unit tests and verification in a production environment.
@duongnguyen0 can you take a look?
What is the expected behavior if there is only one replica left and it is unhealthy? Unhealthy does not imply that there are no customer readable keys that have data in that replica. Deletion, in general, is an unsafe option, and we need to be sure we do not introduce a data loss scenario.
What is the expected behavior if there is only one replica left and it is unhealthy?
By default, such containers will be left alone and Ozone will do nothing to them.
Unhealthy does not imply that there are no customer readable keys that have data in that replica. Deletion, in general, is an unsafe option, and we need to be sure we do not introduce a data loss scenario.
Yes, this is for the case where the administrator knows exactly which containers are affected, may already have restored the part of the keys that is still readable, and then needs to delete the unrecoverable containers instead of resetting the whole cluster. With the corresponding config enabled, the SCM will then send delete commands to the DNs.
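For illustration, a hedged sketch of how an opt-in flag like this might gate the SCM into building one delete command per unhealthy replica. The config key, the command record, and the plain-map configuration are hypothetical placeholders, not Ozone's real command protocol classes.

```java
// Illustrative only: an opt-in flag gating delete commands to datanodes.
// The config key and command/config types are hypothetical placeholders.
import java.util.List;
import java.util.Map;

public class DeleteCommandDispatchSketch {

  // Hypothetical opt-in key; disabled by default so nothing is deleted
  // unless the admin explicitly asks for it.
  static final String CLEANUP_ENABLED_KEY =
      "ozone.scm.unhealthy.container.cleanup.enabled";

  record DeleteContainerCmd(long containerId, String datanode) {}

  /** Builds one delete command per replica, but only when the flag is on. */
  static List<DeleteContainerCmd> buildDeleteCommands(
      Map<String, String> conf, long containerId, List<String> replicaNodes) {
    boolean enabled =
        Boolean.parseBoolean(conf.getOrDefault(CLEANUP_ENABLED_KEY, "false"));
    if (!enabled) {
      return List.of();           // default: leave the container alone
    }
    return replicaNodes.stream()
        .map(dn -> new DeleteContainerCmd(containerId, dn))
        .toList();
  }

  public static void main(String[] args) {
    Map<String, String> conf = Map.of(CLEANUP_ENABLED_KEY, "true");
    System.out.println(buildDeleteCommands(conf, 42L, List.of("dn1", "dn2")));
  }
}
```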
@kerneltime
Hi @Xushaohong, I don't think we want to be deleting containers with all replicas unhealthy automatically, because this will cause divergence between the OM's metadata and the corresponding block storage. If all container replicas are unhealthy, there is no way to recover the containers, and the admin would like the containers removed, then it would be better for the admin to delete the keys with data in those containers. The keys can be found using Recon's REST API: you can query an index mapping container ID to keys with blocks in that container from the /api/v1/containers/:id/keys endpoint. We should double check that this code path works though. I am not sure what happens if delete block commands get queued for unhealthy containers. If there are bugs in this area we should fix them so that deleting all keys with data in an unhealthy container causes the unhealthy container to eventually be deleted.
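As a concrete illustration of the Recon lookup suggested above, here is a small sketch using Java's built-in HTTP client. Only the /api/v1/containers/:id/keys path comes from the comment; the Recon address and the bare-bones response handling are assumptions.

```java
// Sketch: listing the keys that have blocks in a given container via Recon's
// REST API. The Recon address and plain-string handling of the JSON body
// are assumptions made for this example.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReconContainerKeysLookup {

  public static void main(String[] args) throws Exception {
    long containerId = 42L;                         // container to inspect
    String reconAddress = "http://recon-host:9888"; // assumed Recon endpoint

    HttpRequest request = HttpRequest.newBuilder(
            URI.create(reconAddress + "/api/v1/containers/" + containerId + "/keys"))
        .GET()
        .build();

    // The response is a JSON document listing the keys that still reference
    // the container; printing it raw keeps the sketch dependency-free.
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}
```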
@errose28
Hi, Ethan. Thx for the reply. Deleting the container from the SCM RM side does not seem like a very sound idea. Currently, the logic in isDeletionAllowed only permits closed containers to process block deletions; if the container is unhealthy, the DN will not process the delete commands and will report them back to the SCM. Can we add a check condition there to support deletion for unhealthy containers?
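To make the question concrete, here is a paraphrased sketch of the kind of state check being discussed; it is not the actual isDeletionAllowed implementation, and the extra flag is a hypothetical illustration of the proposed relaxation.

```java
// Paraphrased sketch of the state check under discussion -- not the actual
// isDeletionAllowed code. Today only CLOSED containers may process block
// deletions; the extra branch shows the proposed relaxation.
public class DeletionAllowedSketch {

  enum ContainerState { OPEN, CLOSING, CLOSED, UNHEALTHY, DELETED }

  static boolean isDeletionAllowed(ContainerState state,
                                   boolean allowUnhealthy) {
    if (state == ContainerState.CLOSED) {
      return true;                      // current behavior: CLOSED only
    }
    // Proposed extension: also let unhealthy replicas process block
    // deletions, so that deleting all keys in an unrecoverable container
    // eventually lets the container itself be removed.
    return allowUnhealthy && state == ContainerState.UNHEALTHY;
  }

  public static void main(String[] args) {
    System.out.println(isDeletionAllowed(ContainerState.UNHEALTHY, false)); // false today
    System.out.println(isDeletionAllowed(ContainerState.UNHEALTHY, true));  // true if extended
  }
}
```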
@Xushaohong we discussed this in the weekly open source meeting, and we plan to dive a bit deeper into the overall deletion logic and how this should be operationalized.
Thanks for raising the issue @Xushaohong. Handling containers where all replicas are in a degenerate state is definitely something the system should improve on. Adding to @kerneltime's response based on other discussions around this issue, it seems the desired solution would be to provide a path to remove containers from the system that have all replicas unhealthy or missing **and no keys mapped to them**, and that the system should do this **automatically without extra configuration**. My current understanding of this patch is that it is not doing the two parts in bold. Handling this is going to be a bit involved and may require a design document. I will try to write up some ideas to share out soon.
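For contrast with the flag-gated approach in the PR, here is a hedged sketch of the automatic condition described above: all replicas unhealthy or missing, and no keys still mapped to the container, with no opt-in flag. The key-count lookup is a hypothetical placeholder; in practice it would come from OM or from Recon's container-to-keys index.

```java
// Sketch of the automatic cleanup condition: every replica unhealthy or
// missing AND no keys still referencing the container. The key-count
// lookup function is a hypothetical placeholder.
import java.util.List;
import java.util.function.LongToIntFunction;

public class AutoCleanupConditionSketch {

  enum ReplicaState { OPEN, CLOSING, CLOSED, UNHEALTHY }

  record Replica(String datanode, ReplicaState state) {}

  /**
   * True only when every known replica is unhealthy, or there are none left
   * at all (the container is missing), and no keys reference the container
   * any more, so removing it cannot lose user data.
   */
  static boolean isSafeToRemove(long containerId,
                                List<Replica> replicas,
                                LongToIntFunction keyCountForContainer) {
    boolean allUnhealthyOrMissing = replicas.isEmpty()
        || replicas.stream().allMatch(r -> r.state() == ReplicaState.UNHEALTHY);
    return allUnhealthyOrMissing
        && keyCountForContainer.applyAsInt(containerId) == 0;
  }

  public static void main(String[] args) {
    List<Replica> replicas = List.of(new Replica("dn1", ReplicaState.UNHEALTHY));
    // Pretend no keys are mapped to container 7 any more.
    System.out.println(isSafeToRemove(7L, replicas, id -> 0)); // true
  }
}
```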
cc @GeorgeJahad
@errose28, @kerneltime, I've seen the unrecoverable container condition described in this PR as well, where the unrecoverable container is always reported to be in the 'Closing' state by the SCM while the container is reported as 'Missing' by Recon. I brought this up with @errose28 offline. In this case, the datanode goes down, causing Recon to update the state of the containers to 'Unhealthy' on refresh, with an associated missing container. The SCM, however, always reports this unrecoverable container in the 'Closing' state, which never changes and is misleading to the admin. It would be helpful if the PR handled this case and cleaned up such unrecoverable containers. See attached images.
Thx @errose28 for the reply; the auto-detection and cleanup is what we need. Currently, no single component, either OM or SCM, has the map of keys to containers. If the service lives in SCM, it might need another query to OM to check whether the container still has keys mapped to it. One concern is that such a map is only available through the Recon API, which is not clear enough and not commonly used.
We need a more complete patch, so closing this for now.