HDDS-6449. Failed container delete can leave artifacts on disk
What changes were proposed in this pull request?
In case there is a failure during a container delete, there can be artifact leftovers on disk. To avoid that, containers are renamed to a new location under a tmp directory and are deleted from that location. This solution only works if SchemaV3 is enabled; for previous schema versions the delete operation is exactly the same.
To facilitate this implementation, a helper class called CleanUpManager was created, which initializes the tmp directory on each disk under the path /<volume>/<clusterId>/tmp/container_delete_service and handles all operations related to this directory. Before every container delete, if SchemaV3 is enabled, we rename the container, set its path to the new location and then delete it from there.
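For illustration, the rename-then-delete flow looks roughly like the minimal sketch below. The class and method names here are placeholders, not the PR's actual CleanUpManager code; only the tmp path layout comes from the description above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative sketch of the rename-then-delete flow described above.
// Names are placeholders for the PR's CleanUpManager helper.
public class ContainerDeleteSketch {

  private final Path tmpDeleteServiceDir;

  public ContainerDeleteSketch(Path volumeRoot, String clusterId) {
    // <volume>/<clusterId>/tmp/container_delete_service
    this.tmpDeleteServiceDir = volumeRoot.resolve(clusterId)
        .resolve("tmp").resolve("container_delete_service");
  }

  public void deleteContainer(Path containerDir) throws IOException {
    Files.createDirectories(tmpDeleteServiceDir);
    // 1. Atomically rename the container directory into the tmp location so a
    //    failure mid-delete cannot leave partial artifacts under the live path.
    Path renamed = tmpDeleteServiceDir.resolve(
        containerDir.getFileName().toString());
    Files.move(containerDir, renamed, StandardCopyOption.ATOMIC_MOVE);
    // 2. Delete the container from the tmp location.
    deleteRecursively(renamed.toFile());
  }

  private void deleteRecursively(File dir) throws IOException {
    File[] children = dir.listFiles();
    if (children != null) {
      for (File child : children) {
        deleteRecursively(child);
      }
    }
    Files.deleteIfExists(dir.toPath());
  }
}
```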
On datanode startup and shutdown we check the tmp directory for any leftovers and clean it up. The datanode startup hook is in VersionEndpointTask.checkVolumeSet() and the shutdown hook is in HddsDatanodeService.stop().
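A minimal sketch of how those hooks could be wired, assuming hypothetical volume and manager types (the real hooks are the two methods named above; nothing below is the actual Ozone code):

```java
import java.io.File;
import java.util.List;

// Sketch of the startup/shutdown hook wiring described above.
// VolumeSketch is a stand-in for the per-disk volume abstraction.
public class CleanupHooksSketch {

  interface VolumeSketch {
    boolean isSchemaV3Enabled();
    File tmpDeleteServiceDir();
  }

  // Called from both the startup and the shutdown path.
  static void cleanLeftovers(List<VolumeSketch> volumes) {
    for (VolumeSketch volume : volumes) {
      if (!volume.isSchemaV3Enabled()) {
        continue; // pre-V3 schemas keep the old delete path
      }
      File[] leftovers = volume.tmpDeleteServiceDir().listFiles();
      if (leftovers == null) {
        continue;
      }
      for (File leftover : leftovers) {
        deleteQuietly(leftover);
      }
    }
  }

  static void deleteQuietly(File file) {
    File[] children = file.listFiles();
    if (children != null) {
      for (File child : children) {
        deleteQuietly(child);
      }
    }
    file.delete();
  }
}
```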
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6449
How was this patch tested?
This patch was tested with new unit tests. We added TestContainerPersistence.testDeleteContainerWithSchemaV3Enabled() for testing CleanUpManager and TestKeyValueHandler.testDeleteContainerWithSchemaV3Enabled() for testing KeyValueHandler.deleteInternal(). In addition, we tested cleaning the tmp directory on datanode startup in TestEndPoint.testGetVersionTask() and on datanode shutdown in TestHddsDatanodeService.
There were some issues with the existing code for TestContainerPersistence.testDeleteContainer(). The code block that was expected to throw a non-empty container exception was never executed, because the method exited upon the first exception thrown. Furthermore, the command used for deleting the container didn't check whether the container was empty. To fix that, the method was split into two new ones and KeyValueHandler.deleteContainer() was used instead of container.delete().
Due to the changes related to SchemaV3, some tests under the container-service package were failing, so we had to make small changes to fix their setups.
@errose28 Thanks for reviewing this. I will make the changes and modify the tests accordingly. Prior to SchemaV3, the clusterId doesn't appear to be available. We were told that clusterId should be available in the tmp dir path, and that was an issue with previous versions.
The cluster ID directory on the datanode has a complicated history, but it is not related to schema V3. For a detailed look see HDDS-5432 and the design document linked there. Basically before SCM HA the cluster ID directory on each volume was actually using the SCM ID. This became a problem when SCM HA was introduced since there are now 3 SCMs, but renaming the directory was an incompatible change. At the end of it all the volume could be formatted in the following ways:
- Only cluster ID directory
  - The volume was formatted with Ozone 1.3.0.
- SCM ID directory pointed to by a cluster ID symlink
  - The volume was formatted with Ozone <= 1.2.0 and SCM HA was finalized.
- Only SCM ID
  - The volume was formatted with Ozone <= 1.2.0 and SCM HA has not yet been finalized.
  - This would happen on the upgrade path from 1.1.0 to 1.3.0.
In general you should not need to worry about this. Just call VersionedDatanodeFeatures#ScmHA#chooseContainerPathID and use the result as the parent directory.
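For what it's worth, a self-contained sketch of the decision that helper makes across the three layouts listed above. This is illustrative only: in the PR the actual VersionedDatanodeFeatures#ScmHA#chooseContainerPathID method should be called rather than re-implemented, and the names below are placeholders.

```java
import java.io.File;

// Illustrative only: mirrors the choice among the three volume layouts
// described above. Not the Ozone API.
public final class ContainerPathIdSketch {

  // Returns the directory name under the volume root that should be used as
  // the parent for container data (and the tmp delete directory).
  static String choosePathId(File volumeRoot, String clusterId, String scmId) {
    File clusterIdDir = new File(volumeRoot, clusterId);
    if (clusterIdDir.exists()) {
      // Covers both the plain cluster ID directory (volumes formatted with
      // Ozone 1.3.0) and the cluster ID symlink to the SCM ID directory
      // (volumes formatted with Ozone <= 1.2.0 after SCM HA finalization).
      return clusterId;
    }
    // Only the SCM ID directory exists: Ozone <= 1.2.0 volume where SCM HA
    // has not yet been finalized.
    return scmId;
  }

  private ContainerPathIdSketch() {
  }
}
```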
One other thing to check is what happens with schema V3 when the container is moved to the tmp dir but the RocksDB update fails. Will the datanode restart cleanly? What happens to the RocksDB entries for that container?
@errose28 Thanks for your comments, I've made the changes and also modified the tests to use the new code path. Let me know what you think.
> One other thing to check is what happens with schema V3 when the container is moved to the tmp dir but the RocksDB update fails. Will the datanode restart cleanly? What happens to the RocksDB entries for that container?
I will be looking at that next.
Thanks @errose28 for our offline discussions on this container delete / rename change and for the PR review and comments.
> One other thing to check is what happens with schema V3 when the container is moved to the tmp dir but the RocksDB update fails.
During the design discussions with @swagle and @nandakumar131 on this case, a failed db access when deleting the container from
https://github.com/apache/ozone/blob/0e9ea9fba0d78becc393945205fa5450ab31f066/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/helpers/KeyValueContainerUtil.java#L149
BlockUtils.removeContainerFromDB(containerData, conf);
was deemed to be unrecoverable. At the time we called it an unrecoverable error: should the access to delete the container from the db fail, the disk should be replaced.
Right now for that case, we log the error and throw an exception that is translated to a StorageContainerException. Presently, there is no recoverable path for the db container delete failure, as discussed in the design discussions. If it should be recoverable, what do you suggest? @swagle, @nandakumar131 thoughts?
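For context, the current handling amounts to a log-and-rethrow. A minimal, self-contained sketch of that control flow follows; the exception class and helpers below are stand-ins, not the real Ozone classes.

```java
import java.io.IOException;

// Stand-in sketch of the "log and translate to StorageContainerException"
// handling described above. None of these classes are the real Ozone code;
// they only mirror the control flow.
public class DbDeleteFailureSketch {

  // Stand-in for the real StorageContainerException.
  static class StorageContainerException extends IOException {
    StorageContainerException(String msg, Throwable cause) {
      super(msg, cause);
    }
  }

  // Stand-in for BlockUtils.removeContainerFromDB(containerData, conf);
  // simulates a db failure.
  static void removeContainerFromDB(long containerId) throws IOException {
    throw new IOException("simulated db failure");
  }

  static void deleteContainerFromDb(long containerId)
      throws StorageContainerException {
    try {
      removeContainerFromDB(containerId);
    } catch (IOException ex) {
      // Currently treated as unrecoverable: log the error and surface it to
      // the caller as a StorageContainerException.
      System.err.println("Failed to remove container " + containerId
          + " from db: " + ex.getMessage());
      throw new StorageContainerException(
          "Container " + containerId + " db delete failed", ex);
    }
  }
}
```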
I've looked through the existing code, and if the RocksDB update fails for schema v3 I think we will be ok with a minor improvement in the current design. The volume may eventually be marked unhealthy but even if it is not I think this error is recoverable.
Schema < V3
For schema < V3 we have the following operations:
1. Container directory renamed to tmp directory.
2. Container is removed from in memory container set.
3. Container is deleted from tmp directory.
The initial design already handles a failure in steps 1 or 3. Step 2 is in memory, so it should not fail. For schema V3, however, we have an extra operation since the volume's RocksDB needs to be updated.
Schema V3
For schema V3 we have the following operations:
1. Container directory renamed to tmp directory.
2. Container is removed from in memory container set.
3. Container's entries are removed from RocksDB.
4. Container is deleted from tmp directory.
The original design still handles failures in steps 1 and 4 above as intended, but what happens if step 3 fails? The container has been removed from the in memory container set in step 2, so there should not be any correctness issues while the datanode is running. On restart, the container set is rebuilt by scanning the available datanode volumes, and container metadata is populated based on the values in RocksDB. Since the container is already in the tmp directory, it will not be seen in this step and will not be loaded into the container set, so we are still behaving correctly.
This does leave some orphaned entries in the volume's RocksDB, however I think these should be relatively easy to clean up. On the startup/shutdown hooks where we are deleting the containers from the tmp directory, we can check if the container is schema V3. If it is, we should first delete any remaining RocksDB entries it may have (if they exist) and then we can delete the container from the tmp directory. If this RocksDB delete fails the container should be skipped and its delete will happen again on the next startup/shutdown. Since the container move to the tmp directory is atomic, we can check RocksDB for each schema V3 .container file we find in the tmp directory. If the contents of the tmp directory are partially deleted and there is no .container file (maybe a failed delete from the tmp directory in the past) then the RocksDB entries must have already been removed so we can proceed to delete those contents.
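A rough, self-contained sketch of that cleanup pass, assuming hypothetical helpers (isSchemaV3, removeContainerEntriesFromDB) in place of whatever the PR's CleanUpManager ends up using:

```java
import java.io.File;
import java.io.IOException;

// Sketch of the startup/shutdown cleanup proposed above. The helpers
// isSchemaV3 and removeContainerEntriesFromDB are placeholders, not Ozone APIs.
public class TmpDirCleanupSketch {

  public void cleanTmpDir(File tmpDeleteServiceDir) {
    File[] leftovers = tmpDeleteServiceDir.listFiles(File::isDirectory);
    if (leftovers == null) {
      return;
    }
    for (File containerDir : leftovers) {
      try {
        File containerFile = findContainerFile(containerDir);
        if (containerFile != null && isSchemaV3(containerFile)) {
          // The move into the tmp dir was atomic, so a present .container
          // file means the volume's RocksDB may still hold entries for this
          // container: remove them before deleting the directory.
          removeContainerEntriesFromDB(containerFile);
        }
        // Either the db entries are gone, or there is no .container file
        // (a previous partial delete), so the leftovers can be removed.
        deleteRecursively(containerDir);
      } catch (IOException e) {
        // Skip: this container's delete is retried on the next
        // startup/shutdown cleanup pass.
      }
    }
  }

  // Simplification: in the real layout the .container file sits under the
  // container's metadata directory, not at the top level.
  private File findContainerFile(File containerDir) {
    File[] matches = containerDir.listFiles(
        (dir, name) -> name.endsWith(".container"));
    return (matches == null || matches.length == 0) ? null : matches[0];
  }

  // Placeholder: would read the schema version from the .container file.
  private boolean isSchemaV3(File containerFile) {
    return true;
  }

  // Placeholder: would delete the container's entries from the volume's
  // RocksDB instance.
  private void removeContainerEntriesFromDB(File containerFile)
      throws IOException {
  }

  private void deleteRecursively(File dir) throws IOException {
    File[] children = dir.listFiles();
    if (children != null) {
      for (File child : children) {
        deleteRecursively(child);
      }
    }
    if (!dir.delete() && dir.exists()) {
      throw new IOException("Failed to delete " + dir);
    }
  }
}
```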
cc @aswinshakil
Hi @errose28. Per the dev email communication, this PR is required by Ozone 1.3, and the Ozone 1.3 release is currently blocked on it. Could you help continue to review this PR?
Yes, I'm working on a re-review to get this through.
@captainzmc we still need some more work to get this PR ready. I'm not sure it needs to be a release blocker. The bug is if there is an IO error while deleting the container, some pieces may be left behind in the container directory. This will cause the datanode to log a warning on startup but has no impact on functionality. Users concerned about the log message can check the container ID using SCM admin commands to see that it is deleted and manually remove the artifacts safely if they wish.
> @captainzmc we still need some more work to get this PR ready. I'm not sure it needs to be a release blocker. The bug is if there is an IO error while deleting the container, some pieces may be left behind in the container directory. This will cause the datanode to log a warning on startup but has no impact on functionality. Users concerned about the log message can check the container ID using SCM admin commands to see that it is deleted and manually remove the artifacts safely if they wish.
Thanks to @errose28 for explaining the impact of this bug. It seems that this does not affect the use of the cluster, and the user can manually delete these artifacts.
Hi @neils-dev. The release of 1.3.0 is now nearly two months behind schedule. If this PR takes too long to finish, I recommend removing it from the block list. In addition, some new bugs may be found after the 1.3.0 release, so we can cherry-pick new bug fixes from master later and then release 1.3.1. What do you think?
> If this PR takes too long to finish, I recommend removing it from the block list. In addition, some new bugs may be found after the 1.3.0 release, so we can cherry-pick new bug fixes from master later and then release 1.3.1.
@captainzmc okay, let's do that.
Thanks @captainzmc, @errose28, @xBis7.
> Hi @neils-dev. The release of 1.3.0 is now nearly two months behind schedule... I recommend removing it from the block list
Sure, let's remove it from the block list. Thanks!
@errose28 I'm done with the changes you requested. I left a few comments based on some details that are unclear to me.
Did you get a chance to try the downgrade to 1.1.0 testing I mentioned earlier?
@errose28 I was able to run this script successfully twice: hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/upgrade/test.sh, once with run_test non-rolling-upgrade 1.1.0 1.3.0 and once with run_test non-rolling-upgrade 1.2.1 1.3.0. For some weird reason, I had to install robotframework under /compose/upgrade to get it to work, even though I have it installed on my machine. Apart from that, no issues. I also checked the result logs and I couldn't find any errors related to our changes.
@errose28 thanks for reviewing this. I made the changes and updated the tests. I also added an SCM exception in case moving the container fails. Let me know how it looks.
Hey @xBis7, please help take a look at some failed tests. Thanks!
Hey @DaveTeng0, if you are referring to the workflow, it's unrelated to the PR. I got a green workflow for the latest two commits on my fork. Check below.
KeyValueHandler.importContainerData also tries to clean up the container via a container delete should the import fail. This would be a good candidate for the container move to the tmp directory and delete, to ensure no artifacts are left in case the delete fails. https://github.com/apache/ozone/blob/6389d01605053a9a5ae5510b67fb2bc6204db9d2/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java#L560
@errose28, is this what you had in mind for reusing the tmp dir and cleanup for the volume? In this case, should the container import fail, it would move the container to the tmp directory and try to delete it from the db. In the event of a failure, it would be picked up by the cleanup on datanode shutdown/start and retried.
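A rough sketch of that reuse under the stated assumptions; every name below is a placeholder for whatever the eventual implementation uses, not the actual Ozone code.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of the failed-import cleanup idea above: move the partially imported
// container into the per-volume tmp delete directory, attempt the db and
// directory delete, and otherwise rely on the startup/shutdown cleanup retry.
public class FailedImportCleanupSketch {

  public void onImportFailure(Path containerDir, Path tmpDeleteServiceDir)
      throws IOException {
    Files.createDirectories(tmpDeleteServiceDir);
    // Atomic move keeps partial artifacts out of the live container path.
    Path moved = tmpDeleteServiceDir.resolve(
        containerDir.getFileName().toString());
    Files.move(containerDir, moved, StandardCopyOption.ATOMIC_MOVE);
    try {
      removeContainerEntriesFromDb(moved);
      deleteRecursively(moved.toFile());
    } catch (IOException e) {
      // Leave the leftovers in the tmp dir; the datanode start/stop cleanup
      // pass will retry the db delete and directory removal.
    }
  }

  // Placeholder: would remove any db entries created during the import.
  private void removeContainerEntriesFromDb(Path containerDir)
      throws IOException {
  }

  private void deleteRecursively(File dir) throws IOException {
    File[] children = dir.listFiles();
    if (children != null) {
      for (File child : children) {
        deleteRecursively(child);
      }
    }
    Files.deleteIfExists(dir.toPath());
  }
}
```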
@neils-dev I've updated the patch.
@neils-dev I've expanded the test for checking that the container has been removed from the DB during tmp dir cleanup.
@errose28, we've raised a ticket for an issue we found while performing end to end tests of container delete with the V3 schema RocksDB: https://issues.apache.org/jira/browse/HDDS-8173
On a container delete processed by KeyValueHandler.deleteInternal and handled by KeyValueContainerUtil.removeContainer, the rocksdb V3 entries for the container are not cleared. After the container delete, the container metadata and chunks directories are deleted, however the rocksdb V3 container table entries are not removed.
Within the filed ticket, a test is included (a patch to TestContainerPersistence) that can be used to verify the rocksdb V3 container delete behavior: https://issues.apache.org/jira/secure/attachment/13056380/rocksDBContainerDelete.diff
Schema V2
The rocksdb should be unaffected as there is a rocksDB per container. On container delete, the rocksDB is moved to the tmp directory along with the container metadata and chunks directory and deleted.
Other than the issue raised in HDDS-8173, the end to end testing of container delete with the atomic move works without issue. A normal container delete attempts to remove the container entries from rocksdb (schema V3), moves the container directories on the volume to the tmp directory and deletes them. Should an error occur, either from an exception generated by the rocksDB helper functions or from deleting the tmp directory after the move, the delete from rocksdb (schema V3) and the delete from tmp are retried on datanode shutdown and restart.
@neils-dev please trim "extended description" when merging PR. I think listing all individual commit messages from the PR (which Github does by default) is unnecessary, and in this case it is really overly verbose:
HDDS-6449. Failed container delete can leave artifacts on disk (#3741)
* cleanUpManager - tmpDir - rename container path - delete container in tmpDir
* TestContainerDeleteWithSchemaV3Enabled improvement, cleanup
* cleanTmpDir on datanode start and shutdown, cleanup
* File.renameTo - rename before deletes if V3 enabled
* datanode service start/stop cleanTmpDir()
* HddsDatanodeService cleanTmpDir() getting schema
* testing cleanTmpDir on datanode start/stop
* modify /tmp path based on volume
* adding cluster id to /tmp path
* remove flag SchemaV3Enabled - clusterId/scmId on the /tmp path
* move cleanTmpDir startup hook to VersionEndpointTask
* no changes
* checking if schema v3 enabled
* check schema v3 is enabled using configuration in CleanUpManager
* initialize CleanUpManager if schemaV3 - testDeleteContainerWithSchemaV3Enabled execute conditionally
* TestHddsDatanodeService - getTmpDirPath from CleanUpManager
* TestKeyValueHandler testDeleteContainerWithSchemaV3Enabled
* TestContainerPersistence fix
* findbugs errors fixed
* TestEndpoint test clean tmp dir on datanode startup
* TestContainerPersistence fix - datanode startup hook moved
* checkstyle fix
* failing tests fixed
* cleanup
* cleanup - moving CleanUpManager under common.helpers package
* check instance of containerdata - cleanup
* TestContainerPersistence findbugs error fix
* CleanUpManager.renameDir javadoc comment
* CleanUpManager.setTmpDirPath exception message
* CleanUpManager tmpDirPath made final
* moving tmp dir under HddsVolume
* IntegrationTest error fixed
* deleted tmp files
* TestDatanodeUpgradeToScmHA#testImportContainer fix
* cleanup
* unit tests fixed
* checkstyle fixed
* remove container db failure exception
* db failure rethrow exception
* cleanup
* db failure exception
* container delete operations moved under KeyValueContainerUtil
* workflow errors fixed
* rebasing - TestHddsDatanodeService migrating to junit5
* build errors from rebasing, fixed - KeyValueContainerUtil cleanup
* tests cleaned up
* findbugs error fixed
* null pointer exception for tmp dir during datanode startup fixed
* calling ContainerUtils#getContainerFile from cleanTmpDir - tests updated
* scm exception if container move failed
* rebasing
* createTmpDir if null, while creating delete service dir
* deleteInternal more specific SCE thrown on container rename
* HddsVolume using Paths lib - cleanup
* remove null check before setting tmp dir paths
* cleanup
* logging error on removeContainerFromDB failure
* cleanup
* removing checkTmpDirs() - tests cleanup
* delete test generated directories
* tests cleanup
* test db entries on tmp dir cleanup
* merging errors fixed
* tests cleanup
* reducing test diffs
---------