
HDDS-7083. Spread container-copy directories

Open symious opened this issue 2 years ago • 11 comments

What changes were proposed in this pull request?

The default container-copy directory is "/tmp". When several replication jobs are running, the performance of the disk holding "/tmp" degrades badly, and if the Ratis directory is also on the same disk, the Ratis work is affected as well.

This ticket adds a configuration to spread the "container-copy" work across different volumes to mitigate this issue.
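For illustration, here is a minimal sketch of what turning such a spread switch on could look like through `OzoneConfiguration`; the boolean key name `hdds.datanode.replication.copy.spread` is a placeholder invented for this example, not necessarily the key added by the patch.

```java
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

public class SpreadConfigExample {
  public static void main(String[] args) {
    OzoneConfiguration conf = new OzoneConfiguration();
    // Existing behavior: a single work directory, by default under /tmp.
    conf.set("hdds.datanode.replication.work.dir", "/tmp/container-copy");
    // Hypothetical switch (placeholder name) to spread container-copy work
    // across the datanode's data volumes instead of using the single dir.
    conf.setBoolean("hdds.datanode.replication.copy.spread", true);
    System.out.println(conf.get("hdds.datanode.replication.work.dir"));
  }
}
```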

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7083

How was this patch tested?

Unit test.

symious avatar Aug 03 '22 09:08 symious

Thanks for the proposal, Symious. I agree we should not use /tmp for intermediate datanode operations like container import. I have a few questions.

  1. Why is this behind a config key? I don't know of a situation where users would want to turn this feature off.
  2. It looks like the code is randomly choosing volumes. Wouldn't it be better to choose the volume the container is being imported to?
  3. If a container is imported and there is some kind of error causing the import to be aborted, what happens to the remaining artifacts? Does this prevent the import from being retried?

FYI HDDS-6449 is under active development by @neils-dev, which will add a tmp directory to each volume to be used for intermediate container operations. Containers can be moved out of the directory to their final destination with an atomic rename since they reside on the same file system, and artifacts can be periodically cleaned out of the temp directory on restart or other failures. I think we can implement this after HDDS-6449 to use the same directory.

errose28 avatar Aug 03 '22 17:08 errose28

@errose28 Thanks for the review.

  1. Why is this behind a config key? I don't know of a situation where users would want to turn this feature off.

It's because of the existing "hdds.datanode.replication.work.dir" config: if we enabled the spread config by default, users might be confused that the value of "hdds.datanode.replication.work.dir" is no longer used. If a user actively enables the spread config, we can assume they already know "hdds.datanode.replication.work.dir" won't be used.

  1. It looks like the code is randomly choosing volumes. Wouldn't it be better to choose the volume the container is being imported to?

It's because of the steps of the import process:

  1. Download the container from other datanodes.
  2. Initialize an InputStream based on the downloaded file; KeyValueContainerHandler will use this InputStream to import the container.

The destination volume is only chosen in step 2, so in step 1 we don't know which volume the container will end up on. We could also pass the download path to the KeyValueContainerHandler (which would require an interface change), but I tested inter-volume and intra-volume copy speeds and found essentially no difference, so I left the original handler import process unchanged.
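Roughly, the random-volume choice for step 1 could look like the sketch below; the class, field, and directory names are illustrative only and are not the actual Ozone classes used by the patch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class DownloadDirChooser {
  // Illustrative only: in Ozone the volumes come from the datanode's
  // volume set; here they are plain paths for the sake of the sketch.
  private final List<Path> volumeRoots;

  public DownloadDirChooser(List<Path> volumeRoots) {
    this.volumeRoots = volumeRoots;
  }

  /** Step 1: pick a random volume to hold the downloaded container tarball. */
  public Path chooseDownloadDir(long containerId) {
    Path volume = volumeRoots.get(
        ThreadLocalRandom.current().nextInt(volumeRoots.size()));
    // e.g. <volume>/container-copy/<containerId>
    return volume.resolve("container-copy").resolve(Long.toString(containerId));
  }

  public static void main(String[] args) {
    DownloadDirChooser chooser = new DownloadDirChooser(
        List.of(Paths.get("/data1"), Paths.get("/data2"), Paths.get("/data3")));
    System.out.println(chooser.chooseDownloadDir(1001L));
  }
}
```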

  1. If a container is imported and there is some kind of error causing the import to be aborted, what happens to the remaining artifacts? Does this prevent the import from being retried?

If exceptions happen during the import, the downloaded files are deleted by the handler's implementation.
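For context, that cleanup-on-failure behavior amounts to a best-effort recursive delete of the download directory, along the lines of this sketch (not the handler's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class ImportCleanupExample {
  /** On a failed import, best-effort delete of the downloaded artifacts. */
  static void cleanupDownload(Path downloadDir) {
    try (Stream<Path> paths = Files.walk(downloadDir)) {
      // Delete children before parents.
      paths.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.deleteIfExists(p);
        } catch (IOException e) {
          // Best effort; leftovers can be handled on the next attempt.
        }
      });
    } catch (IOException ignored) {
      // Directory may already be gone.
    }
  }
}
```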

symious avatar Aug 04 '22 07:08 symious

Thanks @symious for improving the replication feature. I agree with @errose28's suggestions. We can create a proper sub-directory under each "hdds.datanode.dir" for downloaded containers, instead of introducing a new "hdds.datanode.replication.work.dir" configuration, and the downloaded container can then be renamed into the data directory on the same volume, avoiding the cost of copying the container between volumes.
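To make the same-volume rename concrete, here is a minimal sketch using standard java.nio; the method and directory names are placeholders for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SameVolumeMoveExample {
  /**
   * Move an imported container from the volume's download sub-directory
   * into the volume's data directory. Because source and destination are
   * on the same filesystem, the rename is atomic and no data is copied
   * between volumes.
   */
  static void finalizeImport(Path downloadedContainerDir, Path finalContainerDir)
      throws IOException {
    Files.createDirectories(finalContainerDir.getParent());
    Files.move(downloadedContainerDir, finalContainerDir,
        StandardCopyOption.ATOMIC_MOVE);
  }
}
```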

Considering that HDDS-6449 has quite a few tasks and its timeline is uncertain, maybe we can proceed with this JIRA first and let HDDS-6449 leverage the sub-directory if possible (HDDS-6449 introduces a background deletion service that periodically deletes content under the directory, so it might not be a good idea to share the same directory). What do you think @errose28?

ChenSammi avatar Aug 09 '22 02:08 ChenSammi

@ChenSammi Thank you for the review. Updated the PR to enable the feature by default.

symious avatar Aug 09 '22 10:08 symious

Thanks for the input @ChenSammi. The subtasks in HDDS-6449 were actually created before we had design discussions. There won't need to be a background service, which greatly simplifies the task, and it could probably be done in one Jira. But I agree that the progress on that task is uncertain and we have a proposal to fix the import issue now, so let's go ahead and introduce the new directory here.

Some thoughts on introducing the new directory:

  • I think we should have one tmp working directory per volume, with different subdirectories for different tasks as you mentioned. Container delete, import, and create would be examples, although we only need to handle import in this jira.
  • Containers should be moved from the tmp directory to their intended location using a directory rename that is atomic at the FS level.
  • IMO hdds.datanode.replication.work.dir should be deprecated and ignored. Datanodes require a clean disk state in their container directories on restart (see HDDS-6441 and HDDS-6449), so any non-atomic FS operation that could compromise that, like RocksDB creation (schema < V3), copying from the import directory, or deletion, must be completed with an atomic directory rename on the same filesystem.
  • The tmp directory needs to be chosen with care.
    • Currently all volume state is contained in the hdds subdirectory of each volume directory/disk mount. I think this feature should maintain that practice.
    • We cannot add a new subdirectory immediately under hdds, as this would break downgrade without an upgrade layout feature. Ozone 1.1.0 has a check that the hdds directory contains only one subdirectory.
    • IMO hdds/<clusterID>/tmp would be the best place to put it (see the path sketch below), but we need to make sure this will not affect datanode startup on downgrade. If it does and there is no better directory, we may need to add a simple HDDS layout feature for this change.
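To make the proposed layout concrete, a small sketch of how the per-volume tmp path could be resolved; the sub-directory names such as `container_import` are placeholders, not necessarily the names the patch would use.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class VolumeTmpLayoutExample {
  /** e.g. /data1/hdds/<clusterID>/tmp/<task> — one tmp tree per volume. */
  static Path tmpDirFor(Path volumeRoot, String clusterId, String task) {
    return volumeRoot.resolve("hdds").resolve(clusterId)
        .resolve("tmp").resolve(task);
  }

  public static void main(String[] args) {
    Path volume = Paths.get("/data1");
    String clusterId = "CID-1234"; // placeholder cluster ID
    System.out.println(tmpDirFor(volume, clusterId, "container_import"));
    System.out.println(tmpDirFor(volume, clusterId, "container_delete"));
  }
}
```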

Let me know your thoughts on choosing a location for the tmp directory.

errose28 avatar Aug 09 '22 17:08 errose28

* IMO `hdds.datanode.replication.work.dir` should be deprecated and ignored. Datanodes require a clean disk state in their container directories on restart (see [HDDS-6441](https://issues.apache.org/jira/browse/HDDS-6441) and [HDDS-6449](https://issues.apache.org/jira/browse/HDDS-6449)), so any non-atomic FS operation that could compromise that, like RocksDB creation (schema < V3), copying from the import directory, or deletion, must be completed with an atomic directory rename on the same filesystem.

We can hide this property from ozone-default.xml and mark it as deprecated in a comment. Then new Ozone users will not be aware of this property. For existing Ozone users, we should consider still honoring this property to keep backward compatibility, if the implementation is not too complex.

Let me know your thoughts on choosing a location for the tmp directory.

@errose28 agree, hdds/<clusterID>/tmp is a good choice. BTW, how do we currently verify whether datanode downgrade will be impacted or not?

ChenSammi avatar Aug 10 '22 04:08 ChenSammi

how do we currently verify whether datanode downgrade will be impacted or not?

The docker-based upgrade/downgrade tests run a downgrade and upgrade back to the previous release as part of each CI run. The full matrix covering downgrades to older versions like 1.1.0 needs to be run manually, either by the release manager or when a change is suspected of having downgrade implications, like this one. See the docs and test runner. The tests are already slow and complex, so if you are using an ARM machine like an M1 Mac, modifying the GitHub Actions workflow and test.sh to run the tests on your Ozone fork might be better, since we only publish x86 release images right now.

For existing Ozone users, we should consider still honoring this property to keep backward compatibility, if the implementation is not too complex.

Ignoring the config, possibly with an associated log message, will not be backwards incompatible, because a datanode will not try to resume its in-progress container imports/creates/deletes after a restart. However, using the old config exposes the cluster to bugs like the one we saw in this HDDS-6441 comment:

  1. A container delete is in progress.
    • The same would be true of a container create, or of copying an import from the working dir to its destination.
  2. An I/O exception or datanode restart occurs.
  3. On startup, the datanode finds incomplete container pieces in its volume working directories and logs an error.

Normally I would be in favor of respecting old configs, but I think this case is different as the config causes a bug.
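A minimal sketch of the ignore-with-a-log-message approach mentioned above, assuming the old key simply triggers a startup warning; the class and method names here are invented for illustration.

```java
import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DeprecatedWorkDirWarning {
  private static final Logger LOG =
      LoggerFactory.getLogger(DeprecatedWorkDirWarning.class);
  private static final String OLD_KEY = "hdds.datanode.replication.work.dir";

  /** Warn (but otherwise ignore) if the deprecated work-dir key is still set. */
  static void warnIfDeprecatedKeySet(OzoneConfiguration conf) {
    if (conf.get(OLD_KEY) != null) {
      LOG.warn("{} is deprecated and ignored; container imports now use a "
          + "tmp directory on each data volume.", OLD_KEY);
    }
  }
}
```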

errose28 avatar Aug 10 '22 17:08 errose28

@errose28 thanks for the info about how the downgrade should be verified. Right, it makes sense to deprecate hdds.datanode.replication.work.dir considering all its known issues.

Hey @symious, could you update the patch accordingly?

ChenSammi avatar Aug 16 '22 02:08 ChenSammi

@ChenSammi @errose28 Updated the patch, please help review.

symious avatar Aug 16 '22 03:08 symious

@ChenSammi I have a draft of the new proposal, please have a look. I'll flesh out the PR later if the implementation is not too far from the proposal.

symious avatar Aug 25 '22 15:08 symious

@ChenSammi Please help review.

symious avatar Sep 09 '22 07:09 symious