
[fs] basic sync tool

danking opened this issue 4 months ago

CHANGELOG: Introduce hailctl fs sync which robustly transfers one or more files between Amazon S3, Azure Blob Storage, and Google Cloud Storage.

Two distinct conceptual changes remain here. Given my limited remaining time, I am not going to split them into two pull requests. The changes are:

  1. basename always agrees with the basename UNIX utility. In particular, the basename of the folder /foo/bar/baz/ is not '', it is 'baz'. The only folders or objects whose basename is '' are objects whose names literally end in a slash, e.g. an object named gs://foo/bar/baz/.

  2. hailctl fs sync, a robust copying tool with a user-friendly CLI.
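The basename rule in item 1 can be sketched in a few lines. This is a hypothetical illustration of the described semantics for folder URLs, not the actual Hail implementation: trailing slashes are stripped before taking the final path component, matching the UNIX basename utility.

```python
def folder_basename(url: str) -> str:
    """Basename of a folder URL or path, agreeing with UNIX basename.

    Hypothetical helper: strips trailing slashes, then returns the final
    path component, so '/foo/bar/baz/' yields 'baz' rather than ''.
    (An *object* whose name literally ends in a slash is the one case
    whose basename is ''; that distinction requires knowing whether the
    URL names an object or a folder, which this sketch does not model.)
    """
    return url.rstrip('/').rsplit('/', 1)[-1]


# The folder /foo/bar/baz/ has basename 'baz', not ''.
assert folder_basename('/foo/bar/baz/') == 'baz'
assert folder_basename('gs://foo/bar/baz/') == 'baz'
```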

hailctl fs sync comprises two pieces: plan.py and sync.py. The latter, sync.py, is simple: it delegates to our existing copy infrastructure, which has been lightly modified to support this use case. The former, plan.py, is a concurrent file system diff.

plan.py generates and sync.py consumes a "plan folder" containing these files:

  1. matches files whose names and sizes match. Two columns: source URL, destination URL.

  2. differs files or folders whose names match but differ in size or in type. Four columns: source URL, destination URL, source state, destination state. Each state is either 'file', 'dir', or a size; if either state is a size, both are.

  3. srconly files only present in the source. One column: source URL.

  4. dstonly files only present in the destination. One column: destination URL.

  5. plan a proposed set of object-to-object copies. Two columns: source URL, destination URL.

  6. summary a one-line file containing the total number of copies in plan and the total number of bytes which would be copied.
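A consumer of a plan folder only needs to read these flat files back. The sketch below is illustrative, not the actual sync.py: it assumes (the source does not specify) that columns are tab-separated and that summary holds two whitespace-separated numbers, the copy count and the total bytes.

```python
import os


def read_plan_folder(plan_dir: str):
    """Read the 'plan' and 'summary' files from a plan folder.

    Hypothetical reader. Assumptions not confirmed by the source:
    columns are tab-separated, and 'summary' is two whitespace-separated
    integers (number of copies, total bytes to copy).
    """
    copies = []
    with open(os.path.join(plan_dir, 'plan')) as f:
        for line in f:
            src, dst = line.rstrip('\n').split('\t')
            copies.append((src, dst))
    with open(os.path.join(plan_dir, 'summary')) as f:
        n_copies, n_bytes = map(int, f.read().split())
    return copies, n_copies, n_bytes
```

Keeping the plan as plain columnar text means a user can inspect or even hand-edit it with ordinary UNIX tools before running --use-plan.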

As described in the CLI documentation, the intended use of these commands is:

hailctl fs sync --make-plan plan1 --copy-to gs://gcs-bucket/a s3://s3-bucket/b
hailctl fs sync --use-plan plan1

The first command generates a plan folder; the second executes the plan. Separating the process into two commands lets the user verify exactly what will be copied, including the exact destination URLs. If hailctl fs sync --use-plan fails, the user can re-run hailctl fs sync --make-plan to generate a new plan that avoids re-copying files which were already copied successfully. Likewise, after a sync completes, re-running hailctl fs sync --make-plan verifies that every file was indeed copied.
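The size-based diff at the heart of plan generation can be sketched as follows. This is a simplified, sequential illustration with hypothetical names (the real plan.py is concurrent and cloud-aware): files whose names and sizes match go to matches and are excluded from the plan, which is why re-running --make-plan after a partial sync only plans the remaining copies.

```python
def diff(src_listing: dict, dst_listing: dict):
    """Size-based diff sketch. Both arguments map relative name -> size.

    Returns the contents of the conceptual plan-folder files:
    matches, differs, srconly, dstonly, and the proposed plan.
    """
    matches, differs, srconly = [], [], []
    for name, src_size in src_listing.items():
        if name not in dst_listing:
            srconly.append(name)                       # missing at destination
        elif dst_listing[name] == src_size:
            matches.append(name)                       # assumed already copied
        else:
            differs.append((name, src_size, dst_listing[name]))
    dstonly = [name for name in dst_listing if name not in src_listing]
    # Plan to copy everything not already present with a matching size.
    plan = srconly + [name for name, _, _ in differs]
    return matches, differs, srconly, dstonly, plan
```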

Testing. This change has a few sync-specific tests but largely reuses the tests for hailtop.aiotools.copy.

Future Work. Propagating a consistent kind of hash across all clouds and using it to detect differences would be a better solution than the file-size-based difference used here. If every cloud always provided the same type of hash, this would be trivial to add. Alas, at the time of writing, S3 and Google both support CRC32C for every blob (though in S3 you must explicitly request it at object creation time), but Azure Blob Storage does not: ABS only supports MD5 sums, which Google does not support for multi-part uploads.

danking, Feb 05 '24 17:02