
Use S3's built-in SHA256 hashes

Open dimaryaz opened this issue 2 years ago • 21 comments

  • Use S3's ChecksumAlgorithm='SHA256' whenever possible, instead of calculating hashes manually
  • Breaking change: switch from normal SHA256 to AWS's version of it: hash the chunks, then hash the hashes (see the sketch below)
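
For concreteness, a minimal sketch of the "hash the chunks, then hash the hashes" scheme in Python. The 8 MiB part size is only an illustrative assumption; the actual part sizing policy is discussed further down the thread.

import base64
import hashlib

PART_SIZE = 8 * 1024 * 1024  # illustrative part size (8 MiB), not a decided value

def chunked_sha256(data: bytes, part_size: int = PART_SIZE) -> str:
    # SHA-256 each fixed-size part.
    part_digests = [
        hashlib.sha256(data[offset:offset + part_size]).digest()
        for offset in range(0, len(data), part_size)
    ]
    if not part_digests:  # empty input still needs a digest
        part_digests = [hashlib.sha256(b"").digest()]
    if len(part_digests) == 1:
        # Single part: the checksum is just the plain SHA-256 of the data.
        top = part_digests[0]
    else:
        # Multiple parts: SHA-256 over the concatenated binary part digests.
        top = hashlib.sha256(b"".join(part_digests)).digest()
    # S3 reports checksums base64-encoded rather than as hex.
    return base64.b64encode(top).decode()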

dimaryaz avatar Apr 06 '22 06:04 dimaryaz

@sir-sigurd: I removed a lot of the existing hashing code and made everything much simpler, for two reasons:

  • Hashing parts instead of whole files is better for parallelism, which makes everything easier (and I'm sure that's why AWS does it that way)
  • Since most of the time, hashing will be done by AWS, our own code's performance is less critical.

But, I didn't fully understand the existing code, so it's possible I missed something.

dimaryaz avatar Apr 06 '22 06:04 dimaryaz

Codecov Report

Attention: Patch coverage is 95.18900%, with 14 lines in your changes missing coverage. Please review.

Project coverage is 36.64%. Comparing base (683d8af) to head (71fa42f).

Files Patch % Lines
api/python/quilt3/data_transfer.py 94.59% 8 Missing :warning:
api/python/quilt3/packages.py 88.23% 6 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2782      +/-   ##
==========================================
+ Coverage   36.47%   36.64%   +0.16%     
==========================================
  Files         717      717              
  Lines       31755    31920     +165     
  Branches     4689     4689              
==========================================
+ Hits        11584    11697     +113     
- Misses      19015    19067      +52     
  Partials     1156     1156              
Flag Coverage Δ
api-python 90.89% <95.18%> (-0.65%) :arrow_down:
catalog 10.62% <ø> (ø)
lambda 87.74% <ø> (ø)

Flags with carried forward coverage won't be shown.


codecov[bot] avatar Apr 06 '22 23:04 codecov[bot]

  1. it looks like we need to modify policies in registry since s3:GetObjectAttributes and s3:GetObjectVersionAttributes are required (the sketch after this list shows the calls involved)
  2. should we also modify catalog to enable checksums for uploads?
  3. does this work for public buckets?
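
For reference, a minimal boto3 sketch of the calls behind questions 1 and 2; the bucket and key names are placeholders.

import boto3

s3 = boto3.client("s3")
bucket, key = "example-bucket", "example/key.bin"  # placeholders

# Upload with S3 computing and storing the SHA-256 checksum server-side.
s3.put_object(
    Bucket=bucket,
    Key=key,
    Body=b"hello world",
    ChecksumAlgorithm="SHA256",
)

# Reading the stored checksum back is what requires s3:GetObjectAttributes
# (and s3:GetObjectVersionAttributes when a VersionId is involved).
attrs = s3.get_object_attributes(
    Bucket=bucket,
    Key=key,
    ObjectAttributes=["Checksum", "ObjectParts", "ObjectSize"],
)
print(attrs["Checksum"]["ChecksumSHA256"])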

sir-sigurd avatar Apr 08 '22 18:04 sir-sigurd

For something like quilt3.Package().set_dir('.', 's3://allencell/test').build('test/test', registry='s3://quilt-dev-null') will it still use the old algorithm?

No, it will use the new algorithm for everything (but will need to download the data and hash it locally)

dimaryaz avatar Apr 12 '22 07:04 dimaryaz

@dimaryaz a few questions just so I understand all the cases, is this correct?

  1. Object in S3 already has checksum - use that checksum and it's reproducible via client operations (install, verify)
  2. Object in S3 does not have checksum, will not be copied - use old SHA256 hash (requires streaming object)
  3. Object in S3 does not have checksum, will be copied - use checksum of copy
  4. Object local, does not have checksum, will be copied - use checksum of copy

akarve avatar Apr 12 '22 17:04 akarve

@akarve: not sure I understand what you mean here:

2. Object in S3 does not have checksum, will not be copied - use old SHA256 hash (requires streaming object)

In cases where you're e.g. building a package out of S3 URLs, it'll need to stream the object and calculate the hash - but it'll use the new algorithm, not the old one.

dimaryaz avatar Apr 13 '22 05:04 dimaryaz

Any idea what's going on with docs? I ran build.py, but it didn't do anything. Also, it's just blank lines.

dimaryaz avatar Apr 13 '22 05:04 dimaryaz

@dimaryaz does this PR support quilt3 verify for packages with old hashes? https://github.com/quiltdata/quilt/blob/d88df8c826685f04d04aef0df4191f794ef2e824/api/python/quilt3/packages.py#L1625-L1660

sir-sigurd avatar Apr 13 '22 14:04 sir-sigurd

@dimaryaz does this PR support quilt3 verify for packages with old hashes?

It does not. It's doable - we'd need to keep the old hashing code for that.

dimaryaz avatar Apr 13 '22 16:04 dimaryaz

@dimaryaz does this PR support quilt3 verify for packages with old hashes?

It does not. It's doable - we'd need to keep the old hashing code for that.

I am OK dropping support for quilt3 verify but in that case we should delete the code so users don't call it.

akarve avatar Apr 13 '22 16:04 akarve

I am OK dropping support for quilt3 verify but in that case we should delete the code so users don't call it.

You mean, we don't need it for new packages, either?

dimaryaz avatar Apr 13 '22 17:04 dimaryaz

I am OK dropping support for quilt3 verify but in that case we should delete the code so users don't call it.

You mean, we don't need it for new packages, either?

Sorry, I didn't realize that it was an option. As long as we have an error message for the "old package" case (we can use manifest version) then the new client can fail quilt3 verify on old manifests and should still work on new manifests.

That reminds me, do we need to bump the manifest version?

akarve avatar Apr 13 '22 17:04 akarve

@akarve: not sure I understand what you mean here:

2. Object in S3 does not have checksum, will not be copied - use old SHA256 hash (requires streaming object)

In cases where you're e.g. building a package out of S3 URLs, it'll need to stream the object and calculate the hash - but it'll use the new algorithm, not the old one.

I didn't realize that we could use the new algorithm on an old object that isn't being copied. That sounds good, as long as, if the client copies the object in the future, we enforce the part size and get the same checksum.

akarve avatar Apr 13 '22 20:04 akarve

As long as we have an error message for the "old package" case (we can use manifest version) then the new client can fail quilt3 verify on old manifests and should still work on new manifests.

The verify API just returns True or False - so unless we throw an exception instead, there isn't really a way to report an "old package". Also... right now, it doesn't appear to check the hash type at all - should definitely fail if it's unexpected.

That reminds me, do we need to bump the manifest version?

Not really. We're just changing the hash type - and the hash type is already part of the manifest format.
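
Regarding the hash type check mentioned above, a minimal sketch of the kind of guard being discussed; the type name and the entry layout are illustrative assumptions, not the actual quilt3 code.

SUPPORTED_HASH_TYPES = {"sha2-256-chunked"}  # illustrative name only

def check_hash_type(entry_hash: dict) -> None:
    # Fail loudly on hash types this client doesn't know how to verify.
    hash_type = entry_hash.get("type")
    if hash_type not in SUPPORTED_HASH_TYPES:
        raise ValueError(
            f"Unsupported hash type {hash_type!r}; this package may have been "
            "built by an older client."
        )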

dimaryaz avatar Apr 19 '22 01:04 dimaryaz

As long as we have an error message for the "old package" case (we can use manifest version) then the new client can fail quilt3 verify on old manifests and should still work on new manifests.

The verify API just returns True or False - so unless we throw an exception instead, there isn't really a way to report an "old package". Also... right now, it doesn't appear to check the hash type at all - should definitely fail if it's unexpected.

That reminds me, do we need to bump the manifest version?

Not really. We're just changing the hash type - and the hash type is already part of the manifest format.

OK, do you agree that old clients (which, from the code, don't even seem to check hash.type, right?) will just fail both verify and install with a new manifest whenever the physical key is multi-part? That's fine, btw, just confirming.

I can live with this as a minor update or a major one, but it shouldn't be a patch or silent. I am referring to the manifest version. Lmk what you think.

akarve avatar Apr 19 '22 02:04 akarve

OK, do you agree that old clients (which, from the code, don't even seem to check hash.type, right?) will just fail both verify and install with a new manifest whenever the physical key is multi-part? That's fine, btw, just confirming.

Yes - though they'll fail even for non-multipart objects, because the new checksum is base64-encoded instead of base16.
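
For concreteness, the same SHA-256 digest rendered both ways (old manifests store base16/hex, S3 reports base64):

import base64
import hashlib

digest = hashlib.sha256(b"example").digest()
print(digest.hex())                       # base16: 64 characters
print(base64.b64encode(digest).decode())  # base64: 44 characters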

dimaryaz avatar Apr 19 '22 03:04 dimaryaz

As I understand it, to calculate these new hashes you need to use a certain part sizing policy that is stable (i.e. works the same way for every client and every version); if so, do we really need to include the number of parts in the hash?

I would still include it:

  • Just to be consistent with the way AWS does it
  • To catch common mismatches like multi-part upload vs normal upload (e.g., in the browser, S3 console does not seem to do multipart uploads at all)

if we need a stable part sizing policy, do you think it's safe enough to rely on the defaults of TransferConfig and the implementation of ChunksizeAdjuster?

Yeah, I think that's the best we can do.
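
Concretely, "relying on the defaults" would look roughly like this with today's boto3/s3transfer; these defaults are library-defined and could change between versions, which is the risk being discussed.

from boto3.s3.transfer import TransferConfig
from s3transfer.utils import ChunksizeAdjuster

config = TransferConfig()  # currently defaults to an 8 MiB multipart chunk size

def part_size_for(file_size: int) -> int:
    # ChunksizeAdjuster grows the chunk size when the file would otherwise
    # exceed S3's 10,000-part limit, and clamps it to S3's min/max part sizes.
    return ChunksizeAdjuster().adjust_chunksize(config.multipart_chunksize, file_size)

print(part_size_for(15 * 1024 * 1024))          # small file: unchanged 8 MiB
print(part_size_for(200 * 1024 * 1024 * 1024))  # 200 GiB file: adjusted upward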

dimaryaz avatar Apr 20 '22 05:04 dimaryaz

Just to be consistent with the way AWS does it

Well, get_object_attributes() returns the checksum without the number of parts, so either option is somewhat consistent with AWS :wink:

To catch common mismatches like multi-part upload vs normal upload (e.g., in the browser, S3 console does not seem to do multipart uploads at all)

The S3 console does multipart uploads (for example, this file was uploaded via the S3 console). Anyway, I don't understand how it's helpful - could you explain further or maybe provide a full example?

Yeah, I think that's the best we can do.

Can't we have our own implementation?

I think the fact that we use s3transfer could be a problem by itself; from https://github.com/boto/s3transfer/blob/develop/README.rst:

This project is not currently GA. If you are planning to use this code in production, make sure to lock to a minor version as interfaces may break from minor version to minor version. For a basic, stable interface of s3transfer, try the interfaces exposed in boto3

sir-sigurd avatar Apr 20 '22 07:04 sir-sigurd

The S3 console does multipart uploads (for example, this file was uploaded via the S3 console). Anyway, I don't understand how it's helpful - could you explain further or maybe provide a full example?

Ok, maybe it has a different config or something. I uploaded a 15 MB file - but it didn't use a multi-part upload, even though TransferConfig's default multipart threshold is 8 MB:

$ aws s3api get-object-attributes --bucket quilt-dima --key 'Blaise Car.mp4' --object-attributes ObjectSize,ObjectParts,Checksum
{
    "LastModified": "Thu, 21 Apr 2022 18:58:47 GMT",
    "VersionId": "id.G2YemnvIXL4H758csbGUEH86JY.2l",
    "Checksum": {
        "ChecksumSHA256": "84bMq5fjSOF5rBQUSK75wKD2vgM7wGgjj9S9fvD/Pu8="
    },
    "ObjectSize": 15814262
}

Can't we have our own implementation?

Yes - what I meant is, we should try to make our own implementation compatible with the defaults used by S3, but we can't always succeed. Anyway, we should really decide what our goals are here. I don't disagree with anything - just not sure what else I can do.

dimaryaz avatar Apr 21 '22 19:04 dimaryaz

Ok, maybe it has a different config or something. I uploaded a 15 MB file - but it didn't use a multi-part upload

I think it has 16 MB as the multipart threshold: (screenshot omitted)

So in my example I uploaded a 32 MiB file, and its first part is 16.384 MiB and the second one is 15.616 MiB :confused:

what I meant is, we should try to make our own implementation compatible with the defaults used by S3

That makes sense, but I don't think such defaults exist: see my comment above about multipart uploads in the S3 console; the JS SDK's partSize defaults to 5 MB, and in boto3 the default part size is 8 MiB.

So I think the goals are (a sketch of one possible policy follows the list):

  1. Make our sizing policy stable (e.g. we don't want it to change because some dependency's defaults changed with a version update), because hash calculation relies on that
  2. This policy should be well-defined so it can be consistently implemented in all clients (e.g. we can't use s3transfer's ChunksizeAdjuster from the catalog)
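
As a sketch only, one possible shape for such a policy - the constants are assumptions, not a decided spec; the point is that nothing depends on a library default:

DEFAULT_PART_SIZE = 8 * 1024 * 1024  # assumed fixed part size, 8 MiB
MAX_PARTS = 10_000                   # S3's hard limit on multipart parts

def stable_part_size(file_size: int) -> int:
    # Deterministic: start from the fixed default and double only as needed
    # to stay under MAX_PARTS, so every client computes the same part size.
    part_size = DEFAULT_PART_SIZE
    while file_size > part_size * MAX_PARTS:
        part_size *= 2
    return part_size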

sir-sigurd avatar Apr 22 '22 04:04 sir-sigurd

Ping?

dimaryaz avatar May 09 '22 18:05 dimaryaz

Hi, we really need this feature. What's the current status? Seems to have been stuck for a while. Is anybody actively working on it?

PelleUllberg avatar Jan 24 '23 09:01 PelleUllberg

Hi, we really need this feature. What's the current status? Seems to have been stuck for a while. Is anybody actively working on it?

@PelleUllberg Thanks for your message - this is actively being worked on outside of this PR.

robnewman avatar Apr 21 '23 18:04 robnewman

Just rebased everything on top of master.

We now have a bit of a problem with the "dedupe" parameter: it is supposed to skip the package push if the remote top hash already matches the local top hash - but we might not know the top hash until we actually push.

dimaryaz avatar May 03 '23 23:05 dimaryaz

The code now matches the spec as far as I can tell - however, I'm pretty sure we cannot support --dedupe anymore because we might not know the hash until we're done pushing. @akarve - what do you think? Do we drop --dedupe?

dimaryaz avatar May 09 '23 18:05 dimaryaz

@dimaryaz could you update this (recently added) type? https://github.com/quiltdata/quilt/blob/7a82dfcd869035c5e71a5cc7cd912af35d72515c/api/python/quilt3/packages.py#L74C7-L81

sir-sigurd avatar Feb 23 '24 06:02 sir-sigurd

@nl0 - I already had the changelog entries, unless there's something else I missed?

dimaryaz avatar Feb 26 '24 18:02 dimaryaz