quilt
Use S3's built-in SHA256 hashes
- Use S3's `ChecksumAlgorithm='SHA256'` whenever possible, instead of calculating hashes manually
- Breaking change: switch from normal SHA256 to AWS's version of it: hash chunks, then hash the hashes (sketched below)
@sir-sigurd: I removed a lot of the existing hashing code and made everything much simpler, for two reasons:
- Hashing parts instead of whole files is better for parallelism, which makes everything simpler (and I'm sure that's why AWS does it that way)
- Since most of the time, hashing will be done by AWS, our own code's performance is less critical.
But, I didn't fully understand the existing code, so it's possible I missed something.
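To make the breaking change above concrete, here is a minimal sketch of the chunk-then-hash-the-hashes calculation. The fixed 8 MiB part size and the `-<part count>` suffix are assumptions for illustration (as discussed later in this thread, `get_object_attributes()` returns the checksum without the part count, and the actual part sizing policy is still being worked out):

```python
import base64
import hashlib

PART_SIZE = 8 * 1024 * 1024  # assumed part size; matches boto3's TransferConfig default


def chunked_sha256(data: bytes, part_size: int = PART_SIZE) -> str:
    """Hash each part, then hash the concatenation of the part digests."""
    digests = [
        hashlib.sha256(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    if len(digests) <= 1:
        # Small objects keep a plain SHA-256, base64-encoded like S3's ChecksumSHA256.
        return base64.b64encode(hashlib.sha256(data).digest()).decode()
    top_digest = hashlib.sha256(b"".join(digests)).digest()
    # Multipart checksums are commonly rendered as "<base64 digest>-<part count>".
    return f"{base64.b64encode(top_digest).decode()}-{len(digests)}"
```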
Codecov Report
Attention: Patch coverage is 95.18900% with 14 lines in your changes missing coverage. Please review.
Project coverage is 36.64%. Comparing base (683d8af) to head (71fa42f).
Files | Patch % | Lines |
---|---|---|
api/python/quilt3/data_transfer.py | 94.59% | 8 Missing :warning: |
api/python/quilt3/packages.py | 88.23% | 6 Missing :warning: |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #2782      +/-   ##
==========================================
+ Coverage   36.47%   36.64%   +0.16%
==========================================
  Files         717      717
  Lines       31755    31920     +165
  Branches     4689     4689
==========================================
+ Hits        11584    11697     +113
- Misses      19015    19067      +52
  Partials     1156     1156
```
Flag | Coverage Δ | |
---|---|---|
api-python | 90.89% <95.18%> (-0.65%) | :arrow_down: |
catalog | 10.62% <ø> (ø) | |
lambda | 87.74% <ø> (ø) | |
Flags with carried forward coverage won't be shown.
- it looks like we need to modify policies in the registry, since `s3:GetObjectAttributes` and `s3:GetObjectVersionAttributes` are required (see the call sketched after this list)
- should we also modify the catalog to enable checksums for uploads?
- does this work for public buckets?
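For context, this is the kind of call that needs those permissions; a minimal boto3 sketch with a hypothetical bucket and key:

```python
import boto3

s3 = boto3.client("s3")

# Reading the stored checksum (and part layout) of an existing object requires
# s3:GetObjectAttributes; passing a VersionId additionally requires
# s3:GetObjectVersionAttributes.
attrs = s3.get_object_attributes(
    Bucket="example-bucket",   # hypothetical
    Key="path/to/object",      # hypothetical
    ObjectAttributes=["Checksum", "ObjectParts", "ObjectSize"],
)
print(attrs.get("Checksum", {}).get("ChecksumSHA256"))
```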
For something like `quilt3.Package().set_dir('.', 's3://allencell/test').build('test/test', registry='s3://quilt-dev-null')`, will it still use the old algorithm?
No, it will use the new algorithm for everything (but will need to download the data and hash it locally)
@dimaryaz a few questions just so I understand all the cases, is this correct?
1. Object in S3 already has checksum - use that checksum and it's reproducible via client operations (install, verify)
2. Object in S3 does not have checksum, will not be copied - use old SHA256 hash (requires streaming object)
3. Object in S3 does not have checksum, will be copied - use checksum of copy
4. Object local, does not have checksum, will be copied - use checksum of copy
@akarve: not sure I understand what you mean here:
> 2. Object in S3 does not have checksum, will not be copied - use old SHA256 hash (requires streaming object)
In cases where you're e.g. building a package out of S3 URLs, it'll need to stream the object and calculate the hash - but it'll use the new algorithm, not the old one.
Any idea what's going on with docs? I ran `build.py`, but it didn't do anything. Also, it's just blank lines.
@dimaryaz does this PR support `quilt3 verify` for packages with old hashes?
https://github.com/quiltdata/quilt/blob/d88df8c826685f04d04aef0df4191f794ef2e824/api/python/quilt3/packages.py#L1625-L1660
It does not. It's doable - we'd need to keep the old hashing code for that.
I am OK dropping support for `quilt3 verify`, but in that case we should delete the code so users don't call it.
You mean, we don't need it for new packages, either?
Sorry, I didn't realize that it was an option. As long as we have an error message for the "old package" case (we can use the manifest version), then the new client can fail `quilt3 verify` on old manifests and should still work on new manifests.
That reminds me, do we need to bump the manifest version?
> @akarve: not sure I understand what you mean here:
> 2. Object in S3 does not have checksum, will not be copied - use old SHA256 hash (requires streaming object)
> In cases where you're e.g. building a package out of S3 URLs, it'll need to stream the object and calculate the hash - but it'll use the new algorithm, not the old one.
I didn't realize that we could use the new algorithm on an old object that isn't being copied. That sounds good, as long as, if the client copies the object in the future, we enforce the part size and get the same checksum.
> As long as we have an error message for the "old package" case (we can use the manifest version), then the new client can fail `quilt3 verify` on old manifests and should still work on new manifests.
The `verify` API just returns `True` or `False` - so unless we throw an exception instead, there isn't really a way to report an "old package". Also... right now, it doesn't appear to check the hash type at all - it should definitely fail if it's unexpected.
> That reminds me, do we need to bump the manifest version?
Not really. We're just changing the hash type - and the hash type is already part of the manifest format.
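As an illustration of the "fail if the hash type is unexpected" point above, a guard along these lines could live in the verify path. This is only a sketch: the hash-type string and where it hooks in are assumptions, not what this PR actually does.

```python
from quilt3.util import QuiltException

# Assumed identifier for the new chunked hash type; old manifests use "SHA256".
SUPPORTED_HASH_TYPES = {"sha2-256-chunked"}


def check_hash_type(entry_hash: dict) -> None:
    """Fail loudly instead of silently returning False from verify()."""
    # entry_hash is a manifest entry's hash object, e.g. {"type": ..., "value": ...}
    if entry_hash["type"] not in SUPPORTED_HASH_TYPES:
        raise QuiltException(
            f"Unsupported hash type {entry_hash['type']!r}; "
            "this package was probably built by an older client."
        )
```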
OK, do you agree that old clients (which, from the code, don't even seem to check `hash.type`, right?) will just fail on both verify and install with a new manifest whenever the physical key is multi-part? That's fine, btw, just confirming.
I can live with this as a minor update or a major one, but it shouldn't be a patch or silent. I am referring to the manifest version. Lmk what you think.
Yes - though they'll fail even for non-multipart objects, because the new checksum is base64-encoded instead of base16.
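A quick illustration of that encoding difference (plain Python, nothing quilt-specific):

```python
import base64
import hashlib

digest = hashlib.sha256(b"example").digest()
print(digest.hex())                       # old manifests: base16 (hex), 64 characters
print(base64.b64encode(digest).decode())  # new manifests: base64, 44 characters, same form as S3's ChecksumSHA256
```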
As I understand it, to calculate these new hashes you need a part sizing policy that is stable (i.e. it works the same way for every client and every version). If so, do we really need to include the number of parts in the hash?
I would still include it:
- Just to be consistent with the way AWS does it
- To catch common mismatches like multi-part upload vs normal upload (e.g., in the browser, S3 console does not seem to do multipart uploads at all)
If we need a stable part sizing policy, do you think it's safe enough to rely on the defaults of `TransferConfig` and the implementation of `ChunksizeAdjuster`?
Yeah, I think that's the best we can do.
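For concreteness, relying on those defaults would look roughly like this; the 200 GiB figure is just an arbitrary example size:

```python
from boto3.s3.transfer import TransferConfig
from s3transfer.utils import ChunksizeAdjuster

config = TransferConfig()      # multipart_chunksize defaults to 8 MiB
adjuster = ChunksizeAdjuster()

file_size = 200 * 1024 ** 3    # hypothetical 200 GiB object
# ChunksizeAdjuster grows the chunk size as needed so the upload stays within
# S3's 10,000-part limit; this is the part size the hash calculation would
# have to reproduce.
part_size = adjuster.adjust_chunksize(config.multipart_chunksize, file_size)
print(part_size)
```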
> Just to be consistent with the way AWS does it
Well, `get_object_attributes()` returns the checksum without the number of parts, so either option is eventually somewhat consistent with AWS :wink:
> To catch common mismatches like multi-part upload vs normal upload (e.g., in the browser, S3 console does not seem to do multipart uploads at all)
The S3 console does do multipart uploads (for example, this file was uploaded via the S3 console). Anyway, I don't understand how it's helpful - could you explain further or maybe provide a full example?
> Yeah, I think that's the best we can do.
Can't we have our own implementation?
I think the fact that we use s3transfer could be a problem by itself; from https://github.com/boto/s3transfer/blob/develop/README.rst:
> This project is not currently GA. If you are planning to use this code in production, make sure to lock to a minor version as interfaces may break from minor version to minor version. For a basic, stable interface of s3transfer, try the interfaces exposed in boto3
> The S3 console does do multipart uploads (for example, this file was uploaded via the S3 console). Anyway, I don't understand how it's helpful - could you explain further or maybe provide a full example?
Ok, maybe it has a different config or something. I uploaded a 15 MB file - but it didn't use a multi-part upload, even though `TransferConfig`'s default multipart threshold is 8 MB:
```
$ aws s3api get-object-attributes --bucket quilt-dima --key 'Blaise Car.mp4' --object-attributes ObjectSize,ObjectParts,Checksum
{
    "LastModified": "Thu, 21 Apr 2022 18:58:47 GMT",
    "VersionId": "id.G2YemnvIXL4H758csbGUEH86JY.2l",
    "Checksum": {
        "ChecksumSHA256": "84bMq5fjSOF5rBQUSK75wKD2vgM7wGgjj9S9fvD/Pu8="
    },
    "ObjectSize": 15814262
}
```
> Can't we have our own implementation?
Yes - what I meant is, we should try to make our own implementation compatible with the defaults used by S3, but we can't always succeed. Anyway, we should really decide what our goals are here. I don't disagree with anything - I'm just not sure what else I can do.
> Ok, maybe it has a different config or something. I uploaded a 15 MB file - but it didn't use a multi-part upload
I think it has 16 MB as the multipart threshold.
So in my example I uploaded a 32 MiB file, and its first part is 16.384 MiB and the second one is 15.616 MiB :confused:
> what I meant is, we should try to make our own implementation compatible with the defaults used by S3
That makes sense, but I don't think such defaults exist: see my comment about multipart uploads in the S3 console; the JS SDK has `partSize (Number) — default: 5mb`, while in boto3 the default part size is 8 MiB.
So I think the goals are:
- Make our sizing policy stable (e.g. we don't want it to change because some defaults of some dependency changed with a version update), because the hash calculation relies on it
- This policy should be well-defined so it can be consistently implemented in all clients (e.g. we can't use s3transfer's `ChunksizeAdjuster` from the catalog); a dependency-free sketch follows this list
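Not something this thread settled on - just a minimal sketch of what a dependency-free, deterministic policy could look like, assuming an 8 MiB floor and S3's 10,000-part limit:

```python
MIN_PART_SIZE = 8 * 1024 * 1024  # assumed floor, matching boto3's default chunk size
MAX_PARTS = 10_000               # S3's hard limit on the number of parts


def part_size_for(file_size: int) -> int:
    """Derive a part size deterministically from the file size alone."""
    part_size = MIN_PART_SIZE
    # Double the part size until the whole file fits within MAX_PARTS parts,
    # so every client (Python, catalog JS, lambdas) computes the same value.
    while file_size > part_size * MAX_PARTS:
        part_size *= 2
    return part_size
```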
Ping?
Hi, we really need this feature. What's the current status? Seems to have been stuck for a while. Is anybody actively working on it?
@PelleUllberg Thanks for your message - this is actively being worked on outside of this PR.
Just rebased everything on top of master.
We now have a bit of a problem with the "dedupe" parameter: it is supposed to skip the package push if the remote top hash already matches the local top hash - but we might not know the top hash until we actually push.
The code now matches the spec as far as I can tell - however, I'm pretty sure we cannot support `--dedupe` anymore, because we might not know the hash until we're done pushing. @akarve - what do you think? Do we drop `--dedupe`?
@dimaryaz could you update this (recently added) type? https://github.com/quiltdata/quilt/blob/7a82dfcd869035c5e71a5cc7cd912af35d72515c/api/python/quilt3/packages.py#L74C7-L81
@nl0 - I already had the changelog entries, unless there's something else I missed?