s3fs

Add ability to check integrity of uploaded object

Open ciaransweet opened this issue 4 years ago • 5 comments

Hi Folks,

I've been reading https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/ and I'm wondering whether s3fs could/does already support this functionality?

I've got a situation where I request a file from a data provider and in a separate request I can get the checksum of the file.

What I'd like to do is use s3fs to write the file and verify the integrity of the uploaded object by providing the checksum the data provider gives me, something like:

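import shutil
import requests
import s3fs

fs = s3fs.S3FileSystem()
r = requests.get("a-url", stream=True)
if r.status_code == 200:
    try:
        # md5checksum and IntegrityError are the API being requested; neither exists in s3fs today
        with fs.open("<bucket-name>/object.type", "wb", md5checksum="ACHECKSUM") as file_out:
            shutil.copyfileobj(r.raw, file_out)
    except IntegrityError as ex:
        pass  # Tidy up here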

I've tried using the .checksum() function, but it doesn't return the checksum I expect. If I download the file, hashlib.md5().hexdigest() gives the correct value, so I know it uploaded fine...
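A possible explanation for the mismatch, offered as a sketch rather than a diagnosis: as far as I understand it, .checksum() is derived from the object's ETag and is returned as an integer, and an S3 ETag is only the plain MD5 for objects written in a single PUT without KMS encryption. For multipart uploads the commonly observed (not formally guaranteed) ETag is the MD5 of the concatenated per-part MD5 digests, followed by "-" and the part count. The helpers below, md5_hex and multipart_etag, are hypothetical names, and the part size used by the uploader is an assumption you would need to match.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Plain MD5 of the whole payload, as a hex string."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    """ETag commonly observed for a multipart upload with fixed-size parts.

    Assumption: every part except the last is exactly part_size bytes.
    """
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    # Concatenate the raw (binary) MD5 digest of each part, then hash again.
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


payload = b"x" * 25
single_put = md5_hex(payload)                      # matches ETag of a single PUT
multipart = multipart_etag(payload, part_size=10)  # three parts: 10 + 10 + 5 bytes
```

If the ETag reported for the object ends in "-N", comparing it with multipart_etag (using the same part size as the upload) is more likely to match than comparing it with the plain MD5.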

Appreciate any pointers you might have on this!

ciaransweet · Feb 26 '21 13:02

In gcsfs, this is handled explicitly. See, for example, the nice refactor by @nbren12, which groups the integrity checkers in https://github.com/dask/gcsfs/blob/main/gcsfs/checkers.py . That code could be upstreamed to fsspec and made available to other libraries such as this one. However, the case there is a bit different: it checks that the hash of the data being sent equals the hash reported back by the server, rather than a user-provided one. It doesn't seem like a bad idea, though!
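To illustrate the user-provided variant discussed here, below is a minimal sketch of a wrapper that hashes data as it is written and compares the result with an expected MD5 on close. MD5CheckedWriter and IntegrityError are hypothetical names; nothing like them is assumed to exist in s3fs, gcsfs, or fsspec. The wrapper only needs an object with write() and close(), so a file returned by a filesystem's open(..., "wb") should work in place of the in-memory buffer.

```python
import hashlib
import io


class IntegrityError(Exception):
    """Hypothetical exception raised when the written data fails the check."""


class MD5CheckedWriter:
    """Wrap a writable binary file and verify an expected MD5 when it closes."""

    def __init__(self, raw, expected_md5_hex):
        self._raw = raw
        self._expected = expected_md5_hex.lower()
        self._md5 = hashlib.md5()

    def write(self, data):
        # Hash exactly the bytes that are forwarded to the underlying file.
        self._md5.update(data)
        return self._raw.write(data)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self._raw.close()
        # Only check the digest if the body of the with-block succeeded.
        if exc_type is None and self._md5.hexdigest() != self._expected:
            raise IntegrityError(
                f"expected MD5 {self._expected}, got {self._md5.hexdigest()}"
            )
        return False


good = hashlib.md5(b"hello").hexdigest()
with MD5CheckedWriter(io.BytesIO(), good) as out:
    out.write(b"hello")
```

Note the limitation of this approach: the check runs after the underlying file has been closed, so with a remote store the object may already exist when IntegrityError is raised and the caller would have to delete it.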

martindurant · Feb 26 '21 14:02

I wonder if there is an abstraction suitable for fsspec. Currently the checkers in gcsfs query Google-specific HTTP headers and JSON keys.

nbren12 · Feb 26 '21 16:02

I'm not sure about the internals of s3fs but put_object in boto3 takes a ContentMD5 and will fail the request if your resultant file doesn't match the checksum. Guessing that isn't quite as easy to use under the hood here?
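For reference, a minimal sketch of the boto3 route mentioned above. ContentMD5 expects the base64 encoding of the raw 16-byte MD5 digest, not the hex string that data providers usually publish, so a conversion is needed. The function names content_md5_from_hex and put_with_md5 are hypothetical, and the bucket and key are placeholders; on a mismatch S3 is expected to reject the request, which boto3 surfaces as a botocore ClientError.

```python
import base64


def content_md5_from_hex(md5_hex: str) -> str:
    """Convert a hex MD5 string into the base64 form used by ContentMD5."""
    return base64.b64encode(bytes.fromhex(md5_hex)).decode("ascii")


def put_with_md5(bucket: str, key: str, body: bytes, md5_hex: str) -> None:
    """Upload body in a single PUT and let S3 verify it against md5_hex."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentMD5=content_md5_from_hex(md5_hex),
    )


header_value = content_md5_from_hex("5d41402abc4b2a76b9719d911017c592")
```

This only covers a single put_object call with the whole body in memory; a streamed or multipart upload would need a per-part digest instead.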

ciaransweet · Feb 26 '21 17:02

s3fs uses (aio)botocore, which does support this same call.

martindurant · Feb 26 '21 17:02

> I wonder if there is an abstraction suitable for fsspec.

The code that takes the uploaded data and forms a checksum would be the same, but how that checksum is used would be different.
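One way to picture that split is sketched below, with all names hypothetical and no claim that gcsfs or fsspec is structured this way. MD5Accumulator is the shared part that hashes the uploaded bytes; the three small functions are the backend-specific part, each comparing the same digest against a different source of truth. The header format parsed by matches_goog_hash_header is a simplified assumption about a comma-separated "name=value" header.

```python
import base64
import hashlib


class MD5Accumulator:
    """Shared part: hash the bytes as they are uploaded."""

    def __init__(self):
        self._md5 = hashlib.md5()

    def update(self, data: bytes) -> None:
        self._md5.update(data)

    def hexdigest(self) -> str:
        return self._md5.hexdigest()

    def b64digest(self) -> str:
        return base64.b64encode(self._md5.digest()).decode("ascii")


def matches_goog_hash_header(acc: MD5Accumulator, header_value: str) -> bool:
    """GCS-style: compare with a base64 MD5 inside a 'crc32c=...,md5=...' header."""
    for item in header_value.split(","):
        name, _, value = item.strip().partition("=")
        if name == "md5":
            return value == acc.b64digest()
    return False


def matches_single_part_etag(acc: MD5Accumulator, etag: str) -> bool:
    """S3-style: compare with a quoted hex ETag from a single-part upload."""
    return etag.strip('"').lower() == acc.hexdigest()


def matches_user_checksum(acc: MD5Accumulator, expected_hex: str) -> bool:
    """User-supplied: compare with a hex MD5 obtained from the data provider."""
    return expected_hex.lower() == acc.hexdigest()


acc = MD5Accumulator()
for chunk in (b"hel", b"lo"):
    acc.update(chunk)
```

Returning a bool keeps the sketch neutral about error handling; a real checker might raise instead, and an S3 variant would also have to handle multipart ETags.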

martindurant · Feb 26 '21 17:02