smart_open

Check file consistency between cloud provider storages

nenkie76 opened this issue 2 years ago · 0 comments

Hi,

I've been experimenting with smart_open and can't figure out how to ensure that files are consistent when copying data between GCS and S3 (and vice versa).

from smart_open import open

# Read from GCS and write to S3, line by line
with open(uri=f"...", mode='rb', transport_params=dict(client=gcs_client)) as fin:
    with open(uri=f"...", mode='wb', transport_params=s3_tp) as fout:
        for line in fin:
            fout.write(line)

The ETags don't match (which is expected, I guess), but the files also differ in size when copied from GCS to S3: gsutil shows 1340495 bytes, and after copying to S3 it's 1291979 bytes (though the file itself seems fine). I've tried turning off S3 multipart_upload, but that doesn't change the behaviour.
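For reference, this is roughly how I tried to turn off multipart uploads; I'm assuming the multipart_upload transport param is the right knob in this smart_open version:

# Assumption: passing multipart_upload=False should make smart_open
# do a single-part upload on the S3 side instead of a multipart one.
s3_tp = dict(client=s3_client, multipart_upload=False)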

If I use the ordinary way below to read/write files, the file size taken from GCS and written to S3 matches, and I can build a validation process on top of that.

import io

# Download each blob into memory, then upload it to S3 unchanged
for blob in blobs:
    buffer = io.BytesIO()
    blob.download_to_file(buffer)
    buffer.seek(0)
    s3_client.put_object(Body=buffer, Bucket='...', Key=blob.name)
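The kind of check I'd like to end up with is roughly the sketch below; the streaming-MD5 approach and the placeholder URIs are my own assumptions, not something smart_open provides:

import hashlib

from smart_open import open

def stream_md5(uri, transport_params):
    # Hash the raw bytes exactly as smart_open returns them, and count them
    md5 = hashlib.md5()
    size = 0
    with open(uri, mode='rb', transport_params=transport_params) as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            md5.update(chunk)
            size += len(chunk)
    return md5.hexdigest(), size

src = stream_md5("gs://...", dict(client=gcs_client))
dst = stream_md5("s3://...", s3_tp)
print(src == dst)  # digest and size should match after a faithful copy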

Which mechanism can be used to validate file consistency after a copy?

PyDev console: 
macOS-13.1-arm64-arm-64bit
Python 3.10.5 (v3.10.5:f377153967, Jun  6 2022, 12:36:10) [Clang 13.0.0 (clang-1300.0.29.30)]
smart_open 6.3.0

nenkie76 · Dec 27 '22, 11:12