pipe CLI: enable checksum calculation for S3

Open sidoruka opened this issue 1 year ago • 1 comments

pipe CLI shall be capable of:

  • Verifying the integrity of uploaded objects (enabled by default)
  • Selecting the algorithm used for checksum calculation (default: md5, override via the --checksum-alg option)

Details: https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/

sidoruka, Apr 24 '24 08:04

According to the AWS doc: "A pre-calculated checksum value provided with the request disables automatic computation by the SDK and uses the provided value instead."

Let's look at this approach using the CRC32 algorithm as an example.

Prerequisite

  • a local file <filepath> that shall be uploaded to bucket <bucket> under key <key>

Pre-calculated checksum

Python3

import base64
import zlib

filepath = '<filepath>'
with open(filepath, 'rb') as f:
    crc_raw = zlib.crc32(f.read()) 

crc_bytes = crc_raw.to_bytes(4, 'big')
crc_base64 = base64.b64encode(crc_bytes).decode('utf-8')

Python2

import base64
import zlib
import struct

filepath = '<filepath>'
with open(filepath, 'rb') as stream:
    crc_raw = zlib.crc32(stream.read())

# In Python 2 zlib.crc32 returns a signed 32-bit int, hence the signed '>i' format
crc_bytes = struct.pack('>i', crc_raw)
crc_base64 = base64.b64encode(crc_bytes).decode('utf-8')
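
Both snippets above read the whole file into memory. For large files the same value can be computed incrementally. Below is a minimal sketch that should run on both Python 2 and Python 3; the function name crc32_base64 and the chunk_size parameter are made up for illustration and are not part of pipe CLI.

```python
import base64
import struct
import zlib


def crc32_base64(filepath, chunk_size=1024 * 1024):
    """Return the base64-encoded big-endian CRC32 of a file, read in chunks."""
    crc = 0
    with open(filepath, 'rb') as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            # zlib.crc32 takes the running checksum as its second argument
            crc = zlib.crc32(chunk, crc)
    # Mask to an unsigned 32-bit value so Python 2 and 3 give the same bytes
    crc_bytes = struct.pack('>I', crc & 0xffffffff)
    return base64.b64encode(crc_bytes).decode('utf-8')
```

The result is the same string as crc_base64 in the snippets above, so it can be passed to --checksum-crc32 unchanged.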

Upload file using AWS CLI

aws s3api put-object --bucket <bucket> --key <key> --checksum-crc32 "<crc_base64>" --body "<filepath>"
Response:
{
    ...,
    "ChecksumCRC32": "<crc_base64>",
    ...
}

For example, to support this in pipe CLI we need to add the header 'x-amz-checksum-crc32': '<crc_base64>' to the put_object request.
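
One possible way to inject that header with a boto3 version that has no ChecksumCRC32 argument is a botocore 'before-call' event handler. This is only a sketch: the helper names are hypothetical, `client` is assumed to be a regular boto3 S3 client, and it has not been checked against the boto3 version bundled with pipe CLI.

```python
def make_checksum_header_injector(crc_base64):
    """Build a botocore 'before-call' handler that adds the CRC32 header."""
    def add_checksum_header(params, **kwargs):
        # 'params' is the serialized request; its 'headers' dict is sent to S3
        params['headers']['x-amz-checksum-crc32'] = crc_base64
    return add_checksum_header


def put_object_with_crc32(client, bucket, key, body, crc_base64):
    """Upload 'body' with a pre-calculated CRC32 (hypothetical helper)."""
    event = 'before-call.s3.PutObject'
    handler = make_checksum_header_injector(crc_base64)
    client.meta.events.register(event, handler)
    try:
        return client.put_object(Bucket=bucket, Key=key, Body=body)
    finally:
        # Unregister so that later uploads do not reuse a stale checksum
        client.meta.events.unregister(event, handler)
```

Registering a handler on a shared client is not safe for concurrent uploads, so a real implementation would probably need a client per upload or a handler that looks the checksum up per request.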

Check file checksum

aws s3api get-object-attributes --bucket <bucket> --key <key> --object-attributes Checksum 
Response:
{
    ...,
    "Checksum": {
        "ChecksumCRC32": "<crc_base64>"
    }
}

or using a head-object request

aws s3api head-object --bucket <bucket> --key <key> --checksum-mode ENABLED

Multipart upload:

  • a checksum shall be calculated for each chunk
  • a checksum-of-checksums shall be transmitted to S3 when the upload is finalized
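
As a rough illustration of the checksum-of-checksums idea: the CRC32 is taken over the concatenated raw (not base64) per-part checksums. The "-<number of parts>" suffix below reflects how S3 appears to report such composite values and should be treated as an assumption to verify; the function names are hypothetical.

```python
import base64
import struct
import zlib


def part_crc32(data):
    """Raw 4-byte big-endian CRC32 of one part."""
    return struct.pack('>I', zlib.crc32(data) & 0xffffffff)


def composite_crc32(part_checksums):
    """CRC32 over the concatenated raw part checksums, base64-encoded.

    The '-N' suffix (N = number of parts) is an assumption about the
    format S3 reports for multipart objects.
    """
    combined = part_crc32(b''.join(part_checksums))
    encoded = base64.b64encode(combined).decode('utf-8')
    return '%s-%d' % (encoded, len(part_checksums))
```

For example, composite_crc32([part_crc32(chunk) for chunk in chunks]) gives the value to compare against what S3 returns after the upload is completed. The per-part values still have to be sent base64-encoded with each part.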

Implementation steps for older boto3 versions

Upload:

  • pre-calculate the checksum for the chunk
  • add the header 'x-amz-checksum-crc32': '<...>' to the put_object request
  • multipart_upload (todo)

Download:

  • make an additional get_object call with the extra header 'x-amz-checksum-mode': 'ENABLED'
  • parse the raw response headers to get the checksum (e.g. obj.get()['ResponseMetadata']['HTTPHeaders']['x-amz-checksum-crc32'])
  • read the response body, calculate its checksum and compare it with the value from the headers
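
The last download step could look roughly like this. It is a sketch only: verify_crc32 is a hypothetical helper, and `body` is assumed to be any file-like object with a read(size) method, such as the 'Body' of a get_object response.

```python
import base64
import struct
import zlib


def verify_crc32(body, expected_crc_base64, chunk_size=1024 * 1024):
    """Read a file-like body and compare its CRC32 with the expected value."""
    crc = 0
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        # Feed the running checksum back in so the body is never held in memory
        crc = zlib.crc32(chunk, crc)
    crc_bytes = struct.pack('>I', crc & 0xffffffff)
    actual = base64.b64encode(crc_bytes).decode('utf-8')
    return actual == expected_crc_base64
```

In practice the chunks would also be written to the destination file inside the loop. For objects uploaded in multiple parts the header most likely carries a checksum-of-checksums, which this whole-body comparison would not match.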

ekazachkova, Apr 26 '24 16:04