cloud-pipeline
pipe CLI: enable checksum calculation for S3
The pipe CLI shall be able to:
- Verify the integrity of uploaded objects (enabled by default)
- Let the user choose which algorithm is used for the checksum calculation (default: md5, override via the --checksum-alg option)
Details: https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/
According to the AWS doc: "A pre-calculated checksum value provided with the request disables automatic computation by the SDK and uses the provided value instead."
Let's walk through this approach using the CRC32 algorithm as an example.
Prerequisite
- a local file <filepath> that shall be uploaded to the bucket <bucket> at the path <key>
Pre-calculated checksum
Python3
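import base64
import zlib

filepath = '<filepath>'
# Read the whole file and compute its CRC32 (an unsigned 32-bit int in Python 3)
with open(filepath, 'rb') as f:
    crc_raw = zlib.crc32(f.read())
# S3 expects the 4 checksum bytes in big-endian order, base64-encoded
crc_bytes = crc_raw.to_bytes(4, 'big')
crc_base64 = base64.b64encode(crc_bytes).decode('utf-8')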
Python2
import base64
import zlib
import struct

filepath = '<filepath>'
# In Python 2, zlib.crc32 returns a signed 32-bit int
with open(filepath, 'rb') as stream:
    crc_raw = zlib.crc32(stream.read())
# '>i' packs the signed value as 4 big-endian bytes
crc_bytes = struct.pack('>i', crc_raw)
crc_base64 = base64.b64encode(crc_bytes).decode('utf-8')
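Both snippets above read the whole file into memory. For large files the checksum can be accumulated chunk by chunk instead, because zlib.crc32 accepts a running value as its second argument. A minimal Python 3 sketch; the helper name crc32_base64 and the 1 MiB default chunk size are illustrative and not part of pipe CLI.

```python
import base64
import zlib


def crc32_base64(filepath, chunk_size=1024 * 1024):
    """Return the base64-encoded CRC32 of a file, read in chunks."""
    crc = 0
    with open(filepath, 'rb') as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so the result equals
            # the CRC32 of the whole file.
            crc = zlib.crc32(chunk, crc)
    # 4 big-endian bytes, base64-encoded, as in the snippets above
    return base64.b64encode(crc.to_bytes(4, 'big')).decode('utf-8')
```

The result should be identical to crc_base64 from the Python3 snippet, regardless of the chunk size.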
Upload file using AWS CLI
aws s3api put-object --bucket <bucket> --key <key> --checksum-crc32 "<crc_base64>" --body "<filepath>"
Response:
{
...,
"ChecksumCRC32": "<crc_base64>",
...
}
For example, to support this in the pipe CLI, we need to add the header 'x-amz-checksum-crc32': '<crc_base64>' to the put_object request.
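With a boto3 release that already exposes the additional checksum parameters, the same header can presumably be set through the ChecksumCRC32 argument of put_object. A minimal sketch under that assumption; upload_with_crc32 is a hypothetical helper, and client is expected to be an object such as boto3.client('s3').

```python
import base64
import zlib


def upload_with_crc32(client, bucket, key, filepath):
    """Upload a file and pass its pre-calculated CRC32 to S3.

    `client` is assumed to be a boto3 S3 client whose put_object
    accepts the ChecksumCRC32 argument (newer boto3 releases).
    """
    with open(filepath, 'rb') as stream:
        body = stream.read()
    crc_bytes = zlib.crc32(body).to_bytes(4, 'big')
    crc_base64 = base64.b64encode(crc_bytes).decode('utf-8')
    # S3 is expected to reject the request if body and checksum differ
    return client.put_object(Bucket=bucket, Key=key, Body=body,
                             ChecksumCRC32=crc_base64)
```

On success the response should contain the same ChecksumCRC32 value, as in the AWS CLI output above.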
Check file checksum
aws s3api get-object-attributes --bucket <bucket> --key <key> --object-attributes Checksum
Response:
{
...,
"Checksum": {
"ChecksumCRC32": "<crc_base64>"
}
}
or using a head-object request:
aws s3api head-object --bucket <bucket> --key <key> --checksum-mode ENABLED
Multipart upload:
- a checksum shall be calculated for each chunk
- a checksum of checksums shall be transmitted to S3 when the upload is finalized
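As far as the AWS documentation describes it, the checksum of checksums is calculated over the concatenated raw (base64-decoded) checksums of the parts, and S3 reports it with the number of parts appended after a dash. A minimal sketch under that assumption; the function names are illustrative, and splitting the file into chunks is left to the caller.

```python
import base64
import zlib


def part_crc32(data):
    """Return the raw 4-byte big-endian CRC32 of one part."""
    return zlib.crc32(data).to_bytes(4, 'big')


def composite_crc32(parts):
    """Return (per-part base64 checksums, checksum of checksums)."""
    raw = [part_crc32(part) for part in parts]
    per_part = [base64.b64encode(r).decode('utf-8') for r in raw]
    # CRC32 over the concatenated raw part checksums, not over the data
    combined = part_crc32(b''.join(raw))
    composite = base64.b64encode(combined).decode('utf-8')
    return per_part, '{}-{}'.format(composite, len(parts))
```

Each per-part value would accompany the corresponding part upload, and the composite value can be compared with what S3 reports for the completed object.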
Implementation steps for old boto3
Upload:
- pre-calculate the checksum for each chunk
- add the header 'x-amz-checksum-crc32'='<...>' to the put_object request
- multipart_upload (todo)
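One possible way to add the header with a boto3 version that has no checksum arguments is the botocore event system: a handler registered for the 'before-call.s3.PutObject' event receives the request dict and can modify its headers before the request is signed. This is only a sketch of that idea; make_checksum_handler is a hypothetical helper, and the behaviour should be verified against the boto3 version that pipe CLI actually bundles.

```python
def make_checksum_handler(crc_base64):
    """Build a botocore 'before-call' handler that adds the CRC32 header."""
    def add_checksum_header(params, **kwargs):
        # `params` is the request dict botocore is about to send
        params['headers']['x-amz-checksum-crc32'] = crc_base64
    return add_checksum_header


# Usage with a real client (not executed here):
#   client = boto3.client('s3')
#   handler = make_checksum_handler(crc_base64)
#   client.meta.events.register('before-call.s3.PutObject', handler)
#   client.put_object(Bucket='<bucket>', Key='<key>', Body=body)
#   client.meta.events.unregister('before-call.s3.PutObject', handler)
```

Unregistering the handler after the call keeps the checksum of one file from leaking into later put_object requests.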
Download:
- make an additional get_object call (with the additional header 'x-amz-checksum-mode'='ENABLED')
- parse the raw response headers to get the checksum (e.g. obj.get()['ResponseMetadata']['HTTPHeaders']['x-amz-checksum-crc32'])
- read the response body, calculate its checksum, and compare it with the value from the headers
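The download steps above can be sketched as follows. verify_crc32 is a hypothetical helper that takes a get_object-style response dict; it assumes the request was already sent with 'x-amz-checksum-mode'='ENABLED' and that the body object supports read(size), as the boto3 streaming body does.

```python
import base64
import zlib


def verify_crc32(response, chunk_size=1024 * 1024):
    """Compare the body's CRC32 with the x-amz-checksum-crc32 header."""
    headers = response['ResponseMetadata']['HTTPHeaders']
    expected = headers.get('x-amz-checksum-crc32')
    if expected is None:
        # No CRC32 checksum was returned for this object
        return None
    crc = 0
    while True:
        chunk = response['Body'].read(chunk_size)
        if not chunk:
            break
        crc = zlib.crc32(chunk, crc)
    actual = base64.b64encode(crc.to_bytes(4, 'big')).decode('utf-8')
    return actual == expected
```

This direct comparison likely only holds for objects uploaded in a single request: for multipart uploads the reported value is a checksum of checksums, so the per-part logic would be needed instead.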