
Error downloading large files when the server returns a whole-object crc32 on S3-compatible storage

Open hxddh opened this issue 8 months ago • 4 comments

Describe the bug

When using S3-compatible storage, a GetObject download larger than 8MB is split into ranged requests. The server returns the crc32 of the whole object on each ranged response, but the SDK validates that value against each individual range, so downloading large files fails with a checksum error.

Regression Issue

  • [x] Select this option if this issue appears to be a regression.

Expected Behavior

When S3-compatible storage returns the crc32 of the whole object rather than the crc32 of the requested range, the SDK should validate only the crc32 of the complete file.

Current Behavior

geobject_bos.log

```
2025-04-07 02:54:59,752 - botocore.parsers - DEBUG - Response headers: {'Date': 'Mon, 07 Apr 2025 02:54:59 GMT', 'Content-Type': 'application/x-apple-diskimage', 'Content-Length': '8388608', 'Connection': 'keep-alive', 'Accept-Ranges': 'bytes', 'Content-Range': 'bytes 16777216-25165823/115452447', 'ETag': '"-334f81da6dab708e0a2ff7debac4b171"', 'Expires': 'Thu, 10 Apr 2025 02:54:59 GMT', 'Last-Modified': 'Tue, 01 Apr 2025 11:09:22 GMT', 'Server': 'BceBos', 'X-Amz-Checksum-Crc32': 'iXcQTg==', 'X-Amz-Id-2': 'PNxovhv15mlenXrgdTlJ2HiseDUCz8S/pqrliau4O0luY9a8/faytQzc7dgtXJ0jV8hyzN/LCVIeQ6PP1D5duQ==', 'X-Amz-Request-Id': '1a9b4d6f-29ba-4122-8494-bc9590cffb2a', 'X-Amz-Storage-Class': 'STANDARD', 'X-Bce-Flow-Control-Type': '-1', 'X-Bce-Is-Transition': 'false'}
2025-04-07 02:54:59,752 - botocore.parsers - DEBUG - Response body: <botocore.httpchecksum.StreamingChecksumBody object at 0x7f9f64719300>
2025-04-07 02:54:59,753 - botocore.hooks - DEBUG - Event needs-retry.s3.GetObject: calling handler <function _update_status_code at 0x7f9f656f5750>
2025-04-07 02:54:59,753 - botocore.hooks - DEBUG - Event needs-retry.s3.GetObject: calling handler <botocore.retryhandler.RetryHandler object at 0x7f9f6482c880>
2025-04-07 02:54:59,754 - botocore.retryhandler - DEBUG - No retry needed.
2025-04-07 02:54:59,754 - botocore.hooks - DEBUG - Event needs-retry.s3.GetObject: calling handler <bound method S3RegionRedirectorv2.redirect_from_error of <botocore.utils.S3RegionRedirectorv2 object at 0x7f9f6482c940>>
2025-04-07 02:54:59,939 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,001 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,006 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,007 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,027 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,029 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,035 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,041 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,111 - s3transfer.futures - DEBUG - Submitting task IORenameFileTask(transfer_id=0, {'final_filename': '/root/Downloads/Cherry-Studio-1.1.2-arm64.dmg'}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f9f6482ccd0> for transfer request: 0.
2025-04-07 02:55:00,111 - s3transfer.utils - DEBUG - Acquiring 0
2025-04-07 02:55:00,111 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,111 - s3transfer.tasks - DEBUG - IORenameFileTask(transfer_id=0, {'final_filename': '/root/Downloads/Cherry-Studio-1.1.2-arm64.dmg'}) about to wait for the following futures []
2025-04-07 02:55:00,111 - s3transfer.tasks - DEBUG - IORenameFileTask(transfer_id=0, {'final_filename': '/root/Downloads/Cherry-Studio-1.1.2-arm64.dmg'}) done waiting for dependent futures
2025-04-07 02:55:00,113 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,115 - main - ERROR - Download failed:
Traceback (most recent call last):
  File "/root/bos_demo.py", line 80, in <module>
    client.download_file(bucket_name, object_key, local_path)
  File "/usr/local/lib/python3.10/dist-packages/botocore/context.py", line 124, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/boto3/s3/inject.py", line 218, in download_file
    return transfer.download_file(
  File "/usr/local/lib/python3.10/dist-packages/boto3/s3/transfer.py", line 406, in download_file
    future.result()
  File "/usr/local/lib/python3.10/dist-packages/s3transfer/futures.py", line 111, in result
    return self._coordinator.result()
  File "/usr/local/lib/python3.10/dist-packages/s3transfer/futures.py", line 272, in result
    raise self._exception
  File "/usr/local/lib/python3.10/dist-packages/s3transfer/tasks.py", line 142, in __call__
    return self._execute_main(kwargs)
  File "/usr/local/lib/python3.10/dist-packages/s3transfer/tasks.py", line 165, in _execute_main
    return_value = self._main(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/s3transfer/download.py", line 581, in _main
    for chunk in chunks:
  File "/usr/local/lib/python3.10/dist-packages/s3transfer/download.py", line 726, in __next__
    chunk = self._body.read(self._chunksize)
  File "/usr/local/lib/python3.10/dist-packages/s3transfer/utils.py", line 614, in read
    value = self._stream.read(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/botocore/httpchecksum.py", line 243, in read
    self._validate_checksum()
  File "/usr/local/lib/python3.10/dist-packages/botocore/httpchecksum.py", line 252, in _validate_checksum
    raise FlexibleChecksumError(error_msg=error_msg)
botocore.exceptions.FlexibleChecksumError: Expected checksum iXcQTg== did not match calculated checksum: AoSIYQ==
```
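
To illustrate why the mismatch is inevitable here (synthetic data, not the actual object from the log): the crc32 of a single range can never be expected to equal the crc32 of the whole object, which is the value the server is putting in `X-Amz-Checksum-Crc32` on every ranged response.

```python
import base64
import zlib

# Synthetic stand-in for a large object that would be downloaded in multiple ranges.
full_object = bytes(range(256)) * (8 * 1024)   # ~2 MB of repeatable data
range_slice = full_object[: 1024 * 1024]       # one "range" of that object

def crc32_b64(data: bytes) -> str:
    """CRC32 of `data`, encoded the way S3 reports it (big-endian bytes, base64)."""
    return base64.b64encode(zlib.crc32(data).to_bytes(4, "big")).decode()

print("full object :", crc32_b64(full_object))  # analogous to the X-Amz-Checksum-Crc32 header
print("single range:", crc32_b64(range_slice))  # analogous to what the SDK computes per range

# The two values differ, so validating each range against the whole-object
# checksum raises FlexibleChecksumError, as shown in the traceback above.
```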


### Reproduction Steps

```python
import logging
import sys
import boto3
from botocore.client import Config
import os

# ------------------- Logging Configuration -------------------
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)]
)
logging.getLogger('boto3').setLevel(logging.DEBUG)
logging.getLogger('botocore').setLevel(logging.DEBUG)
logging.getLogger('urllib3').setLevel(logging.DEBUG)
logger = logging.getLogger(__name__)

# ------------------- BOS Client Configuration -------------------
access_key = '*********'
secret_key = '*******'
endpoint_url = 'https://s3.bj.bcebos.com'
region_name = 'bj'

client = boto3.client(
    's3',
    endpoint_url=endpoint_url,
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    region_name=region_name,
    config=Config(signature_version='s3v4')
)

# ------------------- List Buckets -------------------
try:
    logger.info("Listing buckets...")
    response = client.list_buckets()
    logger.debug("BOS response raw data: %s", response)
    print("Bucket list:")
    for bucket in response['Buckets']:
        logger.debug("Bucket discovered: %s", bucket['Name'])
        print(f"- {bucket['Name']}")
except Exception as e:
    logger.error("Failed to list buckets:", exc_info=True)

# ------------------- File Upload -------------------
bucket_name = 'my-unique-bucket-3'
object_key = 'test.txt'
file_path = './test.txt'

try:
    logger.info("Attempting to upload file: %s -> %s/%s", file_path, bucket_name, object_key)
    if not os.path.exists(file_path):
        logger.warning("Local file does not exist: %s", file_path)
        raise FileNotFoundError(f"File {file_path} not found")

    client.upload_file(file_path, bucket_name, object_key)
    logger.info("Upload successful, verifying file ETag: %s",
        client.head_object(Bucket=bucket_name, Key=object_key)['ETag'])
except Exception as e:
    logger.error("Upload failed:", exc_info=True)

# ------------------- File Download -------------------
object_key = 'Cherry-Studio-1.1.2-arm64.dmg'
local_path = '/root/Downloads/Cherry-Studio-1.1.2-arm64.dmg'

try:
    logger.info("Attempting to download file: %s/%s -> %s", bucket_name, object_key, local_path)

    # Check bucket permissions
    try:
        client.head_bucket(Bucket=bucket_name)
    except Exception as e:
        logger.error("Bucket access permission check failed:", exc_info=True)
        raise

    # Perform download
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    logger.debug("Local directory ensured: %s", os.path.dirname(local_path))

    client.download_file(bucket_name, object_key, local_path)
    logger.info("Download completed, file size: %d bytes", os.path.getsize(local_path))
except Exception as e:
    logger.error("Download failed:", exc_info=True)
```

### Possible Solution

When S3-compatible storage returns only the crc32 of the entire object, not the crc32 of the range slice, the SDK should validate the checksum against the complete file rather than each range, so the download does not fail.
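
In the meantime, there are client-side knobs that avoid this failure mode. This is only a sketch of a workaround, assuming botocore 1.36+ (where the `response_checksum_validation` config option exists) and an object smaller than the chosen multipart threshold; it is not the SDK change requested above.

```python
import boto3
from botocore.client import Config
from boto3.s3.transfer import TransferConfig

# Option 1: only validate response checksums when validation is explicitly requested,
# so ranged GetObject responses are not checked against the whole-object crc32.
client = boto3.client(
    's3',
    endpoint_url='https://s3.bj.bcebos.com',
    region_name='bj',
    config=Config(
        signature_version='s3v4',
        response_checksum_validation='when_required',
    ),
)

# Option 2: avoid ranged GETs entirely by raising the multipart threshold above the
# object size, so the file is fetched in a single request and the whole-object crc32
# actually describes the bytes that were received.
single_request = TransferConfig(multipart_threshold=200 * 1024 * 1024)

client.download_file(
    'my-unique-bucket-3',
    'Cherry-Studio-1.1.2-arm64.dmg',
    '/root/Downloads/Cherry-Studio-1.1.2-arm64.dmg',
    Config=single_request,
)
```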

### Additional Information/Context

_No response_

### SDK version used

1.37.28

### Environment details (OS name and version, etc.)

Ubuntu 22.04.5 LTS

hxddh avatar Apr 21 '25 03:04 hxddh

Hello @hxddh, thanks for bringing up the issue. It looks like you are using a third-party provider, Baidu. Please ensure that the third party implements the most up-to-date checksum behavior that S3 uses (see https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html, https://aws.amazon.com/blogs/aws/introducing-default-data-integrity-protections-for-new-objects-in-amazon-s3/, and https://github.com/boto/boto3/issues/4392). Using the AWS SDK, you could try:

```python
import boto3
from botocore.client import Config
import os
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

client = boto3.client(
    's3',
    region_name='us-east-1',
    config=Config(signature_version='s3v4')
)

bucket_name = 'bucket-name'  # Replace with your bucket name
object_key = 'sample-file.txt'  # Replace with your object key
local_path = '/tmp/sample-file.txt'  # Replace with your desired download path

try:
    logger.info(f"Downloading file: {bucket_name}/{object_key} -> {local_path}")
    
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    
    client.download_file(bucket_name, object_key, local_path)
    logger.info(f"Download completed, file size: {os.path.getsize(local_path)} bytes")
except Exception as e:
    logger.error(f"Download failed: {str(e)}")

Please let me know if you have any questions. Thanks

adev-code avatar Apr 22 '25 21:04 adev-code

Thanks for your answer. Yes, I am using third-party compatible storage through the new version of the AWS SDK, and I noticed the change in the S3 checksum behavior earlier. My question is this: when GetObject is used and the third-party server returns the crc32 value for the entire object, what is the AWS SDK's validation logic? From the error logs, it appears the client validates the crc32 of each range slice against that value, which causes the check to fail. I think it would make more sense for the AWS SDK client to validate only the crc32 of the complete object file.
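
For what it's worth, validating only the complete object can already be done manually after the download. A rough sketch, reusing the `client` from the reproduction script; whether the provider returns `ChecksumCRC32` from `head_object(..., ChecksumMode='ENABLED')` is an assumption here, since the log only shows the value on the GetObject response.

```python
import base64
import zlib

def local_crc32_b64(path: str, chunk_size: int = 1024 * 1024) -> str:
    """CRC32 of a local file, base64-encoded the way S3 reports it."""
    crc = 0
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return base64.b64encode(crc.to_bytes(4, 'big')).decode()

# Assumption: the provider exposes the whole-object checksum via HeadObject.
head = client.head_object(
    Bucket='my-unique-bucket-3',
    Key='Cherry-Studio-1.1.2-arm64.dmg',
    ChecksumMode='ENABLED',
)
expected = head.get('ChecksumCRC32')            # e.g. 'iXcQTg==' in the log above
actual = local_crc32_b64('/root/Downloads/Cherry-Studio-1.1.2-arm64.dmg')

if expected and expected != actual:
    raise ValueError(f'Whole-file CRC32 mismatch: expected {expected}, got {actual}')
```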

hxddh avatar Apr 28 '25 08:04 hxddh

Hi @hxddh, thanks for reaching out. The error shows a mismatch between the checksum received from the third party and the one calculated locally: Expected checksum iXcQTg== did not match calculated checksum: AoSIYQ==. Since a third-party provider is involved, the AWS SDK has no insight into how that provider computes checksums. For testing purposes, could you try going straight to S3 (not using the third-party endpoint) and see if you get the same error?

adev-code avatar May 08 '25 17:05 adev-code

Greetings! It looks like this issue hasn’t been active in longer than five days. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.

github-actions[bot] avatar May 20 '25 00:05 github-actions[bot]

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.

github-actions[bot] avatar Aug 21 '25 20:08 github-actions[bot]