Error downloading large files when an S3-compatible server returns a full-object crc32
### Describe the bug
When using S3-compatible storage, GetObject downloads larger than 8 MB are split into ranged requests. If the server returns the crc32 of the whole file on each ranged response, the SDK still validates that value against the crc32 of each range slice, so downloading large files fails with a checksum mismatch.
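For context, the split is done by s3transfer at the 8 MB multipart threshold. A minimal sketch of a workaround that forces `download_file` to issue a single GetObject by raising that threshold above the object size, so the full-object crc32 matches the full body (endpoint, bucket, and threshold value are taken from this report or assumed for illustration):

```python
import boto3
from boto3.s3.transfer import TransferConfig

client = boto3.client('s3', endpoint_url='https://s3.bj.bcebos.com')

# Raise the multipart threshold above the object size so download_file
# issues one GetObject instead of many ranged requests.
single_shot = TransferConfig(multipart_threshold=1024 * 1024 * 1024)  # 1 GiB, assumed large enough

client.download_file(
    'my-unique-bucket-3',
    'Cherry-Studio-1.1.2-arm64.dmg',
    '/root/Downloads/Cherry-Studio-1.1.2-arm64.dmg',
    Config=single_shot,
)
```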
### Regression Issue
- [x] Select this option if this issue appears to be a regression.
### Expected Behavior
When the S3-compatible storage returns the crc32 of the full file instead of the crc32 of the range slice, the SDK should validate only the crc32 of the full file.
### Current Behavior
```
2025-04-07 02:54:59,752 - botocore.parsers - DEBUG - Response headers: {'Date': 'Mon, 07 Apr 2025 02:54:59 GMT', 'Content-Type': 'application/x-apple-diskimage', 'Content-Length': '8388608', 'Connection': 'keep-alive', 'Accept-Ranges': 'bytes', 'Content-Range': 'bytes 16777216-25165823/115452447', 'ETag': '"-334f81da6dab708e0a2ff7debac4b171"', 'Expires': 'Thu, 10 Apr 2025 02:54:59 GMT', 'Last-Modified': 'Tue, 01 Apr 2025 11:09:22 GMT', 'Server': 'BceBos', 'X-Amz-Checksum-Crc32': 'iXcQTg==', 'X-Amz-Id-2': 'PNxovhv15mlenXrgdTlJ2HiseDUCz8S/pqrliau4O0luY9a8/faytQzc7dgtXJ0jV8hyzN/LCVIeQ6PP1D5duQ==', 'X-Amz-Request-Id': '1a9b4d6f-29ba-4122-8494-bc9590cffb2a', 'X-Amz-Storage-Class': 'STANDARD', 'X-Bce-Flow-Control-Type': '-1', 'X-Bce-Is-Transition': 'false'}
2025-04-07 02:54:59,752 - botocore.parsers - DEBUG - Response body:
<botocore.httpchecksum.StreamingChecksumBody object at 0x7f9f64719300>
2025-04-07 02:54:59,753 - botocore.hooks - DEBUG - Event needs-retry.s3.GetObject: calling handler <function _update_status_code at 0x7f9f656f5750>
2025-04-07 02:54:59,753 - botocore.hooks - DEBUG - Event needs-retry.s3.GetObject: calling handler <botocore.retryhandler.RetryHandler object at 0x7f9f6482c880>
2025-04-07 02:54:59,754 - botocore.retryhandler - DEBUG - No retry needed.
2025-04-07 02:54:59,754 - botocore.hooks - DEBUG - Event needs-retry.s3.GetObject: calling handler <bound method S3RegionRedirectorv2.redirect_from_error of <botocore.utils.S3RegionRedirectorv2 object at 0x7f9f6482c940>>
2025-04-07 02:54:59,939 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,001 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,006 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,007 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,027 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,029 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,035 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,041 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,111 - s3transfer.futures - DEBUG - Submitting task IORenameFileTask(transfer_id=0, {'final_filename': '/root/Downloads/Cherry-Studio-1.1.2-arm64.dmg'}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f9f6482ccd0> for transfer request: 0.
2025-04-07 02:55:00,111 - s3transfer.utils - DEBUG - Acquiring 0
2025-04-07 02:55:00,111 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,111 - s3transfer.tasks - DEBUG - IORenameFileTask(transfer_id=0, {'final_filename': '/root/Downloads/Cherry-Studio-1.1.2-arm64.dmg'}) about to wait for the following futures []
2025-04-07 02:55:00,111 - s3transfer.tasks - DEBUG - IORenameFileTask(transfer_id=0, {'final_filename': '/root/Downloads/Cherry-Studio-1.1.2-arm64.dmg'}) done waiting for dependent futures
2025-04-07 02:55:00,113 - s3transfer.utils - DEBUG - Releasing acquire 0/None
2025-04-07 02:55:00,115 - main - ERROR - Download failed:
Traceback (most recent call last):
File "/root/bos_demo.py", line 80, in
### Reproduction Steps
```python
import logging
import sys
import boto3
from botocore.client import Config
import os

# ------------------- Logging Configuration -------------------
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)]
)
logging.getLogger('boto3').setLevel(logging.DEBUG)
logging.getLogger('botocore').setLevel(logging.DEBUG)
logging.getLogger('urllib3').setLevel(logging.DEBUG)
logger = logging.getLogger(__name__)

# ------------------- BOS Client Configuration -------------------
access_key = '*********'
secret_key = '*******'
endpoint_url = 'https://s3.bj.bcebos.com'
region_name = 'bj'

client = boto3.client(
    's3',
    endpoint_url=endpoint_url,
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    region_name=region_name,
    config=Config(signature_version='s3v4')
)

# ------------------- List Buckets -------------------
try:
    logger.info("Listing buckets...")
    response = client.list_buckets()
    logger.debug("BOS response raw data: %s", response)
    print("Bucket list:")
    for bucket in response['Buckets']:
        logger.debug("Bucket discovered: %s", bucket['Name'])
        print(f"- {bucket['Name']}")
except Exception as e:
    logger.error("Failed to list buckets:", exc_info=True)

# ------------------- File Upload -------------------
bucket_name = 'my-unique-bucket-3'
object_key = 'test.txt'
file_path = './test.txt'

try:
    logger.info("Attempting to upload file: %s -> %s/%s", file_path, bucket_name, object_key)
    if not os.path.exists(file_path):
        logger.warning("Local file does not exist: %s", file_path)
        raise FileNotFoundError(f"File {file_path} not found")
    client.upload_file(file_path, bucket_name, object_key)
    logger.info("Upload successful, verifying file ETag: %s",
                client.head_object(Bucket=bucket_name, Key=object_key)['ETag'])
except Exception as e:
    logger.error("Upload failed:", exc_info=True)

# ------------------- File Download -------------------
object_key = 'Cherry-Studio-1.1.2-arm64.dmg'
local_path = '/root/Downloads/Cherry-Studio-1.1.2-arm64.dmg'

try:
    logger.info("Attempting to download file: %s/%s -> %s", bucket_name, object_key, local_path)
    # Check bucket permissions
    try:
        client.head_bucket(Bucket=bucket_name)
    except Exception as e:
        logger.error("Bucket access permission check failed:", exc_info=True)
        raise

    # Perform download
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    logger.debug("Local directory ensured: %s", os.path.dirname(local_path))
    client.download_file(bucket_name, object_key, local_path)
    logger.info("Download completed, file size: %d bytes", os.path.getsize(local_path))
except Exception as e:
    logger.error("Download failed:", exc_info=True)
```
### Possible Solution
When S3-compatible storage returns only the crc32 of the entire object, not the crc32 of each range slice, the SDK should validate the checksum against the complete file instead of failing each ranged part.
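As an interim workaround (my own suggestion, not an official fix), botocore 1.36+ exposes a `response_checksum_validation` client option; setting it to `'when_required'` skips response-checksum validation unless the operation requires it, instead of validating whenever the server returns a checksum header:

```python
import boto3
from botocore.config import Config

# Only validate response checksums when the API requires it, rather than
# the default 'when_supported' (validate whenever a checksum header is present).
client = boto3.client(
    's3',
    endpoint_url='https://s3.bj.bcebos.com',  # 3rd-party endpoint from this report
    config=Config(response_checksum_validation='when_required'),
)

client.download_file('my-unique-bucket-3', 'Cherry-Studio-1.1.2-arm64.dmg',
                     '/root/Downloads/Cherry-Studio-1.1.2-arm64.dmg')
```

Note this trades away client-side integrity checking on downloads from this endpoint, so it only masks the mismatch rather than fixing it.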
### Additional Information/Context
_No response_
### SDK version used
1.37.28
### Environment details (OS name and version, etc.)
Ubuntu 22.04.5 LTS
Hello @hxddh, thanks for bringing up the issue. It looks like you are using a 3rd-party provider called Baidu. Please ensure that the 3rd party implements the most up-to-date checksum behavior that S3 supports (see https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html, https://aws.amazon.com/blogs/aws/introducing-default-data-integrity-protections-for-new-objects-in-amazon-s3/, and https://github.com/boto/boto3/issues/4392). Using the AWS SDK, you could try:
```python
import boto3
from botocore.client import Config
import os
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

client = boto3.client(
    's3',
    region_name='us-east-1',
    config=Config(signature_version='s3v4')
)

bucket_name = 'bucket-name'         # Replace with your bucket name
object_key = 'sample-file.txt'      # Replace with your object key
local_path = '/tmp/ample-file.txt'  # Replace with your desired download path

try:
    logger.info(f"Downloading file: {bucket_name}/{object_key} -> {local_path}")
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    client.download_file(bucket_name, object_key, local_path)
    logger.info(f"Download completed, file size: {os.path.getsize(local_path)} bytes")
except Exception as e:
    logger.error(f"Download failed: {str(e)}")
```
Please let me know if you have any questions. Thanks
Thanks for your answer. Yes, I am using 3rd-party compatible storage via the new version of the AWS SDK, and I noticed the change in the S3 checksum behavior earlier. My question is: on GetObject, if the 3rd-party server returns the crc32 value of the entire object, what is the AWS SDK's data-validation logic? From the error logs, the client appears to check the crc32 of each range slice, which makes the check fail and raises an error. I think it would make more sense for the AWS SDK client to validate only the crc32 of the complete object file.
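To make the mismatch concrete, here is a rough sketch (SDK-side validation disabled so the body can be read; bucket, key, range, and checksum values are taken from the logs in this report) that compares the checksum header the server sends on a ranged GetObject against the crc32 of that range:

```python
import base64
import zlib

import boto3
from botocore.config import Config

# Disable automatic response validation so the ranged body can be read
# and compared manually instead of raising a checksum error.
client = boto3.client(
    's3',
    endpoint_url='https://s3.bj.bcebos.com',
    config=Config(response_checksum_validation='when_required'),
)

resp = client.get_object(
    Bucket='my-unique-bucket-3',
    Key='Cherry-Studio-1.1.2-arm64.dmg',
    Range='bytes=16777216-25165823',  # one 8 MB slice, as in the debug logs
)
body = resp['Body'].read()

# CRC32 of just this slice, base64-encoded the way S3 encodes checksum headers.
slice_crc32 = base64.b64encode(zlib.crc32(body).to_bytes(4, 'big')).decode()

print('x-amz-checksum-crc32 header:', resp.get('ChecksumCRC32'))  # full-object crc32 (iXcQTg== here)
print('crc32 of the range slice:   ', slice_crc32)                # differs (AoSIYQ==), hence the error
```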
Hi @hxddh, thanks for reaching out. The error shows a mismatch between what the 3rd party returned and what the client computed: Expected checksum iXcQTg== did not match calculated checksum: AoSIYQ==. Since a 3rd party is involved, the AWS SDK has no context on how the checksum is produced on the 3rd-party side. For testing purposes, could you try going straight to S3 (not using the 3rd-party endpoint) and see if you get the same error?
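For that comparison, a small sketch (bucket and key are placeholders) that asks S3 itself what checksum metadata it stores for the object, which helps show both the algorithm in use and, where reported, whether it is a full-object or composite checksum:

```python
import boto3

# Plain AWS S3 client, no third-party endpoint.
client = boto3.client('s3', region_name='us-east-1')

resp = client.head_object(
    Bucket='bucket-name',     # replace with your bucket
    Key='sample-file.txt',    # replace with your key
    ChecksumMode='ENABLED',   # ask S3 to include stored checksum metadata
)

print('ChecksumCRC32:', resp.get('ChecksumCRC32'))  # present if the object has a stored CRC32
print('ChecksumType: ', resp.get('ChecksumType'))   # FULL_OBJECT vs COMPOSITE, when reported
```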
Greetings! It looks like this issue hasn’t been active in longer than five days. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.
This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.