s3fs
botocore no longer populates the `Content-MD5` header leading to `MissingContentMD5` error
Hello,
As of version 1.36.0, botocore no longer populates the Content-MD5 header (see changelog entry here). This change was subsequently merged into aiobotocore as of version 2.18 (see commit here).
Practically, this now seems to mean that when I try to perform a delete operation on an S3FS file system I receive the following error:
File "/usr/local/lib/python3.12/site-packages/s3fs/core.py", line 114, in _error_wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/aiobotocore/client.py", line 412, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (MissingContentMD5) when calling the DeleteObjects operation: Missing required header for this request: Content-Md5.
So far my only workaround is to pin aiobotocore < 2.18. I am using the latest S3FS (2024.12.0).
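For reference, a minimal sketch of the kind of call that hits this (the endpoint, bucket and key below are placeholders, not my real setup):

import s3fs

# hypothetical internal MinIO endpoint, purely for illustration
fs = s3fs.S3FileSystem(endpoint_url="https://minio.internal.example.com")
# s3fs routes deletes through the DeleteObjects API call, which is where MissingContentMD5 is raised
fs.rm("my-bucket/some/key.txt")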
Thanks
Thanks for bringing this to my attention.
Is this against AWS, or another implementation of S3? If it is AWS, how are you expected to delete files now?
No, I'm using S3FS to interact with an internal Minio instance (and, to be honest, I don't know enough about AWS/S3 to answer the follow-up - it just appears to me to be a potentially very impactful change in behaviour).
Just to follow up, I have tried to look through what I believe to be the offending commit (here), and perhaps request_checksum_calculation now needs to be set?
OK, so I gather AWS must have switched to CRC and minio (maybe depending on deployment version) has not.
The doc suggests that changing the value of client_config.request_checksum_calculation to "when_supported" in the config (or via the AWS_REQUEST_CHECKSUM_CALCULATION env variable) will only affect whether the CRC is calculated, never MD5; all the associated MD5 code is marked deprecated. Maybe still worth a try?
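In case anyone wants to try it, a sketch of the two routes the doc describes; the values come from the botocore config documentation, not from anything tested here:

import os
from botocore.config import Config

# env-var route: must be set before the client is created
os.environ["AWS_REQUEST_CHECKSUM_CALCULATION"] = "when_supported"
# client-config route: the same setting on a botocore Config object
cfg = Config(request_checksum_calculation="when_supported")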
Upstream Minio issue: https://github.com/minio/minio/issues/20845
We’re running into a similar issue, though it’s slightly different:
OSError: [Errno 22] An error occurred (MissingContentLength) when calling the PutObject operation: Unknown
It looks like this is a breaking change in boto3:
- https://github.com/boto/boto3/issues/4392
- https://github.com/boto/boto3/issues/4398
Would this be something that can be fixed in s3fs, or does it need to be handled in one of the dependencies?
botocore 1.36.0 also broke s3fs for my S3-compatible on-prem deployment. This is reproducible for me:
# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "pandas",
#     "s3fs",
#     "botocore==1.36",
# ]
# ///
import s3fs
import pandas as pd
s3 = s3fs.S3FileSystem(profile="my-profile")
df = pd.DataFrame({"my_col":[1, 2, 3]})
df.to_csv("/tmp/test_df.csv")
s3.put("/tmp/test_df.csv", "s3://my-bucket/my-prefix/test_df.csv")
# when botocore<1.36:
# ,my_col
# 0,1
# 1,2
# 2,3
# when botocore==1.36.0
# 14
# ,my_col
# 0,1
# 1,2
Essentially there is some kind of data corruption: a random string (or number?) is being put at the top of my csv, in this case 14.
(ran the above as a PEP 723 script using uv)
As far as I know, the only solution currently is to downgrade botocore. I don't know if there's any scope for s3fs to add the extra headers, since the values are calculated on the finished HTTP request after control has passed to botocore.
Unfortunately, it doesn't seem like botocore is interested in maintaining compatibility, since they explicitly target AWS.
Having said that, I'm surprised to see PutObject implicated too - either with the client error (which seems to be the same issue) or data corruption (which may well be something else). In the case of PutObject, we do always know the length of the body beforehand, so we could pass it explicitly if we know which header key is required.
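To illustrate the idea (this is not what s3fs does today), boto3's put_object already accepts an explicit ContentLength, so a plain client call can supply the known length up front; bucket and key here are placeholders:

import boto3

s3 = boto3.client("s3")
data = b"hello world"
s3.put_object(
    Bucket="my-bucket",         # placeholder bucket
    Key="my-prefix/hello.txt",  # placeholder key
    Body=data,
    ContentLength=len(data),    # length known before control passes to botocore
)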
Perhaps someone can do a trace to see how the calls differ between the new and old botocore?
I have another emergency I need to deal with today...
@martindurant, I think the checksum changes made in PR https://github.com/boto/botocore/pull/3271 are likely causing this issue. Setting the environment variable AWS_REQUEST_CHECKSUM_CALCULATION to WHEN_REQUIRED might address it.
@boringbyte , I don't think so. In fact, "required" is the default; setting it to the more general "when_supported" doesn't help either, though, since it still produces a CRC rather than the previous behaviour with MD5.
Just to follow up, updating Minio to the latest version (RELEASE.2025-01-20T14-49-07Z) resolved the issue for me. I therefore think this can be closed, as it is an upstream boto / minio issue. Thank you
I'll leave it open for now as the ecosystem catches up - and maybe someone comes up with a way to inject those headers for older deployments.
It seems to me there is a way to disable this behaviour according to the issue on botocore: https://github.com/boto/boto3/issues/4398#issuecomment-2619946229
Is it not possible for us to pass in some kind of extra config to enable this?
That config can be changed via an environment variable ( https://github.com/fsspec/s3fs/issues/931#issuecomment-2624091008 ), so please do try it!
Environment variables are fine and dandy, but it seems like a limited solution to need to know about and set an env var in every place this might be running. Plus, not all of us are here because we use s3fs directly - in my case it's because pyiceberg relies on s3fs. It would be much more effective imo for us and other libs using s3fs to be able to set a flag directly in our code that carries across to all environments.
Environment variables are fine and dandy,
The question is: does this workaround solve the problem? If yes, we can work out how to expose it programmatically.
@martindurant I can confirm adding the environment variable fixes the problem.
export AWS_REQUEST_CHECKSUM_CALCULATION='WHEN_REQUIRED'
Thanks for testing.
request_checksum_calculation appears in the botocore config (https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html ), so I would try passing it using client_kwargs or config_kwargs to s3fs.
Assuming one/both of those works, then I suppose we are done: we have a workaround. However, we might still try to make this more prominent, provide extra documentation or try to catch that exact exception and provide remediation instructions.
It works with config_kwargs only.
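For anyone else landing here, roughly what worked for me (the endpoint and paths are placeholders):

import s3fs

fs = s3fs.S3FileSystem(
    endpoint_url="https://minio.internal.example.com",  # placeholder S3-compatible endpoint
    config_kwargs={"request_checksum_calculation": "when_required"},
)
fs.rm("my-bucket/some/key.txt")  # DeleteObjects no longer fails with MissingContentMD5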
Side note: Unfortunately, using request_checksum_calculation gives an error when boto3<1.36: TypeError: Got unexpected keyword argument 'request_checksum_calculation'
Unfortunately, using request_checksum_calculation gives an error when boto3<1.36
OK, so we certainly can't make this the default.
What is the opinion here, is this thread enough to get people working again? Do we need a documentation note somewhere?
export AWS_REQUEST_CHECKSUM_CALCULATION='WHEN_REQUIRED'
Wanted to update that this fixes the data corruption issue I posted about above: https://github.com/fsspec/s3fs/issues/931#issuecomment-2615824899
I think it would be very beneficial if s3fs was able to do something to automatically fix this. I realize this issue is not at all s3fs' fault. I just think that many power users will now have to remember to set this env var in all of their environments or scripts. Non power users - say users benefiting from pandas wrapping around it behind the scenes - will be very confused about why they're now getting data corruption. No error is actually thrown in my example above, which makes it even more difficult.
Couldn't we programmatically add that to the config kwargs if we see that botocore >= 1.36 is installed?
from importlib.metadata import version

# Only set the option when the installed botocore actually supports it
# (request_checksum_calculation was added in 1.36.0); compare the version
# numerically so that e.g. 1.4 doesn't appear newer than 1.36.
major, minor = (int(part) for part in version("botocore").split(".")[:2])
if (major, minor) >= (1, 36):
    config_kwargs["request_checksum_calculation"] = "when_required"
Couldn't we programmatically add that to the config kwargs if we see that botocore >= 1.36 is installed?
Isn't it the case that this config should not be set when the endpoint is real AWS?
Hello, any estimated timeline to fix this?
As you can see from the above conversation, it's not entirely obvious what the right fix should be for a mix of botocore version and backend deployment. The environment variable workaround seems to be effective.
For me, both of these env variables are needed to bypass it with minio:
AWS_REQUEST_CHECKSUM_CALCULATION=WHEN_REQUIRED
AWS_RESPONSE_CHECKSUM_VALIDATION=WHEN_REQUIRED
The error I got trying to write to s3-compatible Ceph storage (Open Storage Network) was OSError: [Errno 22] An error occurred (MissingContentLength) when calling the PutObject operation: Unknown. Armed with this error message, Gemini helpfully explained the problem and told me about setting the env vars: https://g.co/gemini/share/9082c6145d9c