s3: user metadata that contains "non-standard" characters is silently omitted
Describe the bug
Metadata values are parsed in the snippet below.
https://github.com/apache/opendal/blob/f13ae4013d3b7bf074c68f00620f3b6f639387a9/core/src/raw/http_util/header.rs#L204-L216
This uses the function `HeaderValue::to_str`, which returns an error when it encounters non-visible ASCII characters; this in turn causes the entire key-value pair to be filtered out.
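For illustration, this behavior can be reproduced standalone with the `http` crate (the sample value is made up):

```rust
use http::HeaderValue;

fn main() {
    // "ö" is the two UTF-8 bytes 0xc3 0xb6; HeaderValue accepts them as opaque bytes.
    let value = HeaderValue::from_bytes(b"\xc3\xb6ha").unwrap();
    // to_str only allows visible ASCII (plus SP/HTAB), so this returns an error...
    assert!(value.to_str().is_err());
    // ...even though the raw bytes are still there.
    assert_eq!(value.as_bytes(), &b"\xc3\xb6ha"[..]);
}
```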
Solutions
To fix this, I'd suggest replacing `value.to_str()` with `String::from_utf8(value.as_bytes().to_vec())`, as well as returning an error if that fails.
That is, if there's no good reason for restricting the charset of that string - is there one in this case?
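A minimal sketch of that suggestion, with the surrounding parsing loop unchanged (the function name here is only illustrative):

```rust
use http::HeaderValue;

/// Illustrative stand-in for the parsing step in header.rs: accept any
/// valid UTF-8 value and return an error (rather than dropping the pair)
/// if the bytes are not valid UTF-8.
fn parse_meta_value(value: &HeaderValue) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(value.as_bytes().to_vec())
}
```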
An alternative solution could be to store metadata in a wrapped `HashMap<String, HeaderValue>` and to offer a `get(k) -> Result<String>` and a `get_raw(k) -> &[u8]`.
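A rough sketch of what such a wrapper could look like; none of these names exist in opendal today, and the exact signatures are only an assumption:

```rust
use std::collections::HashMap;

use http::HeaderValue;

/// Hypothetical wrapper: keep the raw header bytes and decode lazily.
pub struct UserMetadata(HashMap<String, HeaderValue>);

impl UserMetadata {
    /// Decode a value as UTF-8, surfacing an error instead of dropping it.
    pub fn get(&self, k: &str) -> Option<Result<String, std::string::FromUtf8Error>> {
        self.0
            .get(k)
            .map(|v| String::from_utf8(v.as_bytes().to_vec()))
    }

    /// Hand out the raw bytes for callers that want no charset assumptions.
    pub fn get_raw(&self, k: &str) -> Option<&[u8]> {
        self.0.get(k).map(|v| v.as_bytes())
    }
}
```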
Steps to Reproduce
I've observed this with umlauts (such as ö, ä, etc.), but there are likely more characters that don't work:
- Call `stat` on an S3 object with the metadata `"x-amz-meta-notetoself": b"\xc3\xb6ha, das sollte eigentlich funktionieren"`, e.g. stored in MinIO
- `op.stat(key).await.user_metadata().unwrap().get("notetoself")` -> `None`
- Observe that the existing metadata is not listed and there is no error about it either
Expected Behavior
- Any valid metadata should be accessible through the API
- If a problem (actually) occurs while parsing, it should cause an error instead of the value being silently omitted
Additional Context
No response
Are you willing to submit a PR to fix this bug?
- [ ] Yes, I would like to submit a PR.
Thank you for bringing this up.
> That is, if there's no good reason for restricting the charset of that string - is there one in this case?
The HTTP specifications require the use of only visible US-ASCII octets (VCHAR), SP, and HTAB. Therefore, I don't believe these are valid values for the AWS S3 API.
However, I think we can improve our handling in this case, as ignoring invalid header values may confuse users. Perhaps we could return an error for this API instead, similar to the behavior of the AWS S3 Rust SDK.
In the AWS S3 Rust SDK, header values are handled this way: https://docs.rs/aws-smithy-http/latest/aws_smithy_http/header/fn.one_or_none.html
```rust
/// Read exactly one or none from a headers iterator
///
/// This function does not perform comma splitting like `read_many`
pub fn one_or_none<'a, T: FromStr>(
    mut values: impl Iterator<Item = &'a str>,
) -> Result<Option<T>, ParseError>
where
    T::Err: Error + Send + Sync + 'static,
{
    let first = match values.next() {
        Some(v) => v,
        None => return Ok(None),
    };
    match values.next() {
        None => T::from_str(first.trim())
            .map_err(|err| ParseError::new("failed to parse string").with_source(err))
            .map(Some),
        Some(_) => Err(ParseError::new(
            "expected a single value but found multiple",
        )),
    }
}
```
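A quick usage sketch, assuming `aws-smithy-http` is available as a dependency (the values are made up):

```rust
use aws_smithy_http::header::one_or_none;

fn main() {
    // Exactly one header value parses to Some(T).
    let parsed: Option<u32> = one_or_none(["42"].into_iter()).unwrap();
    assert_eq!(parsed, Some(42));

    // No header values at all is Ok(None), not an error.
    let absent: Option<u32> = one_or_none(std::iter::empty()).unwrap();
    assert_eq!(absent, None);
}
```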
It does exactly the same as our current `value.to_str()` call, but returns an error instead of silently omitting the value.
> The HTTP specifications require the use of only visible US-ASCII octets (VCHAR), SP, and HTAB. Therefore, I don't believe these are valid values for the AWS S3 API.
Oh - I didn't know that; that's unfortunate. Then returning an error is the only thing that can be done on this side, I suppose.
(Btw, it seems like s3cmd will just return the non-US-ASCII bytes.)
https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html
> When using non-US-ASCII characters in your metadata values, the provided Unicode string is examined for non-US-ASCII characters. Values of such headers are character decoded as per RFC 2047 before storing and encoded as per RFC 2047 to make them mail-safe before returning. If the string contains only US-ASCII characters, it is presented as is.
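For a concrete idea of what that means, a minimal sketch of an RFC 2047 encoded-word using the B (base64) encoding and the `base64` crate; the value is the one from the reproduction above:

```rust
use base64::engine::general_purpose::STANDARD;
use base64::Engine as _;

fn main() {
    let value = "öha, das sollte eigentlich funktionieren";
    // RFC 2047 encoded-word layout: =?charset?encoding?encoded-text?=
    let encoded = format!("=?UTF-8?B?{}?=", STANDARD.encode(value.as_bytes()));
    // The result contains only visible ASCII, i.e. it is "mail-safe".
    assert!(encoded.is_ascii());
    println!("{encoded}");
}
```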
It seems like MinIO doesn't do this encoding. I didn't create that metadata through opendal, but with an HTTP form using a presigned request.
Does opendal allow setting metadata containing non-US-ASCII chars?