s3proxy
Illegal characters in Blob metadata not handled gracefully
Hi. I'm investigating S3Proxy for suitability in a use case where I want a third-party application to use Azure Blob Storage via the S3 API. The third-party application is not open for modification, so I have to make do with whatever it is throwing at the API.
I'm running into an issue where the application attempts to store metadata with a blob whose keys contain characters that are illegal in Azure (e.g. a dash '-'). S3Proxy throws a 500 Internal Server Error when this happens.
Encoding (and decoding on reads) may not be feasible. But I can imagine an opt-in feature that, if enabled, will store metadata in the Blob Index Tags instead of in Metadata. See https://learn.microsoft.com/en-us/azure/storage/blobs/storage-manage-find-blobs?tabs=azure-portal#choosing-between-metadata-and-blob-index-tags for details.
From the looks of it, the current behavior is blocking my use of S3Proxy, so I would be very appreciative if this improvement would be added.
Thanks for the report! This has come up a few times. To start, s3proxy shouldn't return a 500 error in this case. I think Azure returns 400, but need to double check. s3proxy should at least propagate the Azure error.
Since Azure is fairly restrictive, it may be useful to optionally base64 encode the metadata keys for Azure. I think that would be preferable to index tags, which appear more limited in their use (e.g. 10 tags per blob) and are not returned by GetBlob/GetBlobProperties.
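One wrinkle with plain base64: its standard alphabet includes '+', '/', and '=', which are themselves illegal in a C# identifier, so the encoded key would still be rejected by Azure. A minimal sketch of the idea using base32 instead (whose RFC 4648 alphabet is only A-Z and 2-7, all legal identifier characters) could look like this; the `PREFIX` sentinel is a hypothetical choice to guarantee the name starts with a letter and to mark proxy-encoded keys:

```python
import base64

# Hypothetical sentinel: ensures the stored name starts with a letter and
# marks keys that the proxy has encoded.
PREFIX = "x"

def encode_key(key: str) -> str:
    # RFC 4648 base32 emits only A-Z and 2-7, all legal in a C# identifier.
    # The '=' padding is stripped here and recomputed on decode.
    return PREFIX + base64.b32encode(key.encode("utf-8")).decode("ascii").rstrip("=")

def decode_key(name: str) -> str:
    raw = name[len(PREFIX):]
    pad = "=" * (-len(raw) % 8)  # restore padding to a multiple of 8 chars
    return base64.b32decode(raw + pad).decode("utf-8")
```

The trade-off is readability: the stored names are opaque, which is why this would only make sense as an opt-in when S3Proxy is the sole client.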
+1, we would also use this if base64 encode/decode were implemented.
Could you explain this use case? If S3Proxy base64-encodes metadata, then non-S3Proxy clients will not be able to read it.
We're trying to use s3proxy between Dovecot+obox+s3 and Azure Blob Storage. Dovecot obox stores emails and indices in S3 object storage, and its generic S3 support uses metadata for various identifiers. The Azure naming standard requires that blob metadata names be valid C# identifiers, i.e. alphanumerics plus underscore (plus a bunch of special cases that aren't immediately relevant here); the '@' symbol is magic as an escape for reserved words. However, Dovecot uses punctuation characters in its metadata names (specifically '-'), so it is not compatible, resulting in the 500 errors mentioned above when doing a PUT.

What we need is a way to reliably and bidirectionally map something like "foo-bar" to a valid Azure metadata identifier and back again, so that Dovecot will work. As a comparison, Flexify successfully does this (though I've not investigated exactly how it does the mapping); possibly one of the more obscure formatting characters is used as an escape for encoding the noncompliant characters.

Ideally, this would be a configurable toggle for people who don't want/need the extra complexity, and of course only applicable to the Azure storage backend. Since our object storage will only be accessed by Dovecot via S3Proxy, the lack of compatibility with other non-S3Proxy clients (e.g. Flexify) is not an issue. Note that we're still compatible for alphanumerics and underscores, so any use case with C#-identifier-compliant metadata names can still mix clients against an Azure backing.

You could potentially offer multiple mapping options - keep as-is, strip noncompliant characters, base64-encode, escape-hex-encode noncompliant characters, etc. - so people could choose based on their own use case. For us, either escaping or base64 would work, as both are bidirectional.
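The escape-hex option mentioned above can be sketched in a few lines. This is only an illustration of the scheme, not S3Proxy code: '_' is used as the escape character (and is therefore escaped itself), compliant characters pass through unchanged, and the mapping assumes ASCII key names (code points above 0xFF would need a wider escape):

```python
import re

def munge(key: str) -> str:
    # Keep ASCII letters and digits as-is; escape everything else,
    # including '_' itself since it doubles as the escape character,
    # as '_' followed by two hex digits. Assumes code points <= 0xFF.
    out = []
    for ch in key:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append("_%02x" % ord(ch))
    return "".join(out)

def unmunge(name: str) -> str:
    # Reverse the escape: '_' plus two hex digits becomes the original char.
    return re.sub(r"_([0-9a-f]{2})", lambda m: chr(int(m.group(1), 16)), name)
```

For example, "foo-bar" becomes "foo_2dbar", which is a valid C# identifier, and the round trip is lossless even for keys that already contain underscores.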
OK, this makes sense if S3Proxy is the only, or at least the primary, client. I think this is something a middleware is appropriate for, since it is non-standard behavior and may be useful for other providers. Basically, putBlob would munge the user metadata and getBlob and others would unmunge it. This is actually a straightforward task and a good first contribution. Could you look at ShardedBlobStore as an example?
Also agree that 500 error is unexpected and S3Proxy might not be propagating the error state correctly. This is a separate issue that should be addressed.
@sshipway Interesting; we are (were) working at the same use case, i.e. with Dovecot through S3Proxy to Blob Storage. We found another workaround, which did involve modifying S3Proxy. Turns out Dovecot uses fixed prefixes on the metadata: https://doc.dovecot.org/admin_manual/obox/storage_side_metadata/.
The modification we made was to have S3Proxy strip that prefix upon writing and re-add it on reading. In fact, judging by the comments in the code, this was planned as a canonical feature in S3Proxy (@gaul, am I right on this?). This did the trick, at least functionally. We just didn't get around to submitting a PR for it - mainly because a PR should contain a generic and configurable solution, whereas we made only the changes we needed to see if it would work.
And while it did work on a functional level, we were not able to achieve the performance we needed. So at the moment, we're exploring other options. If you do want to go forward with Dovecot on Azure Blob Storage, you may need to resort to installing the proxy alongside Dovecot, because we found the network latency that is introduced by having the proxy run on its own node to be prohibitively expensive.
@AnnejanBarelds Thanks for the link and info on the prefix! For Dovecot, it might be simplest to just map '-' to '_' and back again, for forwards compatibility? The mailbox-guid key also contains a '-', so just stripping the prefix won't help there. Also, we're a bit wary of forking the project for use in a production environment. I've been running my tests with s3proxy as a container on the Dovecot host; in other environments we've used sproxyd on separate hosts over Scality without latency issues, but it may be different with S3. Dovecot used to have explicit support for Azure Blobs in obox, but it was discontinued; we're pushing them to reinstate it. I'd be interested in discussing your solution to the Dovecot-Azure connection privately if you're able to - can you drop me an email (address in my profile)?
@sshipway Mapping '-' to '_' and back will work as well, but only until a key is introduced that contains an underscore. If that happens, Dovecot would submit a key with '_' and get back a key with '-'. So for forward compatibility, it's not a fail-safe option.
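The hazard can be made concrete with a two-line sketch (key names here are hypothetical):

```python
def to_azure(key: str) -> str:
    # Naive forward mapping: dash becomes underscore.
    return key.replace("-", "_")

def from_azure(name: str) -> str:
    # Naive reverse mapping: underscore becomes dash.
    return name.replace("_", "-")

# Round-trips cleanly for today's dash-only keys, e.g. "mailbox-guid",
# but silently corrupts any future key that contains an underscore:
# "some_key" -> "some_key" -> "some-key".
```

Because the mapping is not injective, the reverse direction cannot tell which characters were originally underscores, which is exactly why an escaping scheme is safer.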
You're right about the mailbox-guid key, but we did not see any errors related to it. That may be because mailbox-guid is a 'Dictmap only' key. But yes, there's no guarantee of forward compatibility there either.
We're aware of the discontinued support for Azure Blob Storage, and we would endorse any initiative to get that back.
I'll drop you an email once I've had a chance to align internally on what we're willing to share and discuss at this point in time. Is that OK?