
HTTP implementation not compatible with Azure Blob Storage

Open honkomonk opened this issue 1 year ago • 7 comments

What I wanted to do: Read and write to Azure Blob Storage via pre-signed URLs using the fsspec http implementation.

How I tried to do it: As fsspec does not support write mode for http directly, I used it in combination with a local file cache, as suggested in #1325, and it nearly worked. For testing I used a local Azurite Docker container and generated pre-signed URLs.
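For context, a minimal sketch of that setup (the URL is a placeholder for a pre-signed Azurite/Azure URL; names are illustrative):

```python
import fsspec

# Placeholder pre-signed URL against a local Azurite container
url = "http://127.0.0.1:10000/devstoreaccount1/container/blob.bin?sv=...&sig=..."

# "simplecache::" puts a local file cache in front of the HTTP filesystem,
# which is what enables write mode for http (the approach suggested in #1325).
# On close, the cached file is uploaded via the http implementation's put_file.
with fsspec.open("simplecache::" + url, "wb") as f:
    f.write(b"hello azure")
```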

The issue(s):

  1. The http implementation uses POST as the default method, whereas Azure expects PUT for file uploads. I couldn't find any configuration option for that, so I hardwired it for testing. https://github.com/fsspec/filesystem_spec/blob/89626e89022dc9ed62aa7b0fb8dba9427fa892b3/fsspec/implementations/http.py#L267

  2. Azure expects the Content-Length header to be set, but fsspec (or rather aiohttp) uses "Transfer-Encoding: chunked", which omits the content length from the header. Azure(ite) rejects such uploads.

I cross-checked my suspicion with Postman and in the REPL with httpx while watching the HTTP requests in Wireshark. The transfer works when the content length is set; the chunked mode appears to be the problem.
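Roughly how the httpx cross-check looked (a sketch; `url` is the pre-signed URL from above, and depending on the service additional headers such as `x-ms-blob-type` may be required):

```python
import httpx

data = b"hello azure"

# httpx sets Content-Length automatically for a bytes body, so no
# chunked transfer encoding is used and Azurite accepts the upload
resp = httpx.put(url, content=data)
print(resp.status_code, resp.headers)
```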

I see the advantage of using chunked mode, especially when using compression, but it would be helpful to have the option to upload data of known size (like a file in the cache) so the content length can be set.

honkomonk avatar Oct 16 '23 11:10 honkomonk

For 1.: as you see, the HTTP method can be defined in the call. I understand that in your case, this is buried a couple of calls down; it would be reasonable to have this as an option in the filesystem's __init__ or a config option, although I do hate to add yet more arguments :|
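For reference, a hedged sketch of reaching that buried argument from the public call, assuming extra kwargs are forwarded down to `_put_file`:

```python
import fsspec

fs = fsspec.filesystem("http")
# HTTPFileSystem._put_file accepts a ``method`` keyword ("post" by default);
# passing it through put_file forces PUT without patching the source
fs.put_file("local.bin", url, method="put")
```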

For 2.: we actually do know the size we intend to send, so there's no reason that the Length header should not be set even for chunked send - we set the size of the callback just above. The reason for chunked is to avoid reading a potentially large file into memory before starting to send. The alternative feels more like a pipe_file operation; having said that, aiohttp prefers to read from a file-like object even if all in memory (BytesIO) because it makes the event loop more responsive.
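To illustrate the two approaches (a sketch; sync generator for brevity, names illustrative):

```python
import io

# Streaming upload (current behaviour): a generator yields chunks, so a
# large file never sits in memory -- but aiohttp sees no total size
def gen_chunks(path, chunk_size=5 * 2**20):
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

# pipe_file-style alternative: the whole payload in memory, wrapped in a
# file-like (BytesIO) so aiohttp knows the size yet still reads it in pieces
with open("local.bin", "rb") as f:
    body = io.BytesIO(f.read())
```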

martindurant avatar Oct 16 '23 13:10 martindurant

, although I do hate to add yet more arguments :|

Understandable. But is there any reason to use POST as a default, as the function name _put_file() itself implies the use of PUT?

so there's no reason that the Length header should not be set even for chunked send

Apparently the standard (RFC 7230, section 3.3.2) defines that chunked transfer encoding and the Content-Length header exclude each other.

From what I saw while debugging this, the size has to be set explicitly via the Content-Length header, or the data has to provide a size property; otherwise aiohttp defaults to chunked transfer encoding.
That's happening here in aiohttp. I was testing with a few bytes of data and it still fractured into multiple chunks in separate TCP frames, which the server declined. That's aiohttp internals, but apparently the result of the content size being unknown to aiohttp at that point.
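A minimal reproduction of that behaviour (the url is a placeholder):

```python
import asyncio
import io

import aiohttp

async def main(url: str, data: bytes):
    async with aiohttp.ClientSession() as session:
        async def gen():
            yield data

        # async generator -> AsyncIterablePayload: no size property, so
        # aiohttp falls back to chunked transfer encoding
        async with session.put(url, data=gen()) as resp:
            print(resp.request_info.headers.get("Transfer-Encoding"))  # "chunked"

        # BytesIO -> a payload that does report its size, so aiohttp
        # sends Content-Length and no Transfer-Encoding header
        async with session.put(url, data=io.BytesIO(data)) as resp:
            print(resp.request_info.headers.get("Content-Length"))

asyncio.run(main("http://127.0.0.1:10000/...", b"a few bytes"))
```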

honkomonk avatar Oct 16 '23 14:10 honkomonk

is there any reason to use POST as a default,

It seems to be the most common way to send a file to an HTTP server. I agree that the naming is bad.

the data has to provide a size property

That at least is easy enough to do. Does it then imply loading the whole file in one go?

martindurant avatar Oct 16 '23 14:10 martindurant

Does it then imply loading the whole file in one go?

Not exactly. As I understand it, it's not about breaking large data transfers up into smaller chunks; "chunked transfer encoding" is intended for transfers of unknown length, like an open channel for messages. (I just read into the matter, so take it with a grain of salt.)
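For illustration, this is what a chunked body looks like on the wire: each chunk is prefixed with its length in hex, and a zero-length chunk terminates the stream, so the sender never has to declare a total size up front:

```python
# "hello azure" sent as two chunks: each chunk is "<hex length>\r\n<data>\r\n",
# and "0\r\n\r\n" marks the end of the body (no Content-Length needed)
body = b"5\r\nhello\r\n" + b"6\r\n azure\r\n" + b"0\r\n\r\n"
```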

So, I did a bit of debugging. What apparently happens:

  1. In the _put_file() method, when uploading a file, fsspec provides a generator object as the data for aiohttp's put/post method: https://github.com/fsspec/filesystem_spec/blob/89626e89022dc9ed62aa7b0fb8dba9427fa892b3/fsspec/implementations/http.py#L304

  2. Internally aiohttp maps the data to different kinds of payload, depending on its type. Generators are mapped to AsyncIterablePayload objects, which unfortunately do not define a size property. https://github.com/aio-libs/aiohttp/blob/5b4b7b84cf3fa0cec0384a3115b4d43636834c85/aiohttp/payload.py#L412

  3. aiohttp then builds the headers and, since it has no size information, enables chunked transfer encoding: https://github.com/aio-libs/aiohttp/blob/5b4b7b84cf3fa0cec0384a3115b4d43636834c85/aiohttp/client_reqrep.py#L476

The easiest fix might be to set the Content-Length header for the request explicitly in the _put_file method. It looks like that prevents chunking: https://github.com/aio-libs/aiohttp/blob/5b4b7b84cf3fa0cec0384a3115b4d43636834c85/aiohttp/client_reqrep.py#L474
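A hedged sketch of that fix from the caller's side, assuming put_file forwards extra kwargs to the aiohttp request:

```python
import os

import fsspec

fs = fsspec.filesystem("http")
size = os.path.getsize("local.bin")

# With Content-Length set explicitly, aiohttp (per client_reqrep.py above)
# should skip chunked transfer encoding even for a generator payload
fs.put_file("local.bin", url, method="put",
            headers={"Content-Length": str(size)})
```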

If I have more time I might test that.

honkomonk avatar Oct 17 '23 08:10 honkomonk

If however you give aiohttp a large bytes instance, it complains that you should really pass a file-like instead so that it can read and send it in pieces

martindurant avatar Oct 17 '23 14:10 martindurant

In any case, if you would like to propose a PR with options to force the known-size upload that works for Azure, I'm sure we can figure out an API that isn't too complicated for users.

martindurant avatar Oct 18 '23 13:10 martindurant

If however you give aiohttp a large bytes instance, it complains that you should really pass a file-like instead so that it can read and send it in pieces

Aiohttp has this registry to handle each payload datatype differently: https://github.com/aio-libs/aiohttp/blob/a7bc5e9eeae7c5c90898411962e9a74bf10a9cef/aiohttp/payload.py#L447 The implementation for bytes has a size limit. BytesIO or BufferedReader data could work.
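Something along these lines, then (a sketch; the open file maps to a payload whose size aiohttp reads from the file itself):

```python
import aiohttp

async def put_sized(session: aiohttp.ClientSession, url: str, path: str):
    with open(path, "rb") as f:
        # A BufferedReader maps to a payload that reports its size, so
        # aiohttp sends Content-Length and still streams the file in pieces
        async with session.put(url, data=f) as resp:
            resp.raise_for_status()
```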

On the other hand, the current implementation in fsspec could work if the "Content-Length" header for the request is set. I'll test that when I'm not drowning in work.

honkomonk avatar Oct 19 '23 10:10 honkomonk