filesystem_spec
HTTP implementation not compatible with Azure Blob Storage
What I wanted to do: Read and write to Azure Blob Storage via pre-signed URLs using the fsspec http implementation.
How I tried to do it: As fsspec does not support write mode for http directly, I used it in combination with a local file cache, as suggested in #1325, and it nearly worked. For testing I used a local Azurite Docker container and generated pre-signed URLs.
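Roughly, the setup looked like this (a minimal sketch; the Azurite endpoint, container name and SAS token are placeholders):

```python
import fsspec

# Placeholder pre-signed (SAS) URL pointing at a local Azurite blob endpoint.
url = "http://127.0.0.1:10000/devstoreaccount1/mycontainer/data.bin?sv=...&sig=..."

# "simplecache::" chains a local file cache in front of the http filesystem;
# data is written to a temporary local file first and uploaded in one go when
# the file is closed, which is what makes "wb" possible over http.
with fsspec.open(f"simplecache::{url}", "wb") as f:
    f.write(b"some payload")
```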
The issue(s):
- The http implementation uses POST as the default method, whereas Azure expects PUT for file uploads. I couldn't find any configuration option for that, so I hardwired it for testing. https://github.com/fsspec/filesystem_spec/blob/89626e89022dc9ed62aa7b0fb8dba9427fa892b3/fsspec/implementations/http.py#L267
- Azure expects the content length to be set in the HTTP header, but fsspec (or rather aiohttp) uses "Transfer-Encoding: chunked", which excludes the content length from the header. Azure(ite) rejects such uploads. I cross-checked my suspicion with Postman and in the REPL with httpx (see the sketch below) while watching the HTTP requests in Wireshark: the transfer works when the content length is set, but the chunked mode appears to be the problem. I see the advantage of using chunked mode, especially with compression, but it would be helpful to have the option to just upload data of known size (like a file in the cache), so the content length could be set.
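For reference, the httpx cross-check looked roughly like this (a sketch; the URL is a placeholder, and x-ms-blob-type is the header Azure's Put Blob operation expects):

```python
import httpx

url = "http://127.0.0.1:10000/devstoreaccount1/mycontainer/data.bin?sv=...&sig=..."  # placeholder SAS URL
data = b"some payload"

# Passing bytes lets httpx compute the body size and send a Content-Length
# header; this upload goes through.
ok = httpx.put(url, content=data, headers={"x-ms-blob-type": "BlockBlob"})
print(ok.status_code)

# Passing an iterator forces "Transfer-Encoding: chunked" instead of
# Content-Length; this is the variant Azurite rejects.
rejected = httpx.put(url, content=iter([data]), headers={"x-ms-blob-type": "BlockBlob"})
print(rejected.status_code)
```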
For 1.: as you see, the HTTP method can be defined in the call. I understand that in your case this is buried a couple of calls down; it would be reasonable to have this as an option in the filesystem's __init__ or a config option, although I do hate to add yet more arguments :|

For 2.: we actually do know the size we intend to send, so there's no reason that the Length header should not be set even for a chunked send - we set the size of the callback just above. The reason for chunked is to avoid reading a potentially large file into memory before starting to send. The alternative feels more like a pipe_file operation; having said that, aiohttp prefers to read from a file-like object even if it is all in memory (BytesIO), because that keeps the event loop more responsive.
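Purely as an illustration of what such an option could look like (the put_method argument is hypothetical, nothing like it exists today):

```python
import fsspec

# Hypothetical __init__ option; "put_method" does not exist in HTTPFileSystem yet.
fs = fsspec.filesystem("http", put_method="put")

# Uploads would then issue PUT instead of the current POST default.
fs.put_file("local.bin", "http://127.0.0.1:10000/devstoreaccount1/mycontainer/data.bin?sig=...")
```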
> although I do hate to add yet more arguments :|
Understandable. But is there any reason to use POST as a default, as the function name _put_file() itself implies the use of PUT?
> so there's no reason that the Length header should not be set even for chunked send
Apparently the standard defines that chunked transfer encoding and Content-Length headers exclude each other.
From what I saw while debugging this, the size has to be set explicitly with the Content-Length header, or the data has to provide a size property; otherwise it defaults to chunked transfer encoding.
That's happening here in aiohttp.
I was testing it with a few bytes of data and it was still split into multiple chunks in separate TCP frames, which the server declined. That's aiohttp internals, but apparently a result of the content size being unknown to aiohttp at this point.
> is there any reason to use POST as a default,
It seems to be the most common way to send a file to an HTTP server. I agree that the naming is bad.
> the data has to provide a size property
That at least is easy enough to do. Does it then imply loading the whole file in one go?
> Does it then imply loading the whole file in one go?
Not exactly. As I understand it, it's not about breaking large data transfers up into smaller chunks; "chunked transfer encoding" is intended for data transfers of unknown length, like an open channel for messages. (I just read into the matter, so take it with a grain of salt.)
So, I did a bit of debugging. What apparently happens:

- In the _put_file() method, when uploading a file, fsspec provides a generator object as data to the put/post method of aiohttp: https://github.com/fsspec/filesystem_spec/blob/89626e89022dc9ed62aa7b0fb8dba9427fa892b3/fsspec/implementations/http.py#L304
- Internally, aiohttp maps the data to different kinds of payload, depending on its type. Generators are mapped to AsyncIterablePayload objects, which unfortunately do not define a size property (see the small check below): https://github.com/aio-libs/aiohttp/blob/5b4b7b84cf3fa0cec0384a3115b4d43636834c85/aiohttp/payload.py#L412
- Aiohttp then creates the header and, as it has no size information, it enables chunked transfer encoding: https://github.com/aio-libs/aiohttp/blob/5b4b7b84cf3fa0cec0384a3115b4d43636834c85/aiohttp/client_reqrep.py#L476
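The mapping can be checked directly against aiohttp's payload registry (a small sketch; get_payload is an aiohttp-internal helper, so this is for illustration only):

```python
import io
from aiohttp import payload

async def gen():
    yield b"some payload"

# An async generator is wrapped in AsyncIterablePayload, which reports no size,
# so the request later falls back to chunked transfer encoding.
p = payload.get_payload(gen())
print(type(p).__name__, p.size)  # AsyncIterablePayload None

# A BytesIO is wrapped in a payload class that does know its size.
p = payload.get_payload(io.BytesIO(b"some payload"))
print(type(p).__name__, p.size)  # BytesIOPayload 12
```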
The easiest fix might be setting the Content-Length header for the request explicitly in the _put_file method. It looks like that prevents chunking: https://github.com/aio-libs/aiohttp/blob/5b4b7b84cf3fa0cec0384a3115b4d43636834c85/aiohttp/client_reqrep.py#L474
If I have more time I might test that.
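Roughly what I mean (a sketch of the idea only, not the actual _put_file code; the streaming is simplified):

```python
import os
import aiohttp

CHUNK_SIZE = 5 * 2**20

async def put_with_known_size(session: aiohttp.ClientSession, lpath: str, url: str) -> None:
    """Stream the cached file in chunks, but pass the total size up front so
    aiohttp sends Content-Length instead of chunked transfer encoding."""
    size = os.path.getsize(lpath)

    async def chunks():
        with open(lpath, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                yield chunk

    # The explicit Content-Length header prevents aiohttp from enabling
    # "Transfer-Encoding: chunked" for the generator body.
    async with session.put(url, data=chunks(), headers={"Content-Length": str(size)}) as resp:
        resp.raise_for_status()
```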
If however you give aiohttp a large bytes instance, it complains that you should really pass a file-like object instead, so that it can read and send it in pieces.
In any case, if you would like to propose a PR with options to force the known-size upload that works for Azure, I'm sure we can figure out an API that isn't too complicated for users.
> If however you give aiohttp a large bytes instance, it complains that you should really pass a file-like instead so that it can read and send it in pieces
Aiohttp has this registry to handle each payload datatype differently:
https://github.com/aio-libs/aiohttp/blob/a7bc5e9eeae7c5c90898411962e9a74bf10a9cef/aiohttp/payload.py#L447
The implementation for bytes has a size limit. BytesIO or BufferedReader data could work.
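For example, handing aiohttp an open binary file gives it a payload class that knows the file's size, so Content-Length is set automatically (a sketch with a placeholder URL):

```python
import asyncio
import aiohttp

async def upload_cached_file(url: str, lpath: str) -> None:
    async with aiohttp.ClientSession() as session:
        # A BufferedReader maps to a payload whose size is derived from the
        # file, so aiohttp sends Content-Length rather than chunking the body.
        with open(lpath, "rb") as f:
            async with session.put(url, data=f) as resp:
                resp.raise_for_status()

asyncio.run(upload_cached_file(
    "http://127.0.0.1:10000/devstoreaccount1/mycontainer/data.bin?sig=...",  # placeholder SAS URL
    "cached.bin",
))
```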
On the other hand, the current implementation in fsspec could work if the "Content-Length" header for the request is set. I'll test that when I'm not drowning in work.