filesystem_spec
IndexError: list index out of range on 400 HTTP errors
If we get a 400 HTTP error while trying to open a URL with:
fsspec.open(url)
this is not caught and instead an IndexError is raised:
IndexError: list index out of range
See for example:
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 419, in open
    return open_files(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 272, in open_files
    fs, fs_token, paths = get_fs_token_paths(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 586, in get_fs_token_paths
    fs = filesystem(protocol, **inkwargs)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/registry.py", line 253, in filesystem
    return cls(**storage_options)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 76, in __call__
    obj = super().__call__(*args, **kwargs)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/zip.py", line 54, in __init__
    fo = fsspec.open(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 419, in open
    return open_files(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 194, in __getitem__
    out = super().__getitem__(item)
IndexError: list index out of range
I investigated this with a ZIP archive on Zenodo, using multiprocessing in Python 3.10.11, and there is indeed an underlying 429 HTTP error: Too Many Requests. See:
- huggingface/datasets#5862
- huggingface/datasets#5926
Maybe related to:
- #1256
"too many requests" actually sounds like it should be backoff-retriable.
Do you know if the original file is getting opened here? It seems like the exception should bubble up as is, or maybe translate to FileNotFound for a URL that cannot be retrieved.
The URL is: https://zenodo.org/record/7700458/files/bimnli.zip?download=1 I also tried without the query parameter and the problem persists: https://zenodo.org/record/7700458/files/bimnli.zip
Do you know if the original file is getting opened here?
Yes, the previous call in the stack trace was:
file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
My concern is that from fsspec we only get the IndexError and all the information about the 429 HTTP error is lost, because of the last line in fsspec.open: [0]
https://github.com/fsspec/filesystem_spec/blob/fef43b296f40b23e67ee4519fa891bab04be39a9/fsspec/core.py#L419-L429
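To make the masking concrete: the traceback ends in `OpenFiles.__getitem__`, so the list being indexed by `[0]` was empty and the only thing the caller sees is the bare IndexError. Below is a minimal sketch of the kind of guard that would avoid this. The helper name `first_or_raise` is hypothetical and not part of fsspec; it only illustrates turning an empty result into a FileNotFoundError that names the URL.

```python
def first_or_raise(open_files, url):
    """Return the first entry of `open_files`.

    Hypothetical helper (not fsspec API): an empty list becomes a
    FileNotFoundError naming the URL instead of a bare IndexError.
    """
    if not open_files:
        raise FileNotFoundError(url)
    return open_files[0]
```

Note that at this point the original HTTP error is no longer available, so the message can only carry the URL, not the status code.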
Does your URL actually have a "zip" component, or are you showing the URL after stripping, as seen by the HTTP backend?
I suppose the thing to do is to make a path in our test server which raises an exception (400), and another returning 429 that only succeeds on the second attempt, and assert correct behaviour on both.
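As a rough sketch of such a test server, using only the standard library: the paths `/bad` and `/flaky`, the handler name and `start_server` are assumptions made for illustration and do not reflect how the fsspec test suite is actually organised. It also only answers GET; a real test would probably need HEAD as well.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class FlakyHandler(BaseHTTPRequestHandler):
    # Request counter per path, shared by all handler instances.
    hits = {}

    def do_GET(self):
        count = self.hits.get(self.path, 0) + 1
        self.hits[self.path] = count
        if self.path == "/bad":
            self._reply(400, b"bad request")  # always fails
        elif self.path == "/flaky" and count == 1:
            self._reply(429, b"too many requests")  # first attempt only
        elif self.path == "/flaky":
            self._reply(200, b"payload")  # later attempts succeed
        else:
            self._reply(404, b"not found")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


def start_server():
    """Start the server on a free local port in a daemon thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), FlakyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A test could then point the HTTP backend at `http://127.0.0.1:<port>/bad` and `/flaky` (the port is `server.server_address[1]`) and assert on the exception type and on the retry behaviour respectively.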
Yes, the URL had a "zip" component:
zip://bimnli/dev.jsonl::https://zenodo.org/record/7700458/files/bimnli.zip?download=1
When I try it, it works OK. I suppose I need to run it many times until I hit the 429. Please bear with me.
Yes, I used multiprocessing with a pool of 5 workers, so that we get the TOO MANY REQUESTS from the Zenodo server.
Also see:
- https://github.com/huggingface/datasets/issues/5926
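For anyone trying to reproduce this, a sketch along the following lines should be close to what is described above: a pool of 5 workers all opening the same chained URL. The number of tasks (20), the function names and the one-byte read are assumptions for illustration; whether the 429 actually appears depends on Zenodo's rate limiting at the time.

```python
from collections import Counter
from multiprocessing import Pool

# Chained URL from this thread: a member of a remote ZIP archive.
URL = (
    "zip://bimnli/dev.jsonl::"
    "https://zenodo.org/record/7700458/files/bimnli.zip?download=1"
)


def try_open(url):
    """Open the URL in a worker and report the outcome as a string."""
    import fsspec  # imported here so that each worker process loads it

    try:
        with fsspec.open(url, mode="rb") as f:
            f.read(1)
        return "ok"
    except Exception as exc:
        # Return text rather than the exception, so it always pickles.
        return f"{type(exc).__name__}: {exc}"


def run_pool(url=URL, workers=5, tasks=20):
    """Hit the same URL from several processes at once (needs network)."""
    with Pool(workers) as pool:
        return pool.map(try_open, [url] * tasks)


def summarize(outcomes):
    """Count how often each outcome occurred."""
    return dict(Counter(outcomes))
```

Calling `summarize(run_pool())` under an `if __name__ == "__main__":` guard should then show how many attempts succeeded and which exception types came back.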
I tried this now using dask to spin up processes, and got the following log output:
2023-06-06 11:39:16,835 - distributed.protocol.pickle - ERROR - Failed to serialize https://zenodo.org/record/7700458/files/bimnli.zip?download=1.
Traceback (most recent call last):
  File "/Users/mdurant/code/filesystem_spec/fsspec/implementations/http.py", line 417, in _info
    await _file_info(
  File "/Users/mdurant/code/filesystem_spec/fsspec/implementations/http.py", line 837, in _file_info
    r.raise_for_status()
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='TOO MANY REQUESTS', url=URL('https://zenodo.org/record/7700458/files/bimnli.zip?download=1')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/distributed/worker.py", line 3232, in run
    result = function(*args, **kwargs)
  File "<ipython-input-41-06a4a0b3cf9e>", line 2, in <lambda>
  File "/Users/mdurant/code/filesystem_spec/fsspec/core.py", line 134, in open
    return self.__enter__()
  File "/Users/mdurant/code/filesystem_spec/fsspec/core.py", line 102, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/Users/mdurant/code/filesystem_spec/fsspec/spec.py", line 1237, in open
    f = self._open(
  File "/Users/mdurant/code/filesystem_spec/fsspec/implementations/http.py", line 356, in _open
    size = size or self.info(path, **kwargs)["size"]
  File "/Users/mdurant/code/filesystem_spec/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/Users/mdurant/code/filesystem_spec/fsspec/asyn.py", line 106, in sync
    raise return_result
  File "/Users/mdurant/code/filesystem_spec/fsspec/asyn.py", line 61, in _runner
    result[0] = await coro
  File "/Users/mdurant/code/filesystem_spec/fsspec/implementations/http.py", line 430, in _info
    raise FileNotFoundError(url) from exc
FileNotFoundError: https://zenodo.org/record/7700458/files/bimnli.zip?download=1
i.e., the correct error was raised (FileNotFoundError from ClientResponseError 429), but it could not be returned to the main process.
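One way to check whether an exception survives the trip between processes is to round-trip it through pickle. The sketch below is only an illustration: both helpers are hypothetical, and it makes no claim about why serialization failed in the log above. The second helper shows one possible workaround, folding the cause into the message of a plain FileNotFoundError so the information is not lost in transport.

```python
import pickle


def is_picklable(exc):
    """Return True if `exc` can be pickled and unpickled again."""
    try:
        pickle.loads(pickle.dumps(exc))
        return True
    except Exception:
        return False


def flatten_for_transport(exc):
    """Build a plain FileNotFoundError that records the original cause.

    Hypothetical helper: the chained exception object is dropped and only
    its type name and message are kept as text.
    """
    cause = exc.__cause__
    if cause is None:
        return FileNotFoundError(str(exc))
    return FileNotFoundError(
        f"{exc} (caused by {type(cause).__name__}: {cause})"
    )
```

The trade-off is that the receiving process gets a readable message but can no longer inspect the original exception object, for example its status code.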
So there are a few things to do here:
- ~the index error should clearly never happen, this is also FileNotFound, except that we no longer have the cause at that point~ (fixed)
- errors should be serialisable
- 429 should be retriable within HTTPFileSystem, with backoff (so this error should never surface at all). We don't currently implement our own retries (as gcsfs, s3fs and others do) but it wouldn't be so hard. aiohttp will retry a much more limited set of cases where the HTTP connection never completed.
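For the third bullet, a retry wrapper with exponential backoff could look roughly like the sketch below. This is not how HTTPFileSystem is implemented: `HTTPStatusError` is a hypothetical stand-in for `aiohttp.ClientResponseError` (used so the example runs without aiohttp), and the set of retriable statuses, the attempt count and the delays are assumptions.

```python
import asyncio
import random

# Assumption: only 429 is retried here; other statuses could be added.
RETRIABLE_STATUSES = {429}


class HTTPStatusError(Exception):
    """Hypothetical stand-in for aiohttp.ClientResponseError."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


async def retry_with_backoff(
    func, *args, attempts=5, base_delay=0.5, max_delay=30.0, **kwargs
):
    """Await `func`, retrying retriable HTTP errors with exponential backoff.

    `attempts` is the total number of tries and should be at least 1.
    """
    for attempt in range(attempts):
        try:
            return await func(*args, **kwargs)
        except HTTPStatusError as exc:
            last_try = attempt == attempts - 1
            if exc.status not in RETRIABLE_STATUSES or last_try:
                raise  # not retriable, or out of attempts
            # Double the delay each time, capped, plus some jitter so
            # parallel workers do not all retry at the same moment.
            delay = min(max_delay, base_delay * 2**attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

With something like this wrapped around the info and read calls, a transient 429 would be absorbed, while a 400 would still be raised immediately. A real implementation might also honour a `Retry-After` header if the server sends one.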
Thanks for your investigation, @martindurant.
@albertvillanova, do you have any intention of having a go at bullet 3 from above, which I think is the only thing you need here?