IndexError: list index out of range when 400 HTTP errors

Open albertvillanova opened this issue 2 years ago • 12 comments

If we get a 400 HTTP error while trying to open a URL with:

fsspec.open(url)

this is not caught and instead an IndexError is raised:

IndexError: list index out of range

See for example:

  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 419, in open
    return open_files(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 272, in open_files
    fs, fs_token, paths = get_fs_token_paths(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 586, in get_fs_token_paths
    fs = filesystem(protocol, **inkwargs)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/registry.py", line 253, in filesystem
    return cls(**storage_options)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 76, in __call__
    obj = super().__call__(*args, **kwargs)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/zip.py", line 54, in __init__
    fo = fsspec.open(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 419, in open
    return open_files(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 194, in __getitem__
    out = super().__getitem__(item)
IndexError: list index out of range

I investigated this with a ZIP archive hosted on Zenodo, using multiprocessing in Python 3.10.11, and there is indeed an underlying 429 HTTP error (Too Many Requests). See:

  • huggingface/datasets#5862
  • huggingface/datasets#5926

Maybe related to:

  • #1256

albertvillanova avatar May 15 '23 13:05 albertvillanova

"too many requests" actually sounds like it should be backoff-retriable.

Do you know if the original file is getting opened here? It seems like the exception should bubble up as is, or maybe be translated to FileNotFound for a URL that cannot be retrieved.

martindurant avatar May 15 '23 14:05 martindurant

The URL is: https://zenodo.org/record/7700458/files/bimnli.zip?download=1 I also tried without the query parameter and the problem persists: https://zenodo.org/record/7700458/files/bimnli.zip

albertvillanova avatar May 15 '23 14:05 albertvillanova

Do you know if the original file is getting opened here?

Yes, the previous call in the stack trace was:

file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()

albertvillanova avatar May 15 '23 14:05 albertvillanova

My concern is that from fsspec we only get the IndexError and all the information about the 429 HTTP error is lost, because of the last line in fsspec.open: [0] https://github.com/fsspec/filesystem_spec/blob/fef43b296f40b23e67ee4519fa891bab04be39a9/fsspec/core.py#L419-L429
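To illustrate the concern, here is a minimal sketch. It is not fsspec's actual code: `resolve_paths`, `open_first_unchecked` and `open_first_checked` are hypothetical stand-ins, and the sketch assumes the failed request leaves the caller with an empty list. Indexing that list with `[0]` then produces a bare IndexError, while an explicit check can raise an error that at least names the URL.

```python
def resolve_paths(url, fail=False):
    """Hypothetical resolver: returns [] when the remote request failed."""
    return [] if fail else [url]


def open_first_unchecked(url, fail=False):
    # Indexes [0] on the result without checking it first.
    return resolve_paths(url, fail=fail)[0]


def open_first_checked(url, fail=False):
    # Checks for an empty result and raises an error that names the URL.
    paths = resolve_paths(url, fail=fail)
    if not paths:
        raise FileNotFoundError(url)
    return paths[0]
```

With `fail=True`, the unchecked variant raises `IndexError: list index out of range`, and the checked variant raises `FileNotFoundError` carrying the URL.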

albertvillanova avatar May 15 '23 14:05 albertvillanova

Does your URL actually have a "zip" component, or are you showing the URL after stripping, as seen by the HTTP backend?

I suppose the thing to do is to make a path in our test server which raises an exception (400), and another returning 429 that only succeeds on the second attempt, and assert correct behaviour on both.
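A rough sketch of what such test endpoints could look like, using only the standard library rather than fsspec's own test server. The paths `/bad_request` and `/flaky`, and the helpers `start_server` and `status_of`, are made up for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, build_opener

# Opener that ignores any proxy environment variables, so localhost is hit directly.
_opener = build_opener(ProxyHandler({}))


class FlakyHandler(BaseHTTPRequestHandler):
    attempts = {}  # request count per path, shared by all requests

    def do_GET(self):
        count = self.attempts.get(self.path, 0) + 1
        self.attempts[self.path] = count
        if self.path == "/bad_request":
            self.send_error(400)  # always fails
        elif self.path == "/flaky" and count == 1:
            self.send_error(429)  # fails on the first attempt only
        elif self.path == "/flaky":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep test output quiet


def start_server():
    """Start the server on a free local port in a daemon thread."""
    server = HTTPServer(("127.0.0.1", 0), FlakyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def status_of(url):
    """Return the HTTP status code for a GET request to url."""
    try:
        with _opener.open(url) as resp:
            return resp.status
    except HTTPError as exc:
        return exc.code
```

A test could then assert that the 400 path surfaces a meaningful error, and that the 429 path succeeds once the client retries.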

martindurant avatar May 16 '23 13:05 martindurant

Yes, the URL had a "zip" component:

zip://bimnli/dev.jsonl::https://zenodo.org/record/7700458/files/bimnli.zip?download=1

albertvillanova avatar May 17 '23 07:05 albertvillanova

When I try it, it works OK. I suppose I need to run it many times until I hit the 429. Please bear with me.

martindurant avatar May 17 '23 15:05 martindurant

Yes, I used multiprocessing with a pool of 5 workers, so that we get the TOO MANY REQUESTS from the Zenodo server.

albertvillanova avatar May 22 '23 13:05 albertvillanova

Also see:

  • https://github.com/huggingface/datasets/issues/5926

albertvillanova avatar Jun 06 '23 14:06 albertvillanova

I tried this now using dask to spin up processes, and it gave the following log output:

2023-06-06 11:39:16,835 - distributed.protocol.pickle - ERROR - Failed to serialize https://zenodo.org/record/7700458/files/bimnli.zip?download=1.
Traceback (most recent call last):
  File "/Users/mdurant/code/filesystem_spec/fsspec/implementations/http.py", line 417, in _info
    await _file_info(
  File "/Users/mdurant/code/filesystem_spec/fsspec/implementations/http.py", line 837, in _file_info
    r.raise_for_status()
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='TOO MANY REQUESTS', url=URL('https://zenodo.org/record/7700458/files/bimnli.zip?download=1')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/distributed/worker.py", line 3232, in run
    result = function(*args, **kwargs)
  File "<ipython-input-41-06a4a0b3cf9e>", line 2, in <lambda>
  File "/Users/mdurant/code/filesystem_spec/fsspec/core.py", line 134, in open
    return self.__enter__()
  File "/Users/mdurant/code/filesystem_spec/fsspec/core.py", line 102, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/Users/mdurant/code/filesystem_spec/fsspec/spec.py", line 1237, in open
    f = self._open(
  File "/Users/mdurant/code/filesystem_spec/fsspec/implementations/http.py", line 356, in _open
    size = size or self.info(path, **kwargs)["size"]
  File "/Users/mdurant/code/filesystem_spec/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/Users/mdurant/code/filesystem_spec/fsspec/asyn.py", line 106, in sync
    raise return_result
  File "/Users/mdurant/code/filesystem_spec/fsspec/asyn.py", line 61, in _runner
    result[0] = await coro
  File "/Users/mdurant/code/filesystem_spec/fsspec/implementations/http.py", line 430, in _info
    raise FileNotFoundError(url) from exc
FileNotFoundError: https://zenodo.org/record/7700458/files/bimnli.zip?download=1

i.e., the correct error was raised (FileNotFoundError from ClientResponseError 429), but it could not be returned to the main process.
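One common way a Python exception fails to cross a process boundary is when its `__init__` requires arguments that are not stored in `args`. The log above does not establish that this is the cause here, so the following is only a generic sketch; `FragileError`, `PicklableError` and `round_trips` are hypothetical names.

```python
import pickle


class FragileError(Exception):
    """Hypothetical error whose __init__ signature differs from its args."""

    def __init__(self, status, message):
        # Only the formatted string ends up in self.args.
        super().__init__(f"{status}, message={message!r}")
        self.status = status
        self.message = message


class PicklableError(FragileError):
    """Same error, but it tells pickle how to rebuild itself."""

    def __reduce__(self):
        return (type(self), (self.status, self.message))


def round_trips(exc):
    """Return True if the exception survives pickle.dumps/loads."""
    try:
        pickle.loads(pickle.dumps(exc))
    except Exception:
        return False
    return True
```

Here `FragileError` fails on unpickling because pickle calls the class with the single formatted string from `args`, whereas `PicklableError` supplies both constructor arguments through `__reduce__`.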

So there are a few things to do here:

  • ~the index error should clearly never happen, this is also FileNotFound, except that we no longer have the cause at that point~ (fixed)
  • errors should be serialisable
  • 429 should be retriable within HTTPFileSystem, with backoff (so this error should never surface at all). We don't currently implement our own retries (as gcsfs, s3 and others do) but it wouldn't be so hard. aiohttp will retry a much more limited set of cases where the HTTP connection never completed.
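As a rough illustration of the third bullet, here is a minimal retry-with-backoff sketch. It is not HTTPFileSystem code: `HTTPStatusError` and `retry_with_backoff` are hypothetical names, and the default of five attempts with a 0.5 s starting delay is an arbitrary assumption.

```python
import time


class HTTPStatusError(Exception):
    """Hypothetical stand-in for an HTTP client error carrying a status code."""

    def __init__(self, status):
        super().__init__(status)
        self.status = status


def retry_with_backoff(func, retries=5, base_delay=0.5, retriable=(429,),
                       sleep=time.sleep):
    """Call func() up to `retries` times, backing off exponentially on retriable statuses."""
    for attempt in range(retries):
        try:
            return func()
        except HTTPStatusError as exc:
            # Give up on non-retriable statuses or once attempts are exhausted.
            if exc.status not in retriable or attempt == retries - 1:
                raise
            # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so a test can record the delays instead of waiting. A real implementation would probably also add random jitter so that parallel workers do not retry in lockstep.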

martindurant avatar Jun 06 '23 15:06 martindurant

Thanks for your investigation, @martindurant.

albertvillanova avatar Jun 07 '23 07:06 albertvillanova

@albertvillanova , do you have any intention of having a go at bullet 3 from above, which I think is the only thing you need here?

martindurant avatar Jun 20 '23 15:06 martindurant