dask-examples

Occasional failure in HTTP bytes

Open mrocklin opened this issue 5 years ago • 7 comments

When running CI in this project I sometimes run across the following error:

~/miniconda/envs/test/lib/python3.7/site-packages/dask/bag/core.py in reify()
   1603 def reify(seq):
   1604     if isinstance(seq, Iterator):
-> 1605         seq = list(seq)
   1606     if seq and isinstance(seq[0], Iterator):
   1607         seq = list(map(list, seq))
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bag/core.py in map_chunk()
   1769                 yield f(**k)
   1770     else:
-> 1771         for a in zip(*args):
   1772             yield f(*a)
   1773 
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bag/text.py in file_to_blocks()
    103 def file_to_blocks(lazy_file):
    104     with lazy_file as f:
--> 105         for line in f:
    106             yield line
    107 
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bytes/http.py in read()
    247             # EOF (python files don't error, just return no data)
    248             return b''
--> 249         self._fetch(self.loc, end)
    250         data = self.cache[self.loc - self.start:end - self.start]
    251         self.loc = end
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bytes/http.py in _fetch()
    258             self.start = start
    259             self.end = end + self.blocksize
--> 260             self.cache = self._fetch_range(start, self.end)
    261         elif start < self.start:
    262             if self.end - end > self.blocksize:
~/miniconda/envs/test/lib/python3.7/site-packages/dask/bytes/http.py in _fetch_range()
    320             if cl <= end - start:
    321                 # data size OK
--> 322                 return r.content
    323             else:
    324                 raise ValueError('Got more bytes (%i) than requested (%i)' % (
~/miniconda/envs/test/lib/python3.7/site-packages/requests/models.py in content()
    826                 self._content = None
    827             else:
--> 828                 self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
    829 
    830         self._content_consumed = True
~/miniconda/envs/test/lib/python3.7/site-packages/requests/models.py in generate()
    751                         yield chunk
    752                 except ProtocolError as e:
--> 753                     raise ChunkedEncodingError(e)
    754                 except DecodeError as e:
    755                     raise ContentDecodingError(e)
ChunkedEncodingError: ('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))
You can ignore this error by setting the following in conf.py:
    nbsphinx_allow_errors = True
Notebook error:
CellExecutionError in applications/json-data-on-the-web.ipynb:
------------------
df.spec.value_counts().nlargest(20).to_frame().compute()
------------------

@martindurant, this seems to be in your general domain. Do you have any suggestions on what might be happening here?

mrocklin, Jun 24 '19 07:06

I'm not sure there's much we can do about broken connections; I can't see how it could be any fault of ours. Retries could be built into the HTTPFileSystem, but perhaps it's better to retry the whole task in such cases.

martindurant, Jun 24 '19 15:06

Is there a good reason to avoid retries in HTTPFileSystem?

mrocklin, Jun 24 '19 15:06

No, but a few things make it tricky:

  • it is hard to decide which set of errors should lead to a retry. Perhaps we would have to retry everything
  • some things, like establishing the initial connection, are already retried by requests/urllib
  • if it's a timeout, then a set of retries might take a very long time to fail
  • in the fsspec implementation, there is a non-seekable fallback mode when the file size is unavailable, which gives you a requests file-like object rather than an HTTPFile. I don't think we can easily intercept its read methods for the purposes of catching errors.
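
To make the first and third points concrete, here is a minimal sketch of a retry wrapper that retries only an explicit set of exception types and gives up once an attempt limit or an overall time budget is exhausted. `retry_call` and all of its parameters are hypothetical names for illustration; nothing like this is claimed to exist in dask, fsspec, or requests. For the failure in this issue one would presumably pass `requests.exceptions.ChunkedEncodingError` in `retry_on`, since that exception is not a subclass of the built-in `ConnectionError`.

```python
import time


def retry_call(func, retry_on=(ConnectionError, TimeoutError),
               max_attempts=3, base_delay=0.5, deadline=30.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call func() and retry only on the exception types in retry_on.

    Hypothetical helper, not part of dask, fsspec, or requests.
    Gives up (re-raising the last error) when max_attempts is reached
    or when the next backoff sleep would exceed the overall deadline.
    """
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            delay = base_delay * 2 ** (attempt - 1)
            out_of_attempts = attempt == max_attempts
            out_of_time = clock() - start + delay > deadline
            if out_of_attempts or out_of_time:
                raise
            sleep(delay)
```

Any exception outside `retry_on` propagates immediately, so the caller has to choose the retryable set deliberately. The `deadline` bounds the total time spent waiting between attempts, but it does not interrupt a single call that is itself blocked on a long timeout.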

martindurant, Jun 24 '19 15:06

This SO answer might be the best way to do it globally: https://stackoverflow.com/a/15431343/3821154. It lets you be explicit about retries following a connection error, and the setting applies to all connections within a session.
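
A minimal sketch of that session-level approach, assuming the `requests` and `urllib3` packages are installed. `make_retrying_session` is a hypothetical helper name, and the retry counts, backoff factor, and status codes are illustrative values rather than recommendations.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total=3, backoff_factor=0.5):
    """Return a requests.Session whose adapters retry failed requests.

    Hypothetical helper for illustration; the values are assumptions.
    """
    retry = Retry(
        total=total,                       # overall cap on retries
        connect=total,                     # connection-establishment errors
        read=total,                        # read errors before a response arrives
        backoff_factor=backoff_factor,     # growing sleep between attempts
        status_forcelist=(502, 503, 504),  # also retry these HTTP statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Mount on both schemes so every request made through this session
    # uses the retrying adapter.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

One caveat, as far as I understand how requests streams response bodies: these retries cover establishing the connection and obtaining the response, but a connection that is reset while the body is being read (as in the `r.content` call in the traceback above) may still surface as a `ChunkedEncodingError`, so this alone might not remove the flakiness.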

martindurant, Jun 24 '19 15:06

There has been quite some refactoring of fsspec's HTTP implementation lately.

Are dask tests still flaky? AFAICS, fsspec now returns an HTTPFile even if range requests are not possible. Does that mean a retry policy in fsspec makes more sense now, @martindurant?

ahirner, Jul 13 '20 13:07

HTTPFileSystem might now return an HTTPStreamFile where previously it returned a raw file-like requests response object. I don't think this changes anything from dask's point of view, except that we don't even try the "let's see if this is smaller than a block" approach. A retry would have to be for the whole of the request, not each call to read. However, a retry on establishing the connection (here) would make sense.

martindurant, Jul 13 '20 14:07

(feel free to implement that in a PR, in case you have the time)

martindurant, Aug 07 '20 18:08