filesystem_spec icon indicating copy to clipboard operation
filesystem_spec copied to clipboard

BytesCache use of .fetcher might lead to a stall if original coroutine stalls?

Open yarikoptic opened this issue 6 months ago • 1 comments

We have an issue

  • https://github.com/dandi/dandi-cli/issues/1450 that one of our tests once in a while hangs (well, times out at 300 sec), but typically succeeds ok.
the "fuller" traceback is
dandi/pynwb_utils.py:206: in _get_pynwb_metadata
    with open_readable(path) as fp, h5py.File(fp, "r") as h5, NWBHDF5IO(
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/h5py/_hl/files.py:562: in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/h5py/_hl/files.py:235: in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper
    ???
h5py/h5f.pyx:102: in h5py.h5f.open
    ???
h5py/h5fd.pyx:163: in h5py.h5fd.H5FD_fileobj_read
    ???
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/spec.py:1915: in readinto
    data = self.read(out.nbytes)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/implementations/http.py:598: in read
    return super().read(length)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/spec.py:1897: in read
    out = self.cache._fetch(self.loc, self.loc + length)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/caching.py:481: in _fetch
    self.cache = self.fetcher(start, bend)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/asyn.py:118: in wrapper
    return sync(self.loop, func, *args, **kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/asyn.py:91: in sync
    if event.wait(1):
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/threading.py:558: in wait
    signaled = self._cond.wait(timeout)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/threading.py:306: in wait
    gotit = waiter.acquire(True, timeout)

relevant traceback is

/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/caching.py:481: in _fetch
    self.cache = self.fetcher(start, bend)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/asyn.py:118: in wrapper
    return sync(self.loop, func, *args, **kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/asyn.py:91: in sync
    if event.wait(1):
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/threading.py:558: in wait
    signaled = self._cond.wait(timeout)

and the fact that asyn.py:sync signature is

def sync(loop, func, *args, timeout=None, **kwargs):

and code does not set any other timeout and keeps going until the end of times in that one unless original coroutine/thread signals:

    event = threading.Event()
    asyncio.run_coroutine_threadsafe(_runner(event, coro, result, timeout), loop)
    while True:
        # this loops allows thread to get interrupted
        if event.wait(1):
            break

but I guess in our case the http layer does not timeout promptly upon some odd connectivity issue? are reasonable timeouts set there?

yarikoptic avatar Aug 24 '24 03:08 yarikoptic