s3fs icon indicating copy to clipboard operation
s3fs copied to clipboard

regression: glob not accepting refresh arg anymore

Open flixr opened this issue 3 years ago • 3 comments

glob and some other functions were accepting refresh as kwarg, this doesn't work in 2021.6.1 any more.

ls still accepts refresh, find doesn't...

This is a regression and probably due to changes in fsspec? Happend before: https://github.com/dask/s3fs/issues/207

fs.glob('my-bucket', refresh=True)

What happened:

fails with

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>

/opt/conda/lib/python3.9/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     85     def wrapper(*args, **kwargs):
     86         self = obj or args[0]
---> 87         return sync(self.loop, func, *args, **kwargs)
     88 
     89     return wrapper

/opt/conda/lib/python3.9/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     66         raise FSTimeoutError
     67     if isinstance(result[0], BaseException):
---> 68         raise result[0]
     69     return result[0]
     70 

/opt/conda/lib/python3.9/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     22         coro = asyncio.wait_for(coro, timeout=timeout)
     23     try:
---> 24         result[0] = await coro
     25     except Exception as ex:
     26         result[0] = ex

/opt/conda/lib/python3.9/site-packages/fsspec/asyn.py in _glob(self, path, **kwargs)
    523             depth = None if "**" in path else path[ind + 1 :].count("/") + 1
    524 
--> 525         allpaths = await self._find(
    526             root, maxdepth=depth, withdirs=True, detail=True, **kwargs
    527         )

TypeError: _find() got an unexpected keyword argument 'refresh'

What you expected to happen:

Works as in previous versions

Environment:

  • Python version: 3.9
  • Operating System: Linux
  • Install method (conda, pip, source): conda/pip

flixr avatar Jul 22 '21 10:07 flixr

@isidentical , do you have appetite to look into this one? I can't find when this would have worked in the past, ever since the existence of s3fs's custom _find() - because we never cache directories on the one-shot call anyway.

@flixr - have you found that glob is caching out-of-date listings?

martindurant avatar Jul 22 '21 13:07 martindurant

I'm extremely blocked with other stuff right now, though from what I can see even on 0.6 there was no refresh= parameter on the signature of _find() so I am not sure what caused this regression.

isidentical avatar Jul 22 '21 13:07 isidentical

So it definitely worked on s3fs 0.4.2, not sure about newer ones... IIRC there was also no explicit refresh arg, but it was passed via kwargs: https://github.com/dask/s3fs/issues/207#issuecomment-515531222 The fsspec glob docstring also mentions that kwargs are passed to ls (which understands refresh), but it seems glob doesn't call ls anymore, but instead find...

flixr avatar Jul 22 '21 14:07 flixr

Any updates on this? As of now (2023.12. 2 version), .glob() is caching the results, and there's no way to refresh it, even though the doc-string says that kwargs are passed to .ls(), and .ls() accepts refresh as a kwarg, and .glob() fails with a TypeError:

TypeError: S3FileSystem._find() got an unexpected keyword argument 'refresh'

joaomacalos avatar Jan 24 '24 11:01 joaomacalos

The current implementation of glob via find effectively always has refresh=True: it does not use or touch the directories cache. This is because it fetches the whole directory tree below the given path, rather than being per-directory based as ls. That makes it potentially inefficient, and it could be improved - but would require logic to fill the directory cache and conversely see whether all of the subdirectories required already exist in the directory cache.

martindurant avatar Jan 24 '24 14:01 martindurant

Thanks @martindurant for the answer. However, what I'm finding is that the results of .glob() are cached by default. I tested on the latest version and with the 2023.10.0 version and the second call to .glob() with the same URI loads instantly and without refreshing the values.

joaomacalos avatar Jan 24 '24 15:01 joaomacalos

I stand corrected, only glob("path/**") (i.e., a deep search) has the behaviour I mentioned.

martindurant avatar Jan 24 '24 15:01 martindurant