s3fs
regression: glob not accepting refresh arg anymore
`glob` and some other functions used to accept `refresh` as a kwarg; this doesn't work in 2021.6.1 any more. `ls` still accepts `refresh`, `find` doesn't...
This is a regression, probably due to changes in fsspec? It happened before: https://github.com/dask/s3fs/issues/207
What happened:

```python
fs.glob('my-bucket', refresh=True)
```

fails with

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>

/opt/conda/lib/python3.9/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     85     def wrapper(*args, **kwargs):
     86         self = obj or args[0]
---> 87         return sync(self.loop, func, *args, **kwargs)
     88
     89     return wrapper

/opt/conda/lib/python3.9/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     66         raise FSTimeoutError
     67     if isinstance(result[0], BaseException):
---> 68         raise result[0]
     69     return result[0]
     70

/opt/conda/lib/python3.9/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     22         coro = asyncio.wait_for(coro, timeout=timeout)
     23     try:
---> 24         result[0] = await coro
     25     except Exception as ex:
     26         result[0] = ex

/opt/conda/lib/python3.9/site-packages/fsspec/asyn.py in _glob(self, path, **kwargs)
    523         depth = None if "**" in path else path[ind + 1 :].count("/") + 1
    524
--> 525         allpaths = await self._find(
    526             root, maxdepth=depth, withdirs=True, detail=True, **kwargs
    527         )

TypeError: _find() got an unexpected keyword argument 'refresh'
```
What you expected to happen:
It works as in previous versions.
Environment:
- Python version: 3.9
- Operating System: Linux
- Install method (conda, pip, source): conda/pip
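The failure mode can be reproduced in isolation. Below is a minimal standalone sketch (hypothetical functions, not actual s3fs/fsspec code) of why the `TypeError` appears: `glob` forwards all of its `**kwargs` on to a `find()` whose signature has no `refresh` parameter.

```python
def find(root, maxdepth=None, withdirs=False, detail=False):
    # Stand-in for S3FileSystem._find(); note: no `refresh` parameter.
    return [root]

def glob(path, **kwargs):
    # Stand-in for fsspec's _glob(), which forwards **kwargs to _find().
    return find(path, maxdepth=1, withdirs=True, detail=True, **kwargs)

print(glob("my-bucket"))  # works: ['my-bucket']
try:
    glob("my-bucket", refresh=True)
except TypeError as exc:
    print(exc)  # find() got an unexpected keyword argument 'refresh'
```

Any keyword that `find()` does not name in its signature triggers the same error, which is why `ls` (whose signature does name `refresh`) still works while `find`-based `glob` does not.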
@isidentical, do you have appetite to look into this one? I can't find when this would ever have worked, going back to the introduction of s3fs's custom `_find()` - because we never cache directories on the one-shot call anyway.
@flixr - have you found that glob is caching out-of-date listings?
I'm extremely blocked with other stuff right now, though from what I can see, even on 0.6 there was no `refresh=` parameter on the signature of `_find()`, so I am not sure what caused this regression.
So it definitely worked on s3fs 0.4.2; not sure about newer ones... IIRC there was also no explicit refresh arg then, but it was passed via kwargs: https://github.com/dask/s3fs/issues/207#issuecomment-515531222
The fsspec `glob` docstring also mentions that kwargs are passed to `ls` (which understands `refresh`), but it seems `glob` doesn't call `ls` anymore, but instead `find`...
Any updates on this? As of now (version 2023.12.2), `.glob()` is caching the results and there's no way to refresh it, even though the docstring says that `kwargs` are passed to `.ls()`, and `.ls()` accepts `refresh` as a kwarg, and `.glob()` fails with a `TypeError`:

```
TypeError: S3FileSystem._find() got an unexpected keyword argument 'refresh'
```
The current implementation of `glob` via `find` effectively always has `refresh=True`: it does not use or touch the directories cache. This is because it fetches the whole directory tree below the given path, rather than working per-directory like `ls`. That makes it potentially inefficient, and it could be improved - but that would require logic to fill the directory cache and, conversely, to check whether all of the required subdirectories already exist in the directory cache.
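The cache-filling logic described above could be sketched roughly as follows. This is a hypothetical helper, not actual s3fs code: it groups a flat, `find()`-style recursive listing by parent directory, so later per-directory `ls()` calls could be answered from the cache.

```python
import posixpath
from collections import defaultdict

def fill_dircache(dircache, entries):
    """Hypothetical helper: group flat find()-style results by their
    parent directory and store each group as one cached listing."""
    by_parent = defaultdict(list)
    for entry in entries:
        parent = posixpath.dirname(entry["name"])
        by_parent[parent].append(entry)
    dircache.update(by_parent)

dircache = {}
fill_dircache(dircache, [
    {"name": "bucket/a.txt", "type": "file"},
    {"name": "bucket/sub", "type": "directory"},
    {"name": "bucket/sub/b.txt", "type": "file"},
])
print(sorted(dircache))  # ['bucket', 'bucket/sub']
```

The harder half, as noted above, is the converse check: deciding whether every subdirectory a deep `glob` would need is already present in the cache before trusting it.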
Thanks @martindurant for the answer. However, what I'm finding is that the results of `.glob()` are cached by default. I tested on the latest version and on 2023.10.0, and the second call to `.glob()` with the same URI loads instantly and without refreshing the values.
I stand corrected: only `glob("path/**")` (i.e., a deep search) has the behaviour I mentioned.
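As a workaround in the meantime, rather than passing `refresh=` to `.glob()`, the cached listings can be dropped explicitly with `invalidate_cache()`, which comes from fsspec's `AbstractFileSystem` and so is available on `S3FileSystem` too. A sketch using the local filesystem for illustration (the same two calls apply to an `s3fs.S3FileSystem` instance):

```python
import os
import tempfile
import fsspec

fs = fsspec.filesystem("file")
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "a.txt"), "w").close()

first = fs.glob(os.path.join(tmp, "*.txt"))
fs.invalidate_cache()  # drop any cached directory listings
second = fs.glob(os.path.join(tmp, "*.txt"))
print(len(second))  # 1
```

This avoids forwarding an unknown kwarg into `_find()` entirely: the cache is cleared up front, so the subsequent `glob` must re-list.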