Info call makes unnecessary long listing for "directories"
When path is a "directory" (prefix) we might end up going into full listing operation to just get an info for a single path. We call ls and we don't pass any limits (and it seems GCSFS doesn't support max results atm?).
out = await self._ls(path, **kwargs) # Here we can go into a long listing since we don't limit it
out0 = [o for o in out if o["name"].rstrip("/") == path]
if out0:
# exact hit
return out0[0]
elif out:
# other stuff - must be a directory
Is it intentional for some reason?
On s3 and Azure, in similar place we list to check if at least a single key exists (we don't check for an exact hit).
Can we drop the exact hit check and pass max results into ls?
If I understand, you want there to be a max results argument to ls OR that there be a first-pass exact hit check?
On s3 and Azure, in similar place we list to check if at least a single key exists (we don't check for an exact hit).
It is totally reasonable that all three should have the same logic here, so if you can phrase this in code as a PR, I'd be happy to see it.
thanks @martindurant !
I think s3 / Azure logic is just to have a check that at least some objects exist under the prefix. They don't do exact hit check AFAIU, and they don't list everything.
So, the proposed solution is to:
- remove the exact check (not sure why it was needed tbh?)
- add max results supports to list as minimum as possible (
get_infoshould be a constant in terms of API calls - ideally a single call)
so if you can phrase this in code as a PR, I'd be happy to see it.
I'll try to get there. Intention of this issue was to ask if there were some reasons for the existing logic that I'm not aware / can't see.
Just adding a +1 here - this also applies to exists calls, so .exists(directory) ends up fetching info for every blob with that prefix, rather than just checking if any blob with that prefix exists. This is pretty catastrophically expensive for large directories, both in time and memory.
I note that here s3fs does indeed request exactly one key for the exists() check. I don't think this happens elsewhere.
I'd be happy to see a PR which enforces a single object for exists/info. It leaves open the question of whether to cache this result, though - we normally only store whole directory listings.
PR #676 fixes the info method by setting maxResults to 1, it no more relies on ls for listing. exists method internally relies on the implementation of info, so this fixes the performance issue with both the methods.
I've validated the calls before and after PR #676. Earlier it used to make multiple GET calls because of usage of ls(pagination with 5000 results per page). Now with the latest code it only makes 2 calls if the path is directory. One call is to check if the path is a file(which fails in case of directory) and one more call is to list objects(maxResults is set to 1, so just checks if some file exists with the given prefix).
@shcheklein Please let us know if this issue can be closed as we have the fix in place?
hey @Mahalaxmibejugam , yes absolutely. It's hard for me tell if it covers all the cases, it's been a while since I touched this, I'm totally fine to close this.