PyDrive2
PyDrive2 copied to clipboard
Problem with the cache in `GDriveFileSystem.find`
While working on https://github.com/iterative/PyDrive2/pull/222, I discovered that find has a bug with the cache.
Let assume self.path=root/tmp/ and a folder structure like:
root/tmp/
└── fo1/
├── file2.pdf
└── fo2/
├── file3.pdf
└── fo3/
└── file4.pdf
now let's do some tests:
f = fs.find('root/tmp/fo1/')
print(f)
> ['root/tmp/fo1/file2.pdf', 'root/tmp/fo1/fo2/file3.pdf', 'root/tmp/fo1/fo2/fo3/file4.pdf']
f = fs.find('root/tmp/fo1/fo2')
print(f)
> ['root/tmp/fo1/fo2/fo3/file4.pdf', 'root/tmp/fo1/fo2/file3.pdf']
and that is correct,
but if we do only find('root/tmp/fo1/fo2'):
f = fs.find('root/tmp/fo1/fo2')
print(f)
>[]
This happens because find relay on the cache, and at the start the cache is only populated with ids from one level down self.path
so in the last example, the content of the cache is:
{
'1IETDYYj23PgGaInZofa9MyANyBlOoiyh': 'tmp',
'1k6u2-FStB6rOlq6hmDXlRl2aLES1l6vp': 'tmp/fo1',
}
I think, because there is no tmp/fo1/fo2 (the starting path of find), query_ids stays empty and the method return an empty list.
The lines of code involved are: https://github.com/iterative/PyDrive2/blob/27bbf4c8b4ef7fc3cce97f645b0e79d6795bf73c/pydrive2/fs/spec.py#L469-L483
@simone-viozzi yes, thanks for creating the ticket. find is deeply broken, unfortunately. It was originally implemented to serve DVC needs only and we never had time to get back and fix it properly :(
Any idea of a roadmap to fix it? With this broken https://github.com/iterative/PyDrive2/pull/222 is pretty much useless, it will work unexpectedly 90% of the time.
I could try, but I still don't understand how the cache works right now, and how it should work. If I had to implement the cache mechanism from scratch, I would most likely use something like this
With this broken https://github.com/iterative/PyDrive2/pull/222 is pretty much useless, it will work unexpectedly 90% of the time.
Hmm, I think cp doesn't depend on find. Cache itself is not broken, it's find implementation is bad (not general). I think tbh, that we could move forward and merge w/o cache. Especially the cp one.
copy uses expand_path, which uses find.
https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L877
https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L908
cp_file is fine, though, but works for just files.
okay, problems that I see:
- [ ]
maxdepthis not supported - [ ]
withdirsis not supported (must have for expand path) - [ ] doesn't check bucket name
find('.')andfind('etasderwer')return the same result - [ ] it keeps appending ids to cache every time even for already known
- [ ] cache is not locked (multithreading support) - this is a bit more advanced and less critical to start with
Hey @shcheklein, did you know that find is already implemented by ffspec and it uses walk:
https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L423-L453
And also walk is already implemented by ffspec using just ls:
https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L364-L421
@simone-viozzi yes, I know. The default implementation was not good enough for the DVC that is the major driver / consumer of this fsspec for now.