PyDrive2 icon indicating copy to clipboard operation
PyDrive2 copied to clipboard

Problem with the cache in `GDriveFileSystem.find`

Open simone-viozzi opened this issue 3 years ago • 7 comments

While working on https://github.com/iterative/PyDrive2/pull/222, I discovered that find has a bug with the cache.

Let assume self.path=root/tmp/ and a folder structure like:

root/tmp/
        └── fo1/
           ├── file2.pdf
           └── fo2/
              ├── file3.pdf
              └── fo3/
                 └── file4.pdf

now let's do some tests:

f = fs.find('root/tmp/fo1/')
print(f)
> ['root/tmp/fo1/file2.pdf', 'root/tmp/fo1/fo2/file3.pdf', 'root/tmp/fo1/fo2/fo3/file4.pdf']

f = fs.find('root/tmp/fo1/fo2')
print(f)
> ['root/tmp/fo1/fo2/fo3/file4.pdf', 'root/tmp/fo1/fo2/file3.pdf']

and that is correct, but if we do only find('root/tmp/fo1/fo2'):

f = fs.find('root/tmp/fo1/fo2')
print(f)
>[]

This happens because find relay on the cache, and at the start the cache is only populated with ids from one level down self.path

so in the last example, the content of the cache is:

{
'1IETDYYj23PgGaInZofa9MyANyBlOoiyh': 'tmp', 
'1k6u2-FStB6rOlq6hmDXlRl2aLES1l6vp': 'tmp/fo1', 
}

I think, because there is no tmp/fo1/fo2 (the starting path of find), query_ids stays empty and the method return an empty list.

The lines of code involved are: https://github.com/iterative/PyDrive2/blob/27bbf4c8b4ef7fc3cce97f645b0e79d6795bf73c/pydrive2/fs/spec.py#L469-L483

simone-viozzi avatar Sep 02 '22 16:09 simone-viozzi

@simone-viozzi yes, thanks for creating the ticket. find is deeply broken, unfortunately. It was originally implemented to serve DVC needs only and we never had time to get back and fix it properly :(

shcheklein avatar Sep 06 '22 18:09 shcheklein

Any idea of a roadmap to fix it? With this broken https://github.com/iterative/PyDrive2/pull/222 is pretty much useless, it will work unexpectedly 90% of the time.

I could try, but I still don't understand how the cache works right now, and how it should work. If I had to implement the cache mechanism from scratch, I would most likely use something like this

simone-viozzi avatar Sep 09 '22 11:09 simone-viozzi

With this broken https://github.com/iterative/PyDrive2/pull/222 is pretty much useless, it will work unexpectedly 90% of the time.

Hmm, I think cp doesn't depend on find. Cache itself is not broken, it's find implementation is bad (not general). I think tbh, that we could move forward and merge w/o cache. Especially the cp one.

shcheklein avatar Sep 10 '22 00:09 shcheklein

copy uses expand_path, which uses find. https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L877 https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L908

cp_file is fine, though, but works for just files.

simone-viozzi avatar Sep 10 '22 09:09 simone-viozzi

okay, problems that I see:

  • [ ] maxdepth is not supported
  • [ ] withdirs is not supported (must have for expand path)
  • [ ] doesn't check bucket name find('.') and find('etasderwer') return the same result
  • [ ] it keeps appending ids to cache every time even for already known
  • [ ] cache is not locked (multithreading support) - this is a bit more advanced and less critical to start with

shcheklein avatar Sep 10 '22 19:09 shcheklein

Hey @shcheklein, did you know that find is already implemented by ffspec and it uses walk: https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L423-L453

And also walk is already implemented by ffspec using just ls: https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L364-L421

simone-viozzi avatar Sep 21 '22 19:09 simone-viozzi

@simone-viozzi yes, I know. The default implementation was not good enough for the DVC that is the major driver / consumer of this fsspec for now.

shcheklein avatar Sep 21 '22 19:09 shcheklein