gdrivefs
Support for Shared Drives
Currently, gdrivefs doesn't support shared drives.
I have a setup like:
import json
import os

root_folder: str = "gdrive://Discovery Folder/Worksheets"
storage_options: dict = {
    "token": "service_account",
    "access": "read_only",
    "creds": json.loads(os.environ["GOOGLE_APPLICATION_CREDENTIALS"]),
    "root_file_id": "0123456789ABCDEFGH",
}
If I attempt to access that file (using commit 2b48baa11d1697401c914e5ff239dbab4d9c8f71), I get the error:
FileNotFoundError: Directory 0123456789ABCDEFGH has no child named Discovery Folder
  File "./pipelines/assets/base.py", line 210, in original_files
    with p.fs.open(p.path, mode="rb") as f:
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 1295, in open
    f = self._open(
  File "./lib/python3.10/site-packages/gdrivefs/core.py", line 249, in _open
    return GoogleDriveFile(self, path, mode=mode, **kwargs)
  File "./lib/python3.10/site-packages/gdrivefs/core.py", line 270, in __init__
    super().__init__(fs, path, mode, block_size, autocommit=autocommit,
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 1651, in __init__
    self.size = self.details["size"]
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 1664, in details
    self._details = self.fs.info(self.path)
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 662, in info
    out = self.ls(path, detail=True, **kwargs)
  File "./lib/python3.10/site-packages/gdrivefs/core.py", line 174, in ls
    files = self._ls_from_cache(path)
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 372, in _ls_from_cache
    raise FileNotFoundError(path)
The root_file_id is set to the folder ID of a Google Drive shared drive (see https://support.google.com/a/users/answer/7212025?hl=en).
As per https://developers.google.com/drive/api/guides/enable-shareddrives#:~:text=The%20supportsAllDrives%3Dtrue%20parameter%20informs,require%20additional%20shared%20drive%20functionality. we need to set supportsAllDrives=True and includeItemsFromAllDrives=True when calling files.list in order for the API client to find the files.
In my case, if I change the existing:
def _list_directory_by_id(self, file_id, trashed=False, path_prefix=None):
    all_files = []
    page_token = None
    afields = 'nextPageToken, files(%s)' % fields
    query = f"'{file_id}' in parents "
    if not trashed:
        query += "and trashed = false "
    while True:
        response = self.service.list(q=query,
                                     spaces=self.spaces, fields=afields,
                                     pageToken=page_token,
                                     ).execute()
        for f in response.get('files', []):
            all_files.append(_finfo_from_response(f, path_prefix))
        more = response.get('incompleteSearch', False)
        page_token = response.get('nextPageToken', None)
        if page_token is None:
            break
    return all_files
to
def _list_directory_by_id(self, file_id, trashed=False, path_prefix=None):
    all_files = []
    page_token = None
    afields = 'nextPageToken, files(%s)' % fields
    query = f"'{file_id}' in parents "
    if not trashed:
        query += "and trashed = false "
    while True:
        response = self.service.list(
            q=query,
            spaces=self.spaces, fields=afields,
            pageToken=page_token,
            includeItemsFromAllDrives=True,  # Required for shared drive support
            supportsAllDrives=True,  # Required for shared drive support
        ).execute()
        for f in response.get('files', []):
            all_files.append(_finfo_from_response(f, path_prefix))
        more = response.get('incompleteSearch', False)
        page_token = response.get('nextPageToken', None)
        if page_token is None:
            break
    return all_files
(note the change in the call to self.service.list)
then my code works, and the filesystem can find the file and open it successfully.
I am happy to prepare an MR, but you would need to decide whether shared drive support should be enabled in all cases or controlled via storage_options. If it is controlled via storage_options, should it default to off (completely backwards compatible) or on (existing users with shared drives may see new files that gdrivefs does not currently return)?
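To make the opt-in variant concrete, here is a minimal sketch of how a flag could gate the two extra parameters. The shared_drives option and the list_kwargs helper are hypothetical names made up for illustration; neither exists in gdrivefs today.

```python
def list_kwargs(query, spaces, fields, page_token=None, shared_drives=False):
    """Build keyword arguments for a Drive v3 files.list call.

    `shared_drives` is a hypothetical storage option, not part of gdrivefs.
    """
    kwargs = {
        "q": query,
        "spaces": spaces,
        "fields": fields,
        "pageToken": page_token,
    }
    if shared_drives:
        # Both parameters are needed for items on shared drives to be returned.
        kwargs["supportsAllDrives"] = True
        kwargs["includeItemsFromAllDrives"] = True
    return kwargs


# Default (off): the request is identical to what gdrivefs sends today.
default_kwargs = list_kwargs("'abc' in parents", "drive", "files(id)")

# Opt-in: the two shared drive parameters are added.
opt_in_kwargs = list_kwargs(
    "'abc' in parents", "drive", "files(id)", shared_drives=True
)
```

Inside _list_directory_by_id the call would then presumably become self.service.list(**kwargs).execute(). Defaulting the flag to off keeps the current request unchanged for existing users.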
Actually, I see there was already a request for this in #26.
Yes, exactly so - I believe this is well worth adding, but I am unsure how to expose the possibility to users. I believe simply checking all possible drives every time is probably a substantial slowdown, but I am happy to be told otherwise.
@martindurant when you say "checking all possible drives", do you mean in the drives property, or in _list_directory_by_id?
I've only just started using gdrivefs, but it seems that you need to specify an exact path from the root folder set in the storage options, so I don't think enabling shared drives universally would be any slower. If you don't set the shared drive folder (or one of its subfolders) as the root_file_id in storage_options, then the filesystem won't be searching it.
And the mechanism that finds the exact file ID executes one request/response per path segment, so its performance seems to depend on how many levels deep your path is from the root_file_id rather than on how many other folders there are that don't match the path.
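A toy model of that lookup, to show why the cost tracks path depth. This is not the gdrivefs implementation: the in-memory FAKE_TREE, the folder IDs, and the list_children and resolve_path helpers are all invented for illustration, with list_children standing in for a single files.list request.

```python
# Fake folder tree standing in for Drive: folder id -> {child name: child id}.
# All ids and names here are invented for illustration.
FAKE_TREE = {
    "root": {"Discovery Folder": "f1", "Other": "f2", "Unrelated": "f3"},
    "f1": {"Worksheets": "f4", "Archive": "f5"},
    "f4": {"sheet.xlsx": "f6"},
}

request_count = 0


def list_children(folder_id):
    """Stand-in for one files.list request on a single parent folder."""
    global request_count
    request_count += 1
    return FAKE_TREE.get(folder_id, {})


def resolve_path(root_id, path):
    """Walk `path` from `root_id`, listing one folder per path segment."""
    current = root_id
    for segment in path.strip("/").split("/"):
        children = list_children(current)
        if segment not in children:
            raise FileNotFoundError(
                f"Directory {current} has no child named {segment}"
            )
        current = children[segment]
    return current


file_id = resolve_path("root", "Discovery Folder/Worksheets/sheet.xlsx")
```

The three-segment path triggers three listings, and the sibling folders "Other", "Unrelated" and "Archive" are never listed themselves, so they add no requests.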