
GCSFilesystem creation is surprisingly slow

Open d-v-b opened this issue 1 year ago • 17 comments

This snippet takes ~24s to run on my laptop in Germany:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "fsspec",
#     "gcsfs",
# ]
# ///
from fsspec import url_to_fs
from time import time
url = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'
if __name__ == '__main__':
    start = time()
    _ = url_to_fs(url)
    elapsed = time() - start
    print(elapsed)

26s seems a bit long for string parsing -- clearly something else is going on inside this function -- but I'm a bit surprised that there's anything else happening here, given the apparent simplicity of the URL. Is this expected behavior, or something that should be fixed?

d-v-b avatar Jan 08 '25 11:01 d-v-b

Aha, further testing indicates that creating GCSFilesystem is the culprit. Any ideas how to speed this up?

d-v-b avatar Jan 08 '25 11:01 d-v-b

Specifying token='anon' reduces the runtime to a fraction of a second:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "fsspec",
#     "gcsfs",
# ]
# ///
from fsspec import url_to_fs
from time import time

url = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'

if __name__ == '__main__':
    start = time()
    _ = url_to_fs(url, token='anon')
    elapsed = time() - start
    print(elapsed)

So I'm guessing the credentials search process is consuming most of the 25ish seconds I observed with the first version of this script.

d-v-b avatar Jan 08 '25 11:01 d-v-b

the credentials search process is consuming most of the 25ish seconds

The default behaviour (unless otherwise specified) is to try various credential verification mechanisms in turn until one works; anon is the fallback when everything else has failed. Unfortunately, there are many possibilities. Maybe there's an opportunity to put very short timeouts on processes that ought to be fast, like _connect_cloud, which queries the GCP metadata service and is only reachable from within Google's network.

It would be worthwhile putting timing around the methods in gcsfs.credentials.GoogleCredentials.connect to figure out whether one particular method is the problem. For me, I don't see the problem:

2025-01-08 10:16:43,247 - gcsfs.credentials - DEBUG - connect -- Connection with method "google_default" failed
Traceback (most recent call last):
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 235, in connect
    self.connect(method=meth)
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 253, in connect
    self.__getattribute__("_connect_" + method)()
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 77, in _connect_google_default
    credentials, project = gauth.default(scopes=[self.scope])
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/_default.py", line 615, in default
    credentials, project_id = checker()
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/_default.py", line 608, in <lambda>
    lambda: _get_explicit_environ_credentials(quota_project_id=quota_project_id),
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/_default.py", line 228, in _get_explicit_environ_credentials
    credentials, project_id = load_credentials_from_file(
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/_default.py", line 116, in load_credentials_from_file
    raise exceptions.DefaultCredentialsError(
google.auth.exceptions.DefaultCredentialsError: File /Users/mdurant/Downloads/it-4397-aol-71cec3094fa0.json was not found.
FAIL google_default 0.0012540817260742188
2025-01-08 10:16:43,249 - gcsfs.credentials - DEBUG - connect -- Connection with method "cache" failed
Traceback (most recent call last):
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 235, in connect
    self.connect(method=meth)
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 253, in connect
    self.__getattribute__("_connect_" + method)()
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 104, in _connect_cache
    raise ValueError("No cached tokens")
ValueError: No cached tokens
FAIL cache 0.00011086463928222656
2025-01-08 10:16:43,298 - gcsfs.credentials - DEBUG - connect -- Connection with method "cloud" failed
Traceback (most recent call last):
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/credentials.py", line 111, in refresh
    self._retrieve_info(request)
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/credentials.py", line 87, in _retrieve_info
    info = _metadata.get_service_account_info(
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/_metadata.py", line 234, in get_service_account_info
    return get(request, path, params={"recursive": "true"})
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/_metadata.py", line 163, in get
    raise exceptions.TransportError(
google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 97, in _connect_cloud
    self.credentials.refresh(req)
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/credentials.py", line 117, in refresh
    six.raise_from(new_exc, caught_exc)
  File "<string>", line 3, in raise_from
google.auth.exceptions.RefreshError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 235, in connect
    self.connect(method=meth)
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 253, in connect
    self.__getattribute__("_connect_" + method)()
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 99, in _connect_cloud
    raise ValueError("Invalid gcloud credentials") from error
ValueError: Invalid gcloud credentials
FAIL cloud 0.050685882568359375
2025-01-08 10:16:43,299 - gcsfs.credentials - DEBUG - connect -- Connected with method anon

(I did the following to get the times:

--- a/gcsfs/credentials.py
+++ b/gcsfs/credentials.py
@@ -229,6 +229,8 @@ class GoogleCredentials:
             self._connect_token(method)
         elif method is None:
             for meth in ["google_default", "cache", "cloud", "anon"]:
+                import time
+                t = time.time()
                 try:
                     self.connect(method=meth)
                     logger.debug("Connected with method %s", meth)
@@ -242,6 +244,7 @@ class GoogleCredentials:
                     # Reset credentials if they were set but the authentication failed
                     # (reverts to 'anon' behavior)
                     self.credentials = None
+                    print("FAIL", meth, time.time()-t)
             else:
                 # Since the 'anon' connection method should always succeed,

)
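For anyone who'd rather not patch gcsfs, the same per-method timing can be sketched standalone. The method names below mirror gcsfs's fallback order, but the callables are stand-ins for the real connection attempts:

```python
import time

def fail(msg):
    """Return a callable that raises, standing in for a failing auth method."""
    def _f():
        raise ValueError(msg)
    return _f

def try_methods(methods):
    """Try each (name, fn) in turn, printing how long each failure took;
    return the name of the first method that succeeds, or None."""
    for name, fn in methods:
        t0 = time.perf_counter()
        try:
            fn()
        except Exception as exc:
            print(f"FAIL {name} {time.perf_counter() - t0:.4f}s ({exc})")
        else:
            print(f"OK   {name} {time.perf_counter() - t0:.4f}s")
            return name
    return None

if __name__ == "__main__":
    # Stand-in chain matching ["google_default", "cache", "cloud", "anon"]
    chain = [
        ("google_default", fail("no credentials file")),
        ("cache", fail("no cached tokens")),
        ("cloud", fail("metadata server unavailable")),
        ("anon", lambda: None),
    ]
    print("connected with", try_methods(chain))
```

Replacing the stand-ins with real calls into GoogleCredentials would show which step eats the ~25s.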

martindurant avatar Jan 08 '25 15:01 martindurant

I noticed the same issue in universal-pathlib, where the gcs tests were very slow. It only occurred without a working internet connection, because then all the timeouts accumulated.

setting token='anon' restored the short runtime of the test suite, when running without access to the internet: https://github.com/fsspec/universal_pathlib/pull/312/commits/d034637bb038d93f2ff98697dc2a326d7bcc7c57

(the test suite runs against a gcs docker container)

ap-- avatar Jan 09 '25 14:01 ap--

I am unfortunately finding that the default connection methods still complete quickly and fall back to anon, even when I turn off the network.

martindurant avatar Jan 09 '25 14:01 martindurant

@ap-- and I are both in Germany, so perhaps regional connectivity is part of the story here

d-v-b avatar Jan 09 '25 14:01 d-v-b

Thanks folks for bringing this up. It has been biting us in different environments for a while now. I'm able to reproduce it easily at home, in the office, and in many other places.

  • Related PR - https://github.com/googleapis/google-auth-library-python/pull/1610 - our team put together this fix to at least respect an environment variable (it's actually needed to completely disable metadata-server auth even when we are on GCE, since a service account is not guaranteed to be set up, and it fails even on GCE with an error). This env variable might help with fsspec as well.

  • Here is a relevant discussion https://github.com/iterative/datachain/issues/740#issuecomment-2564793558 just 2 weeks ago.


I am unfortunately finding that running the default connection methods still happens fast and falls back to anon, even when I turn off the network.

@martindurant I think the difference should be here - https://github.com/googleapis/google-auth-library-python/blob/main/google/auth/compute_engine/_metadata.py#L207-L227

For me, I'm getting immediately into:

except exceptions.TransportError as e:
    _LOGGER.warning(
        "Compute Engine Metadata server unavailable on "
        "attempt %s of %s. Reason: %s",
        attempt,
        retry_count,
        e,
    )

with the message: Failed to resolve 'metadata.google.internal' ([Errno 8] nodename nor servname provided, or not known)

It means it goes into retry with an exponential backoff.
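The cumulative effect of retrying with exponential backoff can be sketched with some back-of-envelope arithmetic. The numbers below are illustrative assumptions, not google-auth's actual configuration:

```python
def worst_case_wait(timeout: float, attempts: int, base_backoff: float) -> float:
    """Total wall-clock time if every attempt times out, with a doubling
    backoff sleep between consecutive attempts."""
    total = 0.0
    backoff = base_backoff
    for attempt in range(attempts):
        total += timeout  # each attempt blocks for the full timeout
        if attempt < attempts - 1:
            total += backoff  # sleep before the next attempt
            backoff *= 2
    return total

if __name__ == "__main__":
    # e.g. 5 attempts x 3s timeout, plus 1+2+4+8s of backoff, is 30s total
    print(worst_case_wait(timeout=3.0, attempts=5, base_backoff=1.0))
```

With larger per-attempt timeouts the total quickly reaches the tens of seconds reported above.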

In your case, for some reason, metadata.google.internal can be resolved into something? Or maybe you are running it on GCE? It's hard to tell. I've seen something like this before with some DNS providers (AT&T in the US, for example, is famous for intercepting unknown DNS requests and doing weird things with them).

From our brief research (@0x2b3bfa0 can also add details), it's not that easy to get it out of the exponential backoff. It would actually be interesting to see what it returns in your case.

shcheklein avatar Jan 14 '25 19:01 shcheklein

I was running from home, not GCP. I don't know whether it gave a different, non-retryable error. Since we completely control the compute_engine auth, we can add any timeout or retry args in GoogleCredentials._connect_cloud. If within GCP, this should always happen quickly.

martindurant avatar Jan 14 '25 19:01 martindurant

Since we completely control the compute_engine auth, we can add any timeout or retry args in GoogleCredentials._connect_cloud

I'm not sure I fully understand the exact meaning. Do you mean we can change the lib itself (compute_engine) or do you mean that we can change the way we work with it (GoogleCredentials._connect_cloud)?

I don't know, off the top of my head or from that brief research, a simple way to change the retry count or backoff logic from outside. But we can check one more time. What we can do is change the order, or avoid the metadata server if we are not on GCE (how does it determine that, btw?).

shcheklein avatar Jan 14 '25 19:01 shcheklein

Do you mean we can change the lib itself (compute_engine) or do you mean that we can change the way we work with it (GoogleCredentials._connect_cloud)?

I meant the latter; but I'm not sure that configuring the Session or the Request can do what we want - the retry logic is actually within google.auth.compute_engine._metadata.get(). We would need to override the Request's __call__ to set the default timeout:

--- a/gcsfs/credentials.py
+++ b/gcsfs/credentials.py
@@ -36,6 +36,11 @@ client_config = {
 }


+class TimeoutRequest(Request):
+    def __call__(self, *args, **kwargs):
+        return super().__call__(*args, timeout=1, **kwargs)
+
+
 class GoogleCredentials:
     def __init__(self, project, access, token, check_credentials=None):
         self.scope = "https://www.googleapis.com/auth/devstorage." + access
@@ -93,7 +98,7 @@ class GoogleCredentials:
         self.credentials = gauth.compute_engine.Credentials()
         try:
             with requests.Session() as session:
-                req = Request(session)
+                req = TimeoutRequest(session)
                 self.credentials.refresh(req)
         except gauth.exceptions.RefreshError as error:
             raise ValueError("Invalid gcloud credentials") from error

Of course, I'd need someone to verify that cloud auth still works!

Otherwise, we could either vendor that (nasty!) code, or perhaps run with an explicit timeout (would require a thread or signal).
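The thread-based explicit timeout could be sketched like this (an illustration, not gcsfs code; note that Python threads cannot be killed, so the worker is merely abandoned and this only bounds how long we wait):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

def call_with_timeout(fn, timeout, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread; stop waiting for the
    result after `timeout` seconds. The worker keeps running in the
    background until it finishes on its own."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args, **kwargs).result(timeout=timeout)
    finally:
        # Don't block on the (possibly still-running) worker thread.
        pool.shutdown(wait=False)

if __name__ == "__main__":
    print(call_with_timeout(lambda: "fast", 1.0))  # completes in time
    try:
        call_with_timeout(time.sleep, 0.1, 1)  # too slow: give up waiting
    except FuturesTimeout:
        print("gave up waiting")
```

Wrapping the refresh call this way would cap the wait without touching google-auth's retry internals, at the cost of a leftover background thread when the metadata probe hangs.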

martindurant avatar Jan 14 '25 20:01 martindurant

The previous default timeout is 120s (x5 retries?)

martindurant avatar Jan 14 '25 20:01 martindurant

Yeah, I wonder though what the logic was behind setting those timeouts on the compute-engine side (we could break things if we just override them). Maybe if we check what error code is returned on your machine, and why, we can find a better way?

Another alternative: check whether we can detect that we are running on GCE, and enable cloud auth only in that case?
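A detection along those lines could be sketched as follows. This is purely illustrative, not google-auth's actual check, and as discussed above a DNS provider that hijacks unknown names can fool a probe like this:

```python
import socket

METADATA_HOST = "metadata.google.internal"  # the GCE metadata endpoint

def probably_on_gce(timeout: float = 0.5) -> bool:
    """Cheap heuristic: can the GCE metadata host be reached at all?
    Off GCE the DNS lookup normally fails immediately, so this returns
    False fast instead of accumulating retries."""
    try:
        with socket.create_connection((METADATA_HOST, 80), timeout=timeout):
            return True
    except OSError:  # covers gaierror (DNS failure) and connection timeouts
        return False

if __name__ == "__main__":
    print("on GCE?", probably_on_gce())
```

Gating _connect_cloud on a check like this would skip the expensive metadata-server path entirely on most non-GCE machines.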

shcheklein avatar Jan 14 '25 20:01 shcheklein

Apparently, gcsfs is trying to use metadata server authentication regardless of the is_on_gce return value:

https://github.com/fsspec/gcsfs/blob/8a83230c671b9f1c1f500791e6b2b1810f8919b9/gcsfs/credentials.py#L93

Why does fsspec implement its own "default credentials" lookup mechanism, instead of relying on google.auth.default when necessary?

0x2b3bfa0 avatar Jan 14 '25 20:01 0x2b3bfa0

Why does fsspec implement its own "default credentials" lookup mechanism, instead of relying on google.auth.default when necessary?

This didn't capture all the ways people wanted to connect

is_on_gce

Sorry, where is this? We could skip the cloud connect if we know it will fail.

martindurant avatar Jan 14 '25 21:01 martindurant

This didn't capture all the ways people wanted to connect

😅

Sorry, where is this?

https://github.com/googleapis/google-auth-library-python/blob/c3ea09fd8b9ee8f094263fdc809c59d9dfaa3124/google/auth/compute_engine/_metadata.py#L68

0x2b3bfa0 avatar Jan 15 '25 22:01 0x2b3bfa0

Sorry, where is this?

https://github.com/googleapis/google-auth-library-python/blob/c3ea09fd8b9ee8f094263fdc809c59d9dfaa3124/google/auth/compute_engine/_metadata.py#L68

It looks like ping() in that module also calls the same endpoint with exponential backoff, so I don't suppose it gains us anything. We could copy the idea, I suppose, but I think a reasonable timeout on _connect_cloud is probably more straightforward.

martindurant avatar Jan 16 '25 15:01 martindurant

Would anyone here like to try my suggestion?

martindurant avatar Jan 22 '25 17:01 martindurant