GCSFileSystem creation is surprisingly slow
This snippet takes ~24s to run on my laptop in Germany:
```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "fsspec",
#     "gcsfs",
# ]
# ///
from fsspec import url_to_fs
from time import time

url = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'

if __name__ == '__main__':
    start = time()
    _ = url_to_fs(url)
    elapsed = time() - start
    print(elapsed)
```
26s seems a bit long for string parsing -- clearly something else is going on inside this function -- but given the apparent simplicity of the URL, I'm a bit surprised there is anything else going on here. Is this expected behavior, or something that should be fixed?
Aha, further testing indicates that creating `GCSFileSystem` is the culprit. Any ideas how to speed this up?
Specifying `token='anon'` reduces the runtime to a fraction of a second:
```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "fsspec",
#     "gcsfs",
# ]
# ///
from fsspec import url_to_fs
from time import time

url = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'

if __name__ == '__main__':
    start = time()
    _ = url_to_fs(url, token='anon')
    elapsed = time() - start
    print(elapsed)
```
So I'm guessing the credentials search process is consuming most of the 25ish seconds I observed with the first version of this script.
> the credentials search process is consuming most of the 25ish seconds
The default behaviour (unless otherwise specified) is to try various credential verification mechanisms in turn, until one works; `anon` is the fallback when everything else has failed. Unfortunately, there are many possibilities. Maybe there's an opportunity to put very short timeouts on processes that ought to be fast, like `_connect_cloud`, which queries the GCP metadata service and is only reachable from within Google's walls. It would be worthwhile putting timing around the methods in `gcsfs.credentials.GoogleCredentials.connect` to figure out whether one particular method is causing this. For me, I don't see the problem:
```
2025-01-08 10:16:43,247 - gcsfs.credentials - DEBUG - connect -- Connection with method "google_default" failed
Traceback (most recent call last):
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 235, in connect
    self.connect(method=meth)
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 253, in connect
    self.__getattribute__("_connect_" + method)()
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 77, in _connect_google_default
    credentials, project = gauth.default(scopes=[self.scope])
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/_default.py", line 615, in default
    credentials, project_id = checker()
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/_default.py", line 608, in <lambda>
    lambda: _get_explicit_environ_credentials(quota_project_id=quota_project_id),
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/_default.py", line 228, in _get_explicit_environ_credentials
    credentials, project_id = load_credentials_from_file(
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/_default.py", line 116, in load_credentials_from_file
    raise exceptions.DefaultCredentialsError(
google.auth.exceptions.DefaultCredentialsError: File /Users/mdurant/Downloads/it-4397-aol-71cec3094fa0.json was not found.
FAIL google_default 0.0012540817260742188
2025-01-08 10:16:43,249 - gcsfs.credentials - DEBUG - connect -- Connection with method "cache" failed
Traceback (most recent call last):
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 235, in connect
    self.connect(method=meth)
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 253, in connect
    self.__getattribute__("_connect_" + method)()
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 104, in _connect_cache
    raise ValueError("No cached tokens")
ValueError: No cached tokens
FAIL cache 0.00011086463928222656
2025-01-08 10:16:43,298 - gcsfs.credentials - DEBUG - connect -- Connection with method "cloud" failed
Traceback (most recent call last):
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/credentials.py", line 111, in refresh
    self._retrieve_info(request)
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/credentials.py", line 87, in _retrieve_info
    info = _metadata.get_service_account_info(
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/_metadata.py", line 234, in get_service_account_info
    return get(request, path, params={"recursive": "true"})
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/_metadata.py", line 163, in get
    raise exceptions.TransportError(
google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 97, in _connect_cloud
    self.credentials.refresh(req)
  File "/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/google/auth/compute_engine/credentials.py", line 117, in refresh
    six.raise_from(new_exc, caught_exc)
  File "<string>", line 3, in raise_from
google.auth.exceptions.RefreshError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 235, in connect
    self.connect(method=meth)
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 253, in connect
    self.__getattribute__("_connect_" + method)()
  File "/Users/mdurant/code/gcsfs/gcsfs/credentials.py", line 99, in _connect_cloud
    raise ValueError("Invalid gcloud credentials") from error
ValueError: Invalid gcloud credentials
FAIL cloud 0.050685882568359375
2025-01-08 10:16:43,299 - gcsfs.credentials - DEBUG - connect -- Connected with method anon
```
(I did the following to get the times:

```diff
--- a/gcsfs/credentials.py
+++ b/gcsfs/credentials.py
@@ -229,6 +229,8 @@ class GoogleCredentials:
             self._connect_token(method)
         elif method is None:
             for meth in ["google_default", "cache", "cloud", "anon"]:
+                import time
+                t = time.time()
                 try:
                     self.connect(method=meth)
                     logger.debug("Connected with method %s", meth)
@@ -242,6 +244,7 @@ class GoogleCredentials:
                     # Reset credentials if they were set but the authentication failed
                     # (reverts to 'anon' behavior)
                     self.credentials = None
+                    print("FAIL", meth, time.time() - t)
                 else:
                     # Since the 'anon' connection method should always succeed,
```

)
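As an aside for anyone reproducing this: the `gcsfs.credentials - DEBUG` lines in the log above come from Python's standard `logging` module, so you don't strictly need a patch to see which method is failing. A minimal sketch (logger name taken from the log output above):

```python
import logging

# Show per-method connection failures from gcsfs without enabling
# DEBUG noise for every other library: the root handler stays at
# INFO, but the gcsfs.credentials logger is lowered to DEBUG so its
# 'Connection with method "..." failed' records reach the console.
logging.basicConfig(level=logging.INFO)
logging.getLogger("gcsfs.credentials").setLevel(logging.DEBUG)
```

Running the first script from this issue with these two lines in place prints each failed method as it happens, which makes the slow one easy to spot.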
I noticed the same issue in universal-pathlib, where the gcs tests were very slow; it only occurred without a working internet connection, because then all the timeouts accumulated.
Setting `token='anon'` restored the short runtime of the test suite when running without internet access: https://github.com/fsspec/universal_pathlib/pull/312/commits/d034637bb038d93f2ff98697dc2a326d7bcc7c57
(the test suite runs against a gcs docker container)
I am unfortunately finding that running the default connection methods still happens fast and falls back to anon, even when I turn off the network.
@ap-- and I are both in Germany, so perhaps regional connectivity is part of the story here.
Thanks folks for bringing this up. It has been biting us in different environments for a while now. I'm able to reproduce it easily at home, in the office, and in many other places.
- Related PR: https://github.com/googleapis/google-auth-library-python/pull/1610 - our team put together this fix, to at least respect an environment variable (it's actually needed to completely disable metadata server auth even when we are on GCE, since a service account is not guaranteed to be set up and it fails even on GCE with errors). This env variable might help `fsspec` to work as well.
- Here is a relevant discussion from just 2 weeks ago: https://github.com/iterative/datachain/issues/740#issuecomment-2564793558
> I am unfortunately finding that running the default connection methods still happens fast and falls back to anon, even when I turn off the network.
@martindurant I think the difference should be here - https://github.com/googleapis/google-auth-library-python/blob/main/google/auth/compute_engine/_metadata.py#L207-L227
For me, I'm getting immediately into:

```python
except exceptions.TransportError as e:
    _LOGGER.warning(
        "Compute Engine Metadata server unavailable on "
        "attempt %s of %s. Reason: %s",
        attempt,
        retry_count,
        e,
    )
```

with the message: `Failed to resolve 'metadata.google.internal' ([Errno 8] nodename nor servname provided, or not known)`
It means it goes into retries with exponential backoff.

In your case, for some reason metadata.google.internal can be resolved into something? Or maybe you are running it on GCE? It's hard to tell. I've seen something like this before with some DNS providers (AT&T in the US, for example, is famous for intercepting unknown DNS requests and doing weird stuff with them).

From our brief research (@0x2b3bfa0 can also add details) - it's not that easy to hack our way out of the exponential backoff from outside. It would actually be interesting to see what it returns in your case.
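For context, the shape of that retry loop looks roughly like this (an illustrative stdlib-only sketch with hypothetical names `get_with_backoff`/`fetch`, not the actual google-auth code): each failed attempt waits longer before retrying, so off-GCE the delays add up before the caller finally sees the error.

```python
import time

def get_with_backoff(fetch, retries=3, base_delay=1.0):
    """Call fetch(), retrying with exponential backoff on failure.

    Mimics the pattern in google.auth.compute_engine._metadata.get():
    when the metadata server is unreachable, every attempt fails, and
    the sleeps between attempts (base_delay, 2*base_delay, ...) are
    what make filesystem creation so slow outside GCP.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except OSError:  # stand-in for the library's TransportError
            if attempt == retries:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```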
I was running from home, not GCP. I don't know whether it gave a different, non-retryable error. Since we completely control the compute_engine auth, we can add any timeout or retry args in `GoogleCredentials._connect_cloud`. If within GCP, this should always happen quickly.
> Since we completely control the compute_engine auth, we can add any timeout or retry args in GoogleCredentials._connect_cloud
I'm not sure I fully understand the exact meaning. Do you mean we can change the lib itself (compute_engine), or do you mean that we can change the way we work with it (`GoogleCredentials._connect_cloud`)?
I don't know, off the top of my head and from that brief research, a simple way to change the retry count or backoff logic from outside. But we can check one more time. What we can do is change the order, or avoid the metadata server if we are not on GCE (how does it determine that, btw?).
> Do you mean we can change the lib itself (compute_engine) or do you mean that we can change the way we work with it (GoogleCredentials._connect_cloud)?
I had meant the latter; but I'm not sure that configuring the Session or Request can do what we want - the retry logic is actually within `google.auth.compute_engine._metadata.get()`. We would need to override the Request's `__call__` to set the default timeout:
```diff
--- a/gcsfs/credentials.py
+++ b/gcsfs/credentials.py
@@ -36,6 +36,11 @@ client_config = {
 }
 
 
+class TimeoutRequest(Request):
+    def __call__(self, *args, **kwargs):
+        return super().__call__(*args, timeout=1, **kwargs)
+
+
 class GoogleCredentials:
     def __init__(self, project, access, token, check_credentials=None):
         self.scope = "https://www.googleapis.com/auth/devstorage." + access
@@ -93,7 +98,7 @@ class GoogleCredentials:
             self.credentials = gauth.compute_engine.Credentials()
             try:
                 with requests.Session() as session:
-                    req = Request(session)
+                    req = TimeoutRequest(session)
                     self.credentials.refresh(req)
             except gauth.exceptions.RefreshError as error:
                 raise ValueError("Invalid gcloud credentials") from error
```
Of course, I'd need someone to verify that cloud auth still works!
Otherwise, we could either vendor that (nasty!) code, or perhaps run with an explicit timeout (would require a thread or signal).
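The "explicit timeout with a thread" option could look roughly like this (a sketch with a hypothetical `call_with_timeout` helper, not anything in gcsfs today; note the worker thread is not killed, we simply stop waiting for it):

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def call_with_timeout(func, seconds, *args, **kwargs):
    # Run func in a worker thread and give up after `seconds`.
    # The thread may keep running in the background after a timeout;
    # abandoning it is enough to keep filesystem creation fast.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        raise ValueError("credential check timed out") from None
    finally:
        pool.shutdown(wait=False)
```

Something like `call_with_timeout(self.credentials.refresh, 1.0, req)` inside `_connect_cloud` would then cap the metadata-server probe regardless of google-auth's internal retries.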
The previous default timeout is 120s (x5 retries?)
Yeah, though I wonder what the logic was behind setting those timeouts on the compute engine side (we could break things if we just override them). Maybe if we check what error code is returned on your machine, and why, we can find a better way?
Another alternative: check whether we can detect that we are running on GCE and enable cloud auth only in that case?
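A naive, stdlib-only sketch of such a detection check (the helper `probably_on_gce` is hypothetical and purely illustrative; it only tests DNS resolution, and as the AT&T example above shows, some resolvers wildcard unknown names, so a robust check also has to ping the metadata HTTP endpoint, which is what google-auth's own detection does):

```python
import socket

def probably_on_gce(host="metadata.google.internal"):
    # Best-effort check: can the GCE metadata hostname be resolved?
    # Off GCE this normally fails quickly with a DNS error, but a
    # wildcarding ISP resolver can produce a false positive, so this
    # is only a cheap first filter, not a definitive answer.
    try:
        socket.getaddrinfo(host, 80)
        return True
    except socket.gaierror:
        return False
```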
Apparently, gcsfs is trying to use metadata server authentication regardless of the is_on_gce return value:
https://github.com/fsspec/gcsfs/blob/8a83230c671b9f1c1f500791e6b2b1810f8919b9/gcsfs/credentials.py#L93
Why does fsspec implement its own "default credentials" lookup mechanism, instead of relying on google.auth.default when necessary?
> Why does fsspec implement its own "default credentials" lookup mechanism, instead of relying on google.auth.default when necessary?
This didn't capture all the ways people wanted to connect
> `is_on_gce`
Sorry, where is this? We could skip the cloud connect if we know it will fail.
> This didn't capture all the ways people wanted to connect
😅
> Sorry, where is this?
https://github.com/googleapis/google-auth-library-python/blob/c3ea09fd8b9ee8f094263fdc809c59d9dfaa3124/google/auth/compute_engine/_metadata.py#L68
> Sorry, where is this?
>
> https://github.com/googleapis/google-auth-library-python/blob/c3ea09fd8b9ee8f094263fdc809c59d9dfaa3124/google/auth/compute_engine/_metadata.py#L68
It looks like `ping()` in that module also calls the same endpoint with exponential backoff, so I don't suppose it gets us anything. We could copy the idea, I suppose, but I think a reasonable timeout on `_connect_cloud` is probably more straightforward.
Would anyone here like to try my suggestion?