datachain icon indicating copy to clipboard operation
datachain copied to clipboard

GCP missing creds slows down significantly

Open dmpetrov opened this issue 11 months ago • 7 comments

Description

See quick start.

This command take 20 seconds (sometimes 30) without any progressbar if I don't have any GCP creds setup. If I put anon=True - it works instantly like it suppose to.

meta = DataChain.from_json("gs://datachain-demo/dogs-and-cats/*json", object_name="meta")

This affects all users since we use GCP in most of our examples and, I'm sure, majority of user do not have GCP creds set up.

What's needed: we need to avoid the delay when GCP creds are not set up. Ideally, without specifying anon=True which adds more code in examples and mental overhead for users.

ToDo:

  • [ ] Fix the issue
  • [ ] Remove anon=True from Readme and Get Started

dmpetrov avatar Dec 24 '24 02:12 dmpetrov

This is happening in SOMA co-working for me as well. The reason is the same issue as here https://github.com/iterative/studio/issues/10235 . Google lib has an exponential backoff on trying to reach out to the metadata service. So, when we don't set credentials explicitly to anon it is trying everything it could and thus hitting that exponential backoff. We need to wait for that PR that @0x2b3bfa0 made to land. And even after that users will have to specify an env variable.

I wonder if we can fix fsspec to disable metadata service if we are not on google (GC fsspec has a way to determine that AFAIU) - wdyt @0x2b3bfa0 ?

shcheklein avatar Dec 29 '24 17:12 shcheklein

This is happening in SOMA co-working for me as well

I'm experiencing the same at my home wifi

dmpetrov avatar Dec 30 '24 20:12 dmpetrov

Related: https://github.com/fsspec/filesystem_spec/issues/1768

skshetry avatar Jan 14 '25 05:01 skshetry

@skshetry thanks, I put a comment with the details there. It's a good / relevant discussion indeed.

shcheklein avatar Jan 14 '25 19:01 shcheklein

I wonder if we can fix fsspec to disable metadata service if we are not on google (GC fsspec has a way to determine that AFAIU) - wdyt @0x2b3bfa0 ?

I think exactly https://github.com/iterative/studio/issues/10235#issuecomment-2420957866; I've posted a comment at https://github.com/fsspec/filesystem_spec/issues/1768#issuecomment-2591088700.

0x2b3bfa0 avatar Jan 14 '25 20:01 0x2b3bfa0

The issue here is the expensive check to metadata anytime. The patch below will add an check to simple connect using google credentials so that anon will be applied if necessary.

diff --git a/src/datachain/client/gcs.py b/src/datachain/client/gcs.py
index 03e5245..be9a249 100644
--- a/src/datachain/client/gcs.py
+++ b/src/datachain/client/gcs.py
@@ -16,6 +16,7 @@ from .fsspec import DELIMITER, Client, ResultQueue
 # Patch gcsfs for consistency with s3fs
 GCSFileSystem.set_session = GCSFileSystem._set_session
 PageQueue = asyncio.Queue[Optional[Iterable[dict[str, Any]]]]
+ANONYMOUS_TOKEN = "anon"  # noqa: S105
 
 
 class GCSClient(Client):
@@ -24,12 +25,85 @@ class GCSClient(Client):
     protocol = "gs"
 
     @classmethod
-    def create_fs(cls, **kwargs) -> GCSFileSystem:
+    def _try_authenticate(cls, gcreds, method: str) -> bool:
+        """Attempt to authenticate using the specified method.
+
+        Args:
+            gcreds: Google credentials object
+            method: Authentication method to try
+
+        Returns:
+            bool: True if authentication succeeded, False otherwise
+        """
+        from google.auth.exceptions import GoogleAuthError
+
+        try:
+            gcreds.connect(method=method)
+            return True
+        except (GoogleAuthError, ValueError):
+            # Reset credentials if authentication failed (reverts to 'anon' behavior)
+            gcreds.credentials = None
+            return False
+
+    @classmethod
+    def _get_default_credentials(cls, **kwargs) -> dict[str, Any]:
+        """Get default GCS credentials using various authentication methods.
+
+        Returns:
+            dict: Updated kwargs with appropriate token
+        """
+        from gcsfs.core import DEFAULT_PROJECT
+        from gcsfs.credentials import GoogleCredentials
+        from google.auth.compute_engine._metadata import is_on_gce
+
+        def request_callback(*args, **kwargs):
+            from google.auth.exceptions import TransportError
+
+            raise TransportError("Skip metadata check")
+
+        # If credentials provided in env var, use those
         if os.environ.get("DATACHAIN_GCP_CREDENTIALS"):
             kwargs["token"] = json.loads(os.environ["DATACHAIN_GCP_CREDENTIALS"])
-        if kwargs.pop("anon", False):
-            kwargs["token"] = "anon"  # noqa: S105
+            return kwargs
+
+        # If token is provided, use it
+        if "token" in kwargs:
+            return kwargs
+
+        # If anonymous access requested, force anonymous access
+        if kwargs.get("anon"):
+            kwargs["token"] = ANONYMOUS_TOKEN
+            return kwargs
+
+        # Try various authentication methods
+        gcreds = GoogleCredentials(
+            token=ANONYMOUS_TOKEN,
+            project=DEFAULT_PROJECT,
+            access="full_control",
+        )
+
+        # Define authentication methods to try
+        auth_methods = ["google_default", "cache"]
+        if is_on_gce(request_callback):
+            auth_methods.append("cloud")
+
+        # Try each authentication method
+        for method in auth_methods:
+            if cls._try_authenticate(gcreds, method):
+                return kwargs
 
+        # If no authentication method worked, use anonymous access
+        kwargs["token"] = ANONYMOUS_TOKEN
+        return kwargs
+
+    @classmethod
+    def create_fs(cls, **kwargs) -> GCSFileSystem:
+        """Create a GCS filesystem with appropriate authentication.
+
+        Returns:
+            GCSFileSystem: Authenticated filesystem object
+        """
+        kwargs = cls._get_default_credentials(**kwargs)
         return cast("GCSFileSystem", super().create_fs(**kwargs))
 
     def url(self, path: str, expires: int = 3600, **kwargs) -> str:

amritghimire avatar Mar 20 '25 11:03 amritghimire

The difference in time with the following script with clean .datachain is as:

import time
from datachain import DataChain


start = time.time()
images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg")
images.show()
end = time.time()
print(f"Time taken: {end - start} seconds")

Previous:

Time taken: 27.73566699028015 seconds

Now:

Time taken: 6.101557016372681 seconds

amritghimire avatar Mar 20 '25 12:03 amritghimire