
GCS Auth ERROR/Download timeout

Open rishabhm12 opened this issue 5 months ago • 0 comments

ENV:

  • Ubuntu 22.04.4 LTS
  • a2-ultragpu-8g (8× A100)
  • torch==1.13.1
  • DDP
  • Data resides in GCS; we are using a service account

I am trying to stream data, and I have three data loaders that are used for alternating training of the model. One of the datasets is an <image, text> dataset with 200M+ <image, text> pairs. I run into the following error when num_workers per data loader is >= 4:
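For a sense of scale, the number of processes that may try to authenticate against GCS at once grows multiplicatively with ranks, loaders, and workers. A back-of-the-envelope count, assuming each DataLoader worker on each of the 8 GPU ranks independently opens a GCS connection (an assumption about the setup above, not something stated in the report):

```python
# Rough count of concurrent GCS downloader processes in the reported setup.
# Numbers are taken from the report: 8 GPUs (a2-ultragpu-8g), three data
# loaders used alternately, and num_workers=4 (the threshold where the
# error appears).
gpus = 8
loaders = 3
workers_per_loader = 4

concurrent_downloaders = gpus * loaders * workers_per_loader
print(concurrent_downloaders)  # 96 processes potentially hitting GCS auth at once
```

If each of these processes performs its own credential lookup, a transient failure in any one of them is enough to surface the error, which would be consistent with the problem disappearing when num_workers is reduced.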

```
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/dataset.py", line 1311, in on_exception
    raise exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/dataset.py", line 1361, in _prepare_thread
    self.prepare_shard(shard_id, False)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/dataset.py", line 1192, in prepare_shard
    delta = stream.prepare_shard(shard)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 422, in prepare_shard
    delta += self._prepare_shard_part(raw_info, zip_info, shard.compression)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 396, in _prepare_shard_part
    self._download_file(raw_info.basename)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 311, in _download_file
    retry(num_attempts=self.download_retry)(
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/util.py", line 525, in new_func
    raise e
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/util.py", line 521, in new_func
    return func(*args, **kwargs)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 312, in <lambda>
    lambda: download_file(remote, local, self.download_timeout))()
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/storage/download.py", line 542, in download_file
    download_from_gcs(remote, local)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/storage/download.py", line 197, in download_from_gcs
    raise ValueError(GCS_ERROR_NO_AUTHENTICATION)
ValueError: Either set the environment variables GCS_KEY and GCS_SECRET or use any of the methods in https://cloud.google.com/docs/authentication/external/set-up-adc to set up Application Default Credentials. See also https://docs.mosaicml.com/projects/mcli/en/latest/resources/secrets/gcp.html
```

The environment has all the necessary permissions to access GCS. The error goes away when num_workers per data loader is reduced. This issue seems very closely related to issue #728.
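One workaround suggested by the error message itself is to make the credentials explicit via environment variables before any DataLoader workers are forked, so every worker process inherits them instead of relying on an implicit Application Default Credentials lookup. A minimal sketch, assuming a hypothetical key-file path (substitute your own service-account JSON):

```python
import os

# Hypothetical path to the service-account key file; replace with your own.
SA_KEY_PATH = "/etc/keys/gcs-service-account.json"

# Set this before constructing any StreamingDataset/DataLoader, so that
# forked worker processes inherit it and google-auth can resolve
# Application Default Credentials without extra lookups per worker.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = SA_KEY_PATH

# Alternatively, per the error message, streaming also accepts HMAC-style
# credentials via GCS_KEY / GCS_SECRET:
# os.environ["GCS_KEY"] = "<hmac access id>"
# os.environ["GCS_SECRET"] = "<hmac secret>"

print(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
```

Whether this fully avoids the failure under high worker counts is untested here; it only removes one source of per-worker credential resolution.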

  • Will scalability to more GPU processes become a problem?

rishabhm12 · Sep 05 '24 07:09