GCS Auth ERROR/Download timeout
ENV:
- Ubuntu 22.04.4 LTS
- a2-ultragpu-8g (8× A100)
- torch==1.13.1
- DDP
- Data resides in GCS, accessed via a service account
I am streaming data through three data loaders, which are used for alternating training of the model. One of the datasets is an <image, text> dataset with 200M+ pairs. I run into the following error whenever num_workers per data loader is >= 4 (a rough sketch of the loader setup follows the traceback):
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/dataset.py", line 1311, in on_exception raise exception File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/dataset.py", line 1361, in _prepare_thread self.prepare_shard(shard_id, False) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/dataset.py", line 1192, in prepare_shard delta = stream.prepare_shard(shard) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 422, in prepare_shard delta += self._prepare_shard_part(raw_info, zip_info, shard.compression) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 396, in _prepare_shard_part self._download_file(raw_info.basename) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 311, in _download_file retry(num_attempts=self.download_retry)( File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/util.py", line 525, in new_func raise e File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/util.py", line 521, in new_func return func(*args, **kwargs) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 312, in <lambda> lambda: download_file(remote, local, self.download_timeout))() File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/storage/download.py", line 542, in download_file download_from_gcs(remote, local) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/storage/download.py", line 197, in download_from_gcs raise ValueError(GCS_ERROR_NO_AUTHENTICATION) ValueError: Either set the environment variables
GCS_KEYand
GCS_SECRET or use any of the methods in https://cloud.google.com/docs/authentication/external/set-up-adc to set up Application Default Credentials. See also https://docs.mosaicml.com/projects/mcli/en/latest/resources/secrets/gcp.html
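For context, each loader is built roughly like the sketch below; the bucket path, cache directory, and batch size are placeholders, not our real values:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Placeholder paths: the real remote is the GCS prefix holding the 200M+
# <image, text> shards; local is the on-disk shard cache.
dataset = StreamingDataset(
    remote="gs://my-bucket/image-text",
    local="/local_disk0/cache/image-text",
    shuffle=True,
    batch_size=256,
    download_retry=2,      # library defaults; the failure is raised inside this retry loop
    download_timeout=60,
)

# The error reproduces as soon as num_workers >= 4 on any of the three loaders.
loader = DataLoader(dataset, batch_size=256, num_workers=4)
```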
The environment has all the necessary permissions to access GCS, and the error goes away when num_workers per data loader is reduced. This seems very closely related to issue #728.
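In case it helps triage, one thing we can try is pinning credentials explicitly in the driver process before the workers are spawned, so every worker inherits them. A minimal sketch using the two auth paths the error message names (the key-file path and HMAC values are placeholders):

```python
import os

# Option A: Application Default Credentials via a service-account key file.
# Setting this before the DataLoader workers are forked means every worker
# process inherits it. The path is a placeholder.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"

# Option B: the HMAC-key variables mentioned in the error message.
# os.environ["GCS_KEY"] = "<hmac-access-key-id>"
# os.environ["GCS_SECRET"] = "<hmac-secret>"
```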
- Will scalability become a problem as we move to more GPU processes?