dask-cloudprovider
GCP ValueError: Service account info was not in the expected format, missing fields token_uri
Trying to create a GCPCluster after specifying my GCP projectid in ~/.config/dask/cloudprovider.yaml, I get the error below: ValueError: Service account info was not in the expected format, missing fields token_uri.
from dask_cloudprovider.gcp import GCPCluster
from dask.distributed import wait
cluster = GCPCluster(
    zone="us-central1-a",
    machine_type="n1-standard-8",
    n_workers=1,
    docker_image="rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8",
    worker_class="dask_cuda.CUDAWorker",
    env_vars={"EXTRA_PIP_PACKAGES": "gcsfs"}
)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/google/auth/_default.py in load_credentials_from_file(filename, scopes, quota_project_id)
134 credentials = service_account.Credentials.from_service_account_info(
--> 135 info, scopes=scopes
136 )
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/google/oauth2/service_account.py in from_service_account_info(cls, info, **kwargs)
210 signer = _service_account_info.from_dict(
--> 211 info, require=["client_email", "token_uri"]
212 )
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/google/auth/_service_account_info.py in from_dict(data, require)
50 "Service account info was not in the expected format, missing "
---> 51 "fields {}.".format(", ".join(missing))
52 )
ValueError: Service account info was not in the expected format, missing fields token_uri.
The above exception was the direct cause of the following exception:
DefaultCredentialsError Traceback (most recent call last)
<ipython-input-23-066423d9340a> in <module>
17 docker_image="rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8",
18 worker_class="dask_cuda.CUDAWorker",
---> 19 env_vars={"EXTRA_PIP_PACKAGES": "gcsfs"}
20 )
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask_cloudprovider/gcp/instances.py in __init__(self, projectid, zone, machine_type, source_image, docker_image, ngpus, gpu_type, filesystem_size, auto_shutdown, bootstrap, **kwargs)
512 ):
513
--> 514 self.compute = GCPCompute()
515
516 self.config = dask.config.get("cloudprovider.gcp", {})
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask_cloudprovider/gcp/instances.py in __init__(self)
551
552 def __init__(self):
--> 553 self._compute = self.refresh_client()
554
555 def refresh_client(self):
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask_cloudprovider/gcp/instances.py in refresh_client(self)
570 # take first row
571 f_.write(creds_rows[0][1])
--> 572 creds, _ = google.auth.load_credentials_from_file(filename=f)
573 return googleapiclient.discovery.build("compute", "v1", credentials=creds)
574
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/google/auth/_default.py in load_credentials_from_file(filename, scopes, quota_project_id)
138 msg = "Failed to load service account credentials from {}".format(filename)
139 new_exc = exceptions.DefaultCredentialsError(msg, caught_exc)
--> 140 six.raise_from(new_exc, caught_exc)
141 if quota_project_id:
142 credentials = credentials.with_quota_project(quota_project_id)
~/.local/lib/python3.7/site-packages/six.py in raise_from(value, from_value)
DefaultCredentialsError: ('Failed to load service account credentials from /tmp/tmp79wsxl8a.', ValueError('Service account info was not in the expected format, missing fields token_uri.'))
Anything else we need to know?:
Environment:
- Dask version: 2.30.0
- Python version: 3.7.9
- Operating System: Ubuntu 18.04.4 LTS
- Install method (conda, pip, source): conda
A bit more context: I have set my GCP credentials using gcloud auth login in the terminal. The GCP documentation seems to suggest that you can either use gcloud auth login or use a service account and pass credentials via export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json.
But even after using gcloud auth login I seem to be getting an error specific to service account credentials. I am passing service account credentials but perhaps making a mistake with them.
Yes, you should be able to use either gcloud auth or the env var. When authenticating with gcloud auth, it should drop a sqlite db here: ~/.config/gcloud/credentials.db. What version of gcloud are you using?
Thanks @quasiben. I can see the file with ls -lah ~/.config/gcloud/credentials.db. Where does this file get used?
$ gcloud --version
Google Cloud SDK 303.0.0
alpha 2020.07.24
beta 2020.07.24
bq 2.0.58
core 2020.07.24
gsutil 4.52
kubectl 1.15.11
I am using Google Cloud SDK 315.0.0 -- at some point Google made the switch from pickled JSON objects to sqlite, though I am not sure when that happened.
This is all used in the auth section of cloudprovider:
https://github.com/dask/dask-cloudprovider/blob/53d3c92098ff58029d1d98041b38d3eebf9c7713/dask_cloudprovider/gcp/instances.py#L549-L573
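For context, that linked block roughly does the following (a paraphrased sketch rather than the exact source; the sqlite table name and paths are assumptions taken from the linked lines and the defaults mentioned above): it reads the first row out of gcloud's credentials.db, writes that blob to a temporary file, and asks google.auth to load it as a credentials file. If that row is not service-account style JSON, you get the "missing fields token_uri" error above.
import sqlite3
import tempfile

import google.auth

# Paraphrased sketch of GCPCompute.refresh_client (see permalink above).
# The credentials.db layout is gcloud-internal; the table name below is
# taken from the linked code, not a documented API.
db = sqlite3.connect("/home/user/.config/gcloud/credentials.db")
creds_rows = db.execute("SELECT * FROM credentials").fetchall()

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(creds_rows[0][1])  # first row is assumed to hold the credential JSON blob

# This is the call that raises "missing fields token_uri" when the blob
# is not a service-account style credential.
creds, _ = google.auth.load_credentials_from_file(filename=f.name)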
If you are using a conda env you could upgrade easily with https://anaconda.org/conda-forge/google-cloud-sdk ?
Updated to Google Cloud SDK 319.0.0 but still hitting the same issue: ValueError: Service account info was not in the expected format, missing fields token_uri
Hmm, can you try re-authenticating with the new version?
yes, re-authenticated but same issue.
Thanks for your patience here @roarjn -- maybe we need to activate a service account associated with your login?
gcloud auth activate-service-account -- from https://cloud.google.com/iam/docs/creating-managing-service-account-keys
@bradmiro do you happen to know this off hand ? If not, no worries
Hey @roarjn, sorry you're having issues! I have two preliminary suggestions:
- Can you try manually deleting the ~/.config/gcloud/credentials.db file and re-authenticating?
- If that didn't work, can you try this with a service account (following the link @quasiben provided) and setting GOOGLE_APPLICATION_CREDENTIALS in your environment, pointing to your new key? (See the sketch after this list.)
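For the second option, a minimal sketch of what that looks like (the key path is a placeholder for wherever you save the downloaded service-account key; set the variable before creating the cluster):
import os

from dask_cloudprovider.gcp import GCPCluster

# Placeholder path: point this at the service-account key you download
# after following the link above. Equivalent to
# `export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json` in the shell.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials.json"

cluster = GCPCluster(
    zone="us-central1-a",
    machine_type="n1-standard-8",
    n_workers=1,
)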
Thanks for chiming in @bradmiro !
Thank you so much @bradmiro and @quasiben. Brad's suggestion #1 worked after I deleted the ~/.config/gcloud/credentials.db and re-authenticated:
rm ~/.config/gcloud/credentials.db
gcloud auth login
Glad it's working for you now @roarjn !
Huzzah!!!
Closing now.
Couple more issues I am running into. Appreciate any suggestions.
- Trying to read_csv from GCS with the steps below, I see this error; the corresponding read from S3 works in my setup.
data = dask_cudf.read_csv("gs://airlines-2019.csv")
gcsfs.utils.HttpError: Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.
- Also, I am looking to specify a GPU machine_type. For an A100 instance I can see the machine type can be machine_type="a2-highgpu-2g". I am trying to find the machine_type for a T4 GCP instance. Is there a way to specify accelerator_type in GCPCluster? Tried the below:
cluster = GCPCluster(
    zone="us-central1-a",
    machine_type="n1-standard-8",
    # machine_type="a2-highgpu-1g",  # launches an A100 GPU instance
    gpu_type="nvidia-tesla-t4",  # Test for a T4 instance
    n_workers=1,
    docker_image="rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8",
    worker_class="dask_cuda.CUDAWorker",
    env_vars={"EXTRA_PIP_PACKAGES": "gcsfs"}
)
which resulted in:
TypeError: Parameter "project" value "" does not match the pattern "(?:(?:[-a-z0-9]{1,63}\.)*(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?):)?(?:[0-9]{1,19}|(?:[a-z0-9](?:[-a-z0-9]{0,61}[a-z0-9])?))"
gcsfs has a number of credential options, I think you might only need token=cloud. Check the gcsfs creds section:
https://gcsfs.readthedocs.io/en/latest/index.html#credentials
df = pd.read_excel("gcs://bucket/path/file.xls",
                   storage_options={"token": "cloud"})
If that doesn't work I think we can refer to some solution listed in this issue: https://github.com/dask/gcsfs/issues/231
For #2 you may be missing projectid: https://github.com/dask/dask-cloudprovider/blob/53d3c92098ff58029d1d98041b38d3eebf9c7713/dask_cloudprovider/gcp/instances.py#L356-L365
I agree with @bradmiro it looks like your project ID is not correct or missing.
Once you have that fixed you are on the right track with configuring accelerators. Here are a few additional comments:
- The a2 instance type family implies A100 GPUs; this is tightly coupled.
- Other instance families such as n1 can have accelerators assigned optionally, so you need to use the gpu_type and ngpus arguments to specify which accelerator to use and how many (see the sketch after this list).
- You can list available accelerators with gcloud compute accelerator-types list.
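Putting that together for a T4, a sketch along these lines should work (projectid is a placeholder for your own project; the other values mirror your earlier config):
from dask_cloudprovider.gcp import GCPCluster

cluster = GCPCluster(
    projectid="your-project-id",   # placeholder: your GCP project ID
    zone="us-central1-a",
    machine_type="n1-standard-8",  # n1 family: accelerators are attached separately
    gpu_type="nvidia-tesla-t4",    # pick a type from `gcloud compute accelerator-types list`
    ngpus=1,                       # how many accelerators per instance
    n_workers=1,
    docker_image="rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8",
    worker_class="dask_cuda.CUDAWorker",
    env_vars={"EXTRA_PIP_PACKAGES": "gcsfs"},
)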
Appreciate the help again.
Couple of observations:
- The NVIDIA driver comes up ~4-5 minutes after launch, so users will need to be a bit patient: the driver installation is still running even if you can already ssh to the worker.
- Same with the docker image you are using - it might take a while to be available (even if you can ssh to the scheduler/worker right away and look for your docker image).
I am no longer getting the GCPCluster launch error with the config below: https://github.com/dask/dask-cloudprovider/issues/211#issuecomment-740783677
Still having issues trying to read from GCS. Here are the steps I took and the current error trace:
Created file ~/.config/gcloud/application_default_credentials.json using gcloud auth application-default login as per https://github.com/dask/gcsfs/issues/231#issuecomment-607523449
# Create dask-cloudprovider GCPCluster - this config works with T4s on scheduler/worker
from dask_cloudprovider.gcp import GCPCluster
from dask.distributed import wait
cluster = GCPCluster(
    projectid="myproject",
    zone="us-west1-a",
    machine_type="n1-standard-4",
    gpu_type="nvidia-tesla-t4",  # Test for a T4 instance
    ngpus=1,
    n_workers=1,
    docker_image="rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8",
    worker_class="dask_cuda.CUDAWorker",
    env_vars={"EXTRA_PIP_PACKAGES": "gcsfs"}
)
I can see my GCS bucket here:
from gcsfs.core import GCSFileSystem
gcs = GCSFileSystem('my_project', token='/home/user/.config/gcloud/application_default_credentials.json')
gcs.ls('gs://my_project')
-- displays files in my bucket --
data = dask_cudf.read_csv("gs://crisp-sa/airlines-2019.csv", storage_options={"token": "/home/user/.config/gcloud/application_default_credentials.json"}
-- read_csv seems to work - no error --
data.head()
-- get this error trace --
---------------------------------------------------------------------------
KilledWorker Traceback (most recent call last)
<ipython-input-73-80c7d069447b> in <module>
----> 1 data.head(5)
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask/dataframe/core.py in head(self, n, npartitions, compute)
1004 Whether to compute the result, default is True.
1005 """
-> 1006 return self._head(n=n, npartitions=npartitions, compute=compute, safe=True)
1007
1008 def _head(self, n, npartitions, compute, safe):
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask/dataframe/core.py in _head(self, n, npartitions, compute, safe)
1037
1038 if compute:
-> 1039 result = result.compute()
1040 return result
1041
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask/base.py in compute(self, **kwargs)
165 dask.base.compute
166 """
--> 167 (result,) = compute(self, traverse=False, **kwargs)
168 return result
169
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask/base.py in compute(*args, **kwargs)
450 postcomputes.append(x.__dask_postcompute__())
451
--> 452 results = schedule(dsk, keys, **kwargs)
453 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
454
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2723 should_rejoin = False
2724 try:
-> 2725 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
2726 finally:
2727 for f in futures.values():
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1990 direct=direct,
1991 local_worker=local_worker,
-> 1992 asynchronous=asynchronous,
1993 )
1994
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
831 else:
832 return sync(
--> 833 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
834 )
835
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
338 if error[0]:
339 typ, exc, tb = error[0]
--> 340 raise exc.with_traceback(tb)
341 else:
342 return result[0]
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/utils.py in f()
322 if callback_timeout is not None:
323 future = asyncio.wait_for(future, callback_timeout)
--> 324 result[0] = yield future
325 except Exception as exc:
326 error[0] = sys.exc_info()
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1849 exc = CancelledError(key)
1850 else:
-> 1851 raise exception.with_traceback(traceback)
1852 raise exc
1853 if errors == "skip":
KilledWorker: ("('read-csv-head-1-5-read-csv-fc59da1ea42a6c356ff1773434f08a44', 0)", <Worker 'tcp://10.138.15.199:45015', name: dask-384b5a7d-worker-ca3ef9fd, memory: 0, processing: 1>)
Interestingly this works:
data.shape
(Delayed('int-5bf13165-6210-4124-a716-0c36ab4ab268'), 26)
The speed you observe is expected. You can cache a lot of this with Packer, but the default behaviour is to bootstrap each machine every time.
I expect the issue you are seeing is because the path /home/user/.config/gcloud/application_default_credentials.json doesn't exist on the workers.
You could use a code snippet like this to copy your credentials to the workers.
def write_credentials(credentials):
    with open("/tmp/application_default_credentials.json", "w+") as wfh:
        wfh.write(credentials)

with open("/home/user/.config/gcloud/application_default_credentials.json") as fh:
    client.run(write_credentials, fh.read())
Then you would set the location of the worker credentials file in your read_csv call.
data = dask_cudf.read_csv("gs://crisp-sa/airlines-2019.csv", storage_options={"token": "/tmp/application_default_credentials.json"}
You could probably do this with client.upload_file too but I think that's intended for uploading code so I'm not sure how you would figure out the path to the file on the workers.
Also in the case where you see KilledWorker you may find cluster.get_logs() helpful to see more about what happened.
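For example, a quick way to dump those logs (assuming get_logs() returns its usual dict-like mapping of component name to log text):
# Print scheduler and worker logs after a KilledWorker to see what went wrong.
for name, log in cluster.get_logs().items():
    print(f"--- {name} ---")
    print(log)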
Thanks @jacobtomlinson. Trying the below I get this error. Any suggestions?
def write_credentials(credentials):
    with open("./application_default_credentials.json", "w+") as wfh:
        wfh.write(credentials)

with open("/home/user/.config/gcloud/application_default_credentials.json") as fh:
    client.run(write_credentials, fh.read())
Exception: an integer is required (got type bytes)
Can you post the full traceback?
Below is the trace. When I ssh to the worker and check the Python version, it looks like it's not installed.
[email protected]:~$ python --version
-bash: python: command not found
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-35-66777a445406> in <module>
4
5 with open("/home/dgxuser/.config/gcloud/application_default_credentials.json") as fh:
----> 6 client.run(write_credentials)#, fh.read())
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in run(self, function, *args, **kwargs)
2510 >>> c.run(print_state, wait=False) # doctest: +SKIP
2511 """
-> 2512 return self.sync(self._run, function, *args, **kwargs)
2513
2514 def run_coroutine(self, function, *args, **kwargs):
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
831 else:
832 return sync(
--> 833 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
834 )
835
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
338 if error[0]:
339 typ, exc, tb = error[0]
--> 340 raise exc.with_traceback(tb)
341 else:
342 return result[0]
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/utils.py in f()
322 if callback_timeout is not None:
323 future = asyncio.wait_for(future, callback_timeout)
--> 324 result[0] = yield future
325 except Exception as exc:
326 error[0] = sys.exc_info()
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in _run(self, function, nanny, workers, wait, *args, **kwargs)
2439 ),
2440 workers=workers,
-> 2441 nanny=nanny,
2442 )
2443 results = {}
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/core.py in send_recv_from_rpc(**kwargs)
881 name, comm.name = comm.name, "ConnectionPool." + key
882 try:
--> 883 result = await send_recv(comm=comm, op=key, **kwargs)
884 finally:
885 self.pool.reuse(self.addr, comm)
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/core.py in send_recv(comm, reply, serializers, deserializers, **kwargs)
680 if comm.deserialize:
681 typ, exc, tb = clean_exception(**response)
--> 682 raise exc.with_traceback(tb)
683 else:
684 raise Exception(response["text"])
/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/core.py in handle_comm()
/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/scheduler.py in broadcast()
/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py in All()
/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/scheduler.py in send_message()
/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/core.py in send_recv()
Exception: an integer is required (got type bytes)
> Below is the trace. When I ssh to worker and try to test python version it looks like it's not installed.
Remember, you are starting up a docker container so you need to exec into the container.
Is it possible that writing inside the docker container is causing a problem but we don't see the exception nicely? Can you try writing to /tmp/NAME.json?
I am able to manually copy the ./application_default_credentials.json file onto the worker nodes; it needs to be at /tmp/application_default_credentials.json inside the docker container running on the workers. It works once this file is present.
Is there a more elegant approach to copying the json file so the docker container can access it?