
GCP ValueError: Service account info was not in the expected format, missing fields token_uri

Open · roarjn opened this issue Dec 07 '20 · 25 comments

Trying to create a GCPCluster after specifying my GCP projectid in ~/.config/dask/cloudprovider.yaml, I get the error below: ValueError: Service account info was not in the expected format, missing fields token_uri.

from dask_cloudprovider.gcp import GCPCluster
from dask.distributed import wait

cluster = GCPCluster(
    zone="us-central1-a",
    machine_type="n1-standard-8",
    n_workers=1,
    docker_image="rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8",
    worker_class="dask_cuda.CUDAWorker",
    env_vars={"EXTRA_PIP_PACKAGES": "gcsfs"}
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/anaconda3/envs/daskcp/lib/python3.7/site-packages/google/auth/_default.py in load_credentials_from_file(filename, scopes, quota_project_id)
    134             credentials = service_account.Credentials.from_service_account_info(
--> 135                 info, scopes=scopes
    136             )

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/google/oauth2/service_account.py in from_service_account_info(cls, info, **kwargs)
    210         signer = _service_account_info.from_dict(
--> 211             info, require=["client_email", "token_uri"]
    212         )

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/google/auth/_service_account_info.py in from_dict(data, require)
     50             "Service account info was not in the expected format, missing "
---> 51             "fields {}.".format(", ".join(missing))
     52         )

ValueError: Service account info was not in the expected format, missing fields token_uri.

The above exception was the direct cause of the following exception:

DefaultCredentialsError                   Traceback (most recent call last)
<ipython-input-23-066423d9340a> in <module>
     17     docker_image="rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8",
     18     worker_class="dask_cuda.CUDAWorker",
---> 19     env_vars={"EXTRA_PIP_PACKAGES": "gcsfs"}
     20 )

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask_cloudprovider/gcp/instances.py in __init__(self, projectid, zone, machine_type, source_image, docker_image, ngpus, gpu_type, filesystem_size, auto_shutdown, bootstrap, **kwargs)
    512     ):
    513 
--> 514         self.compute = GCPCompute()
    515 
    516         self.config = dask.config.get("cloudprovider.gcp", {})

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask_cloudprovider/gcp/instances.py in __init__(self)
    551 
    552     def __init__(self):
--> 553         self._compute = self.refresh_client()
    554 
    555     def refresh_client(self):

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask_cloudprovider/gcp/instances.py in refresh_client(self)
    570                     # take first row
    571                     f_.write(creds_rows[0][1])
--> 572                 creds, _ = google.auth.load_credentials_from_file(filename=f)
    573             return googleapiclient.discovery.build("compute", "v1", credentials=creds)
    574 

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/google/auth/_default.py in load_credentials_from_file(filename, scopes, quota_project_id)
    138             msg = "Failed to load service account credentials from {}".format(filename)
    139             new_exc = exceptions.DefaultCredentialsError(msg, caught_exc)
--> 140             six.raise_from(new_exc, caught_exc)
    141         if quota_project_id:
    142             credentials = credentials.with_quota_project(quota_project_id)

~/.local/lib/python3.7/site-packages/six.py in raise_from(value, from_value)

DefaultCredentialsError: ('Failed to load service account credentials from /tmp/tmp79wsxl8a.', ValueError('Service account info was not in the expected format, missing fields token_uri.'))

Anything else we need to know?:

Environment:

  • Dask version: 2.30.0
  • Python version: 3.7.9
  • Operating System: Ubuntu 18.04.4 LTS
  • Install method (conda, pip, source): conda

roarjn avatar Dec 07 '20 17:12 roarjn

Bit more context: I have set my GCP credentials using gcloud auth login in the terminal. The GCP documentation seems to suggest that you can either use gcloud auth login or use a service account and pass its credentials via export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json.

But even after using gcloud auth login I seem to be getting an error specific to service account credentials. I am also passing service account credentials, but perhaps I am making a mistake with them.
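A quick way to see which credentials the standard google-auth chain picks up (a minimal sketch, assuming google-auth is installed; GOOGLE_APPLICATION_CREDENTIALS takes precedence if set, otherwise the gcloud application-default credentials file is used):

import google.auth

# Raises DefaultCredentialsError if no application default credentials are configured
creds, project = google.auth.default()
print(type(creds).__name__, project)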

roarjn avatar Dec 07 '20 20:12 roarjn

Yes, you should be able to use either gcloud auth or the env var. When authenticating with gcloud auth, it should drop a sqlite db here: ~/.config/gcloud/credentials.db. What version of gcloud are you using?

quasiben avatar Dec 07 '20 20:12 quasiben

Thanks @quasiben. I see the file with ls -lah ~/.config/gcloud/credentials.db. Where does this file get used?

$ gcloud --version
Google Cloud SDK 303.0.0
alpha 2020.07.24
beta 2020.07.24
bq 2.0.58
core 2020.07.24
gsutil 4.52
kubectl 1.15.11

roarjn avatar Dec 07 '20 20:12 roarjn

I am using Google Cloud SDK 315.0.0 -- at some point Google made the switch from pickled JSON objects to sqlite, though I am not sure when that happened.

This is all used in the auth section of cloudprovider:

https://github.com/dask/dask-cloudprovider/blob/53d3c92098ff58029d1d98041b38d3eebf9c7713/dask_cloudprovider/gcp/instances.py#L549-L573
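For context, here is a rough sketch (not the exact dask-cloudprovider code) of what that auth path does when GOOGLE_APPLICATION_CREDENTIALS is not set: read the first row of gcloud's credentials.db, write it to a temporary JSON file, and load it with google-auth. A row in an old or unexpected format missing token_uri would trigger the ValueError above.

import os
import sqlite3
import tempfile

import google.auth

# Read the first stored credential from gcloud's sqlite database
db_path = os.path.expanduser("~/.config/gcloud/credentials.db")
with sqlite3.connect(db_path) as conn:
    creds_rows = conn.execute("SELECT * FROM credentials").fetchall()

# Write that row to a temporary JSON file and load it with google-auth,
# mirroring the creds_rows[0][1] step visible in the traceback above
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(creds_rows[0][1])

creds, _ = google.auth.load_credentials_from_file(filename=f.name)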

If you are using a conda env you could upgrade easily with https://anaconda.org/conda-forge/google-cloud-sdk.

quasiben avatar Dec 07 '20 21:12 quasiben

Updated to Google Cloud SDK 319.0.0 but still hitting this issue: ValueError: Service account info was not in the expected format, missing fields token_uri

roarjn avatar Dec 07 '20 21:12 roarjn

Hmm, can you try re-authenticating with the new version ?

quasiben avatar Dec 07 '20 22:12 quasiben

yes, re-authenticated but same issue.

roarjn avatar Dec 08 '20 00:12 roarjn

Thanks for your patience here @roarjn -- maybe we need to activate a service account associated with your login?

gcloud auth activate-service-account -- from https://cloud.google.com/iam/docs/creating-managing-service-account-keys

@bradmiro do you happen to know this off hand ? If not, no worries

quasiben avatar Dec 08 '20 02:12 quasiben

Hey @roarjn, sorry you're having issues! I have two preliminary suggestions:

  1. Can you try manually deleting the ~/.config/gcloud/credentials.db file and re-authenticating?
  2. If that doesn't work, can you try this with a service account (following the link @quasiben provided) and setting GOOGLE_APPLICATION_CREDENTIALS in your environment, pointing to your new key?

bradmiro avatar Dec 08 '20 05:12 bradmiro

Thanks for chiming in @bradmiro !

quasiben avatar Dec 08 '20 14:12 quasiben

Thank you so much @bradmiro and @quasiben. Brad's suggestion #1 worked after I deleted the ~/.config/gcloud/credentials.db and re-authenticated:

rm ~/.config/gcloud/credentials.db
gcloud auth login

roarjn avatar Dec 08 '20 16:12 roarjn

Glad it's working for you now @roarjn !

bradmiro avatar Dec 08 '20 16:12 bradmiro

Huzzah!!!

Closing now.

quasiben avatar Dec 08 '20 16:12 quasiben

Couple more issues I am running into. Appreciate any suggestions.

  1. Trying to read_csv from GCS with the steps below, I see this error (the corresponding read from S3 works in my setup): data = dask_cudf.read_csv("gs://airlines-2019.csv")

gcsfs.utils.HttpError: Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.

  2. Also, I am looking to specify a GPU machine_type. For an A100 instance I can see the machine type can be machine_type="a2-highgpu-2g". I am trying to find the machine_type for a T4 GCP instance. Is there a way to specify accelerator_type in GCPCluster? Tried the below:
cluster = GCPCluster(
    zone="us-central1-a",
    machine_type="n1-standard-8",
#     machine_type="a2-highgpu-1g", # launches an A100 GPU instance
    gpu_type="nvidia-tesla-t4",  # Test for a T4 instance
    n_workers=1,
    docker_image="rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8",
    worker_class="dask_cuda.CUDAWorker",
    env_vars={"EXTRA_PIP_PACKAGES": "gcsfs"}
)

which resulted in: TypeError: Parameter "project" value "" does not match the pattern "(?:(?:[-a-z0-9]{1,63}\.)*(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?):)?(?:[0-9]{1,19}|(?:[a-z0-9](?:[-a-z0-9]{0,61}[a-z0-9])?))"

roarjn avatar Dec 08 '20 17:12 roarjn

gcsfs has a number of credential options; I think you might only need token="cloud". Check the gcsfs credentials section: https://gcsfs.readthedocs.io/en/latest/index.html#credentials

import pandas as pd

df = pd.read_excel("gcs://bucket/path/file.xls",
                   storage_options={"token": "cloud"})
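Applied to the read_csv above, that could look something like this (a sketch, assuming gcsfs is installed on the workers, e.g. via the EXTRA_PIP_PACKAGES you are already passing):

import dask_cudf

# token="cloud" tells gcsfs to use the GCE VM's built-in Google credentials
data = dask_cudf.read_csv(
    "gs://airlines-2019.csv",
    storage_options={"token": "cloud"},
)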

If that doesn't work I think we can refer to some solution listed in this issue: https://github.com/dask/gcsfs/issues/231

quasiben avatar Dec 08 '20 18:12 quasiben

For #2 you may be missing projectid: https://github.com/dask/dask-cloudprovider/blob/53d3c92098ff58029d1d98041b38d3eebf9c7713/dask_cloudprovider/gcp/instances.py#L356-L365
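A quick way to check what the cluster will read for this is the Dask config itself (a minimal sketch; "cloudprovider.gcp" is the same key instances.py reads in the traceback above):

import dask.config

# An empty or missing "projectid" here is consistent with the empty
# "project" value in the TypeError above
print(dask.config.get("cloudprovider.gcp", {}))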

bradmiro avatar Dec 08 '20 23:12 bradmiro

I agree with @bradmiro it looks like your project ID is not correct or missing.

Once you have that fixed you are on the right track with configuring accelerators. Here are a few additional comments:

  • The a2 instance type family implies A100 GPUs; the two are tightly coupled.
  • Other instance families such as n1 can have accelerators assigned optionally, so you need to use the gpu_type and ngpus arguments to specify which accelerator to use and how many.
  • You can list available accelerators with gcloud compute accelerator-types list.

jacobtomlinson avatar Dec 09 '20 11:12 jacobtomlinson

Appreciate the help again.

Couple of observations:

  • The NVIDIA driver comes up ~4-5 minutes after launch, so the user will need to be a bit patient; the driver installation is still in progress even if you can already ssh to the worker.
  • Same with the docker image you are using - it might take a while to be pulled and available (even if you can ssh to the scheduler/worker right away and look for your docker image).

I am no longer getting the GCPCluster launch error with the config in https://github.com/dask/dask-cloudprovider/issues/211#issuecomment-740783677

Still having issues trying to read from GCS. Here are the steps I took and the current error trace. I created the file ~/.config/gcloud/application_default_credentials.json using gcloud auth application-default login as per https://github.com/dask/gcsfs/issues/231#issuecomment-607523449

# Create dask-cloudprovider GCPCluster - this config works with T4s on scheduler/worker
from dask_cloudprovider.gcp import GCPCluster
from dask.distributed import wait

cluster = GCPCluster(
    projectid="myproject",
    zone="us-west1-a",
    machine_type="n1-standard-4",
    gpu_type="nvidia-tesla-t4",  # Test for a T4 instance
    ngpus=1,
    n_workers=1,
    docker_image="rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8",
    worker_class="dask_cuda.CUDAWorker",
    env_vars={"EXTRA_PIP_PACKAGES": "gcsfs"}
)

I can see my GCS bucket here:

from gcsfs.core import GCSFileSystem

gcs = GCSFileSystem('my_project', token='/home/user/.config/gcloud/application_default_credentials.json')
gcs.ls('gs://my_project')  # displays files in my bucket

data = dask_cudf.read_csv("gs://crisp-sa/airlines-2019.csv", storage_options={"token": "/home/user/.config/gcloud/application_default_credentials.json"})  # read_csv seems to work - no error

data.head()  # gets the error trace below

---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
<ipython-input-73-80c7d069447b> in <module>
----> 1 data.head(5)

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask/dataframe/core.py in head(self, n, npartitions, compute)
   1004             Whether to compute the result, default is True.
   1005         """
-> 1006         return self._head(n=n, npartitions=npartitions, compute=compute, safe=True)
   1007 
   1008     def _head(self, n, npartitions, compute, safe):

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask/dataframe/core.py in _head(self, n, npartitions, compute, safe)
   1037 
   1038         if compute:
-> 1039             result = result.compute()
   1040         return result
   1041 

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask/base.py in compute(self, **kwargs)
    165         dask.base.compute
    166         """
--> 167         (result,) = compute(self, traverse=False, **kwargs)
    168         return result
    169 

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/dask/base.py in compute(*args, **kwargs)
    450         postcomputes.append(x.__dask_postcompute__())
    451 
--> 452     results = schedule(dsk, keys, **kwargs)
    453     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    454 

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2723                     should_rejoin = False
   2724             try:
-> 2725                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2726             finally:
   2727                 for f in futures.values():

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1990                 direct=direct,
   1991                 local_worker=local_worker,
-> 1992                 asynchronous=asynchronous,
   1993             )
   1994 

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    831         else:
    832             return sync(
--> 833                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    834             )
    835 

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    338     if error[0]:
    339         typ, exc, tb = error[0]
--> 340         raise exc.with_traceback(tb)
    341     else:
    342         return result[0]

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/utils.py in f()
    322             if callback_timeout is not None:
    323                 future = asyncio.wait_for(future, callback_timeout)
--> 324             result[0] = yield future
    325         except Exception as exc:
    326             error[0] = sys.exc_info()

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1849                             exc = CancelledError(key)
   1850                         else:
-> 1851                             raise exception.with_traceback(traceback)
   1852                         raise exc
   1853                     if errors == "skip":

KilledWorker: ("('read-csv-head-1-5-read-csv-fc59da1ea42a6c356ff1773434f08a44', 0)", <Worker 'tcp://10.138.15.199:45015', name: dask-384b5a7d-worker-ca3ef9fd, memory: 0, processing: 1>)

Interestingly this works:

data.shape
(Delayed('int-5bf13165-6210-4124-a716-0c36ab4ab268'), 26)

roarjn avatar Dec 09 '20 19:12 roarjn

The speed you observe is expected. You can cache a lot of this with Packer, but the default behaviour is to bootstrap each machine every time.

I expect the issue you are seeing is because the path /home/user/.config/gcloud/application_default_credentials.json doesn't exist on the workers.

You could use a code snippet like this to copy your credentials to the workers.

def write_credentials(credentials):
    # Runs on each worker: write the credentials to a path the worker can read
    with open("/tmp/application_default_credentials.json", "w+") as wfh:
        wfh.write(credentials)

# `client` is the dask.distributed Client connected to your GCPCluster
with open("/home/user/.config/gcloud/application_default_credentials.json") as fh:
    client.run(write_credentials, fh.read())

Then you would set the location of the worker credentials file in your read_csv call.

data = dask_cudf.read_csv("gs://crisp-sa/airlines-2019.csv", storage_options={"token": "/tmp/application_default_credentials.json"})

You could probably do this with client.upload_file too but I think that's intended for uploading code so I'm not sure how you would figure out the path to the file on the workers.


Also in the case where you see KilledWorker you may find cluster.get_logs() helpful to see more about what happened.
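For example (a minimal sketch; get_logs returns a mapping of scheduler/worker names to their log output):

# Inspect scheduler and worker logs after a KilledWorker to see what went wrong
logs = cluster.get_logs()
for name, log in logs.items():
    print(f"--- {name} ---")
    print(log)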

jacobtomlinson avatar Dec 10 '20 12:12 jacobtomlinson

Thanks @jacobtomlinson. Trying the below I get this error. Any suggestions?

def write_credentials(credentials):
    with open("./application_default_credentials.json", "w+") as wfh:
        wfh.write(credentials)

with open("/home/user/.config/gcloud/application_default_credentials.json") as fh:
    client.run(write_credentials, fh.read())

Exception: an integer is required (got type bytes)

roarjn avatar Dec 10 '20 20:12 roarjn

Can you post the full traceback ?

quasiben avatar Dec 10 '20 20:12 quasiben

Below is the trace. When I ssh to the worker and check the python version, it looks like it's not installed.

$ python --version
-bash: python: command not found
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-35-66777a445406> in <module>
      4 
      5 with open("/home/dgxuser/.config/gcloud/application_default_credentials.json") as fh:
----> 6     client.run(write_credentials)#, fh.read())

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in run(self, function, *args, **kwargs)
   2510         >>> c.run(print_state, wait=False)  # doctest: +SKIP
   2511         """
-> 2512         return self.sync(self._run, function, *args, **kwargs)
   2513 
   2514     def run_coroutine(self, function, *args, **kwargs):

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    831         else:
    832             return sync(
--> 833                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    834             )
    835 

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    338     if error[0]:
    339         typ, exc, tb = error[0]
--> 340         raise exc.with_traceback(tb)
    341     else:
    342         return result[0]

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/utils.py in f()
    322             if callback_timeout is not None:
    323                 future = asyncio.wait_for(future, callback_timeout)
--> 324             result[0] = yield future
    325         except Exception as exc:
    326             error[0] = sys.exc_info()

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/client.py in _run(self, function, nanny, workers, wait, *args, **kwargs)
   2439             ),
   2440             workers=workers,
-> 2441             nanny=nanny,
   2442         )
   2443         results = {}

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/core.py in send_recv_from_rpc(**kwargs)
    881             name, comm.name = comm.name, "ConnectionPool." + key
    882             try:
--> 883                 result = await send_recv(comm=comm, op=key, **kwargs)
    884             finally:
    885                 self.pool.reuse(self.addr, comm)

~/anaconda3/envs/daskcp/lib/python3.7/site-packages/distributed/core.py in send_recv(comm, reply, serializers, deserializers, **kwargs)
    680         if comm.deserialize:
    681             typ, exc, tb = clean_exception(**response)
--> 682             raise exc.with_traceback(tb)
    683         else:
    684             raise Exception(response["text"])

/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/core.py in handle_comm()

/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/scheduler.py in broadcast()

/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/utils.py in All()

/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/scheduler.py in send_message()

/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/core.py in send_recv()

Exception: an integer is required (got type bytes)

roarjn avatar Dec 10 '20 21:12 roarjn

> Below is the trace. When I ssh to the worker and check the python version, it looks like it's not installed.

Remember, you are starting up a docker container so you need to exec into the container.

Is it possible that writing inside the docker container is causing a problem but we don't see the exception nicely? Can you try writing to /tmp/NAME.json?

quasiben avatar Dec 10 '20 21:12 quasiben

I am able to manually copy the ./application_default_credentials.json file onto the worker nodes; it needs to be at /tmp/application_default_credentials.json inside the docker container running on the workers. It works once this file is present.

Is there a more elegant approach to copying the json file so the docker container can access it?

roarjn avatar Dec 10 '20 22:12 roarjn