
dask pod service account access to non public storage (s3, gs buckets)

scottyhq opened this issue Dec 02 '19 · 9 comments

For better security and cost savings we are moving towards non-public (requester pays) buckets for data storage. To access these buckets on AWS we recently reconfigured the hubs to assign an IAM role to a Kubernetes service account. Specifically, the `daskkubernetes` service account gets an IAM role with a policy for accessing specific buckets in the same region. The `daskkubernetes` service account gets assigned to jupyterhub users in the pangeo helm chart here: https://github.com/pangeo-data/helm-chart/blob/56dc755ed0b56ad00571373d70c7fe0eaae5d556/pangeo/values.yaml#L25
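For reference, on EKS the IAM-role-to-service-account link is just an annotation on the Kubernetes service account. A minimal sketch, where the account ID and role name are placeholders:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: daskkubernetes
      annotations:
        # placeholder ARN; this role carries the bucket-access policy
        eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/pangeo-bucket-access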

This works great for pulling data into a jupyter session, but we're currently encountering errors when loading data with dask workers via s3fs/fsspec. The errors do not always make it obvious that permissions are the problem, for example `returned non-zero exit status 127.` and `KilledWorker: ('zarr-df194f82d92e97d5d5e60f0de5da8a42', <Worker 'tcp://192.168.169.195:33807', memory: 0, processing: 3>)`.

I think the root of the issue is that dask worker pods are currently assigned the `default` service account and therefore do not have permissions for accessing non-public pangeo datasets:

    kubectl get pod -o yaml -n binder-staging dask-scottyhq-pangeo-binder-test-xg8nlaic-f8372c69-9mmg6m | grep service

    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
  serviceAccount: default
  serviceAccountName: default

One solution is linking cloud-provider permissions to the default service account, but should we instead create a new service account exclusively for dask worker pods?

pinging @jacobtomlinson @TomAugspurger and @martindurant per @rsignell-usgs and @jhamman 's suggestion

scottyhq commented Dec 02 '19

I think the root of the issue is that dask worker pods are currently assigned the default service account and therefore do not have permissions for accessing non-public pangeo datasets.

To verify this, you could try

def func():
    # function to load the data. Something like
    import s3fs  # import inside func so it resolves on the workers too
    fs = s3fs.S3FileSystem()  # rely on the service account
    fs.open("path/to/private/object")

In theory, func() should work on the client, but client.run(func) would fail.

TomAugspurger commented Dec 02 '19

Thanks @TomAugspurger, I forgot to include a code block! Here is the output from your test case run on the aws-uswest2 hub:

(s3fs=0.4, dask=2.8.1, botocore=1.13.29)

def func():
    import s3fs
    # function to load the data. Something like
    fs = s3fs.S3FileSystem()  # rely on the service account
    fs.open("pangeo-data-uswest2/esip/NWM2/2017")

client.run(func)
/srv/conda/envs/notebook/lib/python3.7/site-packages/botocore/auth.py in add_auth()
    355     def add_auth(self, request):
    356         if self.credentials is None:
--> 357             raise NoCredentialsError
    358         datetime_now = datetime.datetime.utcnow()
    359         request.context['timestamp'] = datetime_now.strftime(SIGV4_TIMESTAMP)

NoCredentialsError: Unable to locate credentials

Note also that the AWS docs suggest a minimum awscli version of 1.16.283 to resolve credentials via the service account; installing that version seems to pull in botocore 1.13.19.

scottyhq commented Dec 02 '19

It would make sense to me if the dask workers and the normal user interactive pods had the same ownership and permissions. The only difference is that a dask worker would not normally want to create new pods (but it perhaps could).

Is the above situation with dask-kubernetes or dask-gateway?

martindurant commented Dec 02 '19

It would make sense to me if the dask workers and the normal user interactive pods had the same ownership and permissions. The only difference is that a dask worker would not normally want to create new pods (but it perhaps could).

Agreed. Is it possible for any dask pods created by a user pod to inherit the same service account? A short-term easy fix is to assign all dask pods the `daskkubernetes` service account in some dask config setting (here? https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/dask_config.yaml#L31), as sketched below. But further down the line it would be useful for each user to have a unique service account / IAM role (for granular permissions and cost-tracking), and then it would be best for the dask pods to inherit it.
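A minimal sketch of that config change, assuming the dask-kubernetes `worker-template` layout used in that file (the image value is a placeholder):

    kubernetes:
      worker-template:
        spec:
          # give workers the same service account as the notebook pod,
          # so they pick up the IAM role for bucket access
          serviceAccountName: daskkubernetes
          containers:
            - name: dask-worker
              image: placeholder/pangeo-notebook:latest  # placeholder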

Is the above situation with dask-kubernetes or dask-gateway?

dask-kubernetes.

Still haven't tried with dask-gateway. Maybe @jhamman has?

scottyhq commented Dec 02 '19

I suspect dask-gateway does the right thing here, and yes, I know that trials are underway, but I don't know how far they have progressed. @jcrist would also know both these things.

martindurant commented Dec 02 '19

A short-term easy fix is to assign all dask pods the daskkubernetes service account in some dask config setting (here? https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/dask_config.yaml#L31).

Yeah, that should work. This wouldn't be any less secure than the status quo, and should get things working for now.

But further down the line it would be useful for each user to have a unique service account / iam role (for granular permissions and cost-tracking), and then it would be best for dask pods to inherit.

This should be doable with dask-gateway, but nothing is built in. How would you map usernames to IAM roles/service accounts? If there's a way to do this where dask-gateway doesn't need to store and manage this mapping, then this should be fairly easy to hack up with no additional changes to the gateway core itself.

jcrist commented Dec 02 '19

How would you map usernames to IAM roles/service accounts? If there's a way to do this where dask-gateway doesn't need to store and manage this mapping then this should be fairly easy to hack up with no additional changes to the gateway core itself.

I don't think there is a straightforward way to do this currently in Zero2JupyterHubK8s config. See https://github.com/dask/dask-kubernetes/issues/202#issuecomment-546864643 and https://github.com/jupyterhub/kubespawner/pull/304.

  1. If 304 linked above is merged, it would be straightforward to create a per-user IAM role as part of a pod startup script and link it to the service account in the per-user namespace: https://docs.aws.amazon.com/eks/latest/userguide/specify-service-account-role.html

  2. Alternatively, it seems possible to have an 'assume role' API call as part of a startup script and inject temporary credentials as environment variables (see the sketch after this list): https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1103
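A rough sketch of that second option (untested), assuming a boto3 STS call at startup; the role ARN and session name are placeholders:

    import os
    import boto3

    def export_assumed_role_credentials(role_arn):
        # assume the role and grab short-lived credentials
        creds = boto3.client("sts").assume_role(
            RoleArn=role_arn,
            RoleSessionName="jupyter-user-session",  # placeholder name
        )["Credentials"]
        # standard variables that botocore/s3fs already know how to read
        os.environ["AWS_ACCESS_KEY_ID"] = creds["AccessKeyId"]
        os.environ["AWS_SECRET_ACCESS_KEY"] = creds["SecretAccessKey"]
        os.environ["AWS_SESSION_TOKEN"] = creds["SessionToken"]

    # placeholder ARN for a per-user role
    export_assumed_role_credentials("arn:aws:iam::123456789012:role/user-specific-role")

One caveat: the temporary credentials expire (an hour by default), so a long-running session would need to refresh them.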

scottyhq commented Dec 02 '19

I think you could do this right now by configuring a `post_auth_hook` to create a new service account/IAM role for the user (if not already created). The service account could then be configured for the notebook by adding a `modify_pod_hook` (alternatively these could be combined into just a `modify_pod_hook`; probably fine either way). This would allow jupyterhub to manage creating the service accounts per user. I don't think a separate namespace per user would be needed at all here, but I may be wrong.
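For example, something along these lines (a sketch, untested; `ensure_user_service_account` is a hypothetical helper that provisions the IAM role and service account on first use and returns the service account name):

    # in jupyterhub_config.py (z2jh: hub.extraConfig) -- a sketch
    def modify_pod_hook(spawner, pod):
        # hypothetical helper: create or look up a per-user
        # IAM role + Kubernetes service account
        sa_name = ensure_user_service_account(spawner.user.name)
        pod.spec.service_account_name = sa_name
        return pod

    c.KubeSpawner.modify_pod_hook = modify_pod_hook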

jcrist commented Dec 03 '19

In a recent chat with @yuvipanda he pointed me to a nice model for provisioning per-user policies and buckets on GCP that will be relevant once we get around to trying some of the approaches suggested in this issue: https://github.com/berkeley-dsep-infra/datahub/blob/staging/images/hub/sparklyspawner/sparklyspawner/init.py

scottyhq commented Jan 06 '20