s3contents

Do not expose S3 credentials to JupyterHub users

Open martinzugnoni opened this issue 5 years ago • 12 comments

While configuring s3contents, we need to provide credentials to connect to the S3 service. We can do that either by writing them directly into the ~/.jupyter/jupyter_notebook_config.py config file, or by reading them from env variables.
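For reference, a minimal credential setup in jupyter_notebook_config.py looks roughly like this (bucket name and key values are placeholders); this is exactly the material that ends up readable by the logged-in user:

```python
# ~/.jupyter/jupyter_notebook_config.py -- minimal sketch; bucket and keys are placeholders
from s3contents import S3ContentsManager

c = get_config()
c.NotebookApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.bucket = "my-notebooks-bucket"
# Anyone who can read this file (or the equivalent env variables) sees the keys:
c.S3ContentsManager.access_key_id = "AKIA..."
c.S3ContentsManager.secret_access_key = "..."
```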

In either case, any logged-in user can read the config file or inspect the exposed env variables from a terminal session.

I can't find a way to securely connect to S3 without exposing the credentials to the JupyterHub users.

I'm using DockerSpawner, with the official jupyterhub/singleuser image.

Any suggestions?

Thanks.

martinzugnoni avatar Sep 10 '18 14:09 martinzugnoni

Definitely an issue. I'm not sure how to solve this on this side; it might be something you have to ask on the Jupyter side, if there is a way to disable logging of specific variables in the config.

danielfrg avatar Sep 10 '18 15:09 danielfrg

I'm thinking about using a proxy to connect to S3, and providing unique access tokens for each Hub user. The idea is that the proxy evaluates the token, the logged-in user, and the action they're trying to perform, and determines whether it's a valid action or not.

It's important to mention that I'm using a user-based prefix strategy, where each user has their own namespace within the S3 bucket. A user should only have permission to read/write their own namespace in the bucket.
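The per-user prefix part of that strategy can be expressed in s3contents config; a sketch, assuming JupyterHub's JUPYTERHUB_USER env variable is set in the single-user container:

```python
import os

# Confine each user's view to users/<name>/ in the shared bucket.
# The layout is this thread's convention, not something s3contents enforces.
c = get_config()
c.S3ContentsManager.prefix = "users/" + os.environ.get("JUPYTERHUB_USER", "unknown")
```

Note that this only scopes the notebook UI: with shared credentials, a user in a terminal can still reach other prefixes, which is exactly the problem this issue is about; the proxy (or per-user policies) would have to enforce the boundary server-side.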

It might work, but some implementation would be required on top of your s3contents app. ¯\_(ツ)_/¯

martinzugnoni avatar Sep 10 '18 16:09 martinzugnoni

That makes sense.

You can take a look at this issue that faces a similar problem: https://github.com/danielfrg/s3contents/issues/45 but it doesn't solve the credentials issue.

One solution might be to pass the credentials from a JupyterHub setting that the users have to input.

danielfrg avatar Sep 10 '18 16:09 danielfrg

I wrote that issue. 😂

martinzugnoni avatar Sep 10 '18 16:09 martinzugnoni

ROFL :)

danielfrg avatar Sep 10 '18 16:09 danielfrg

There is always setting up an IAM role for the host. That does mean they have access from a server standpoint (which means they can fetch whatever is allowed by that role), but at least doesn't expose the tokens directly in config.
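With an instance role, no static keys need to appear in the s3contents config at all, assuming s3contents leaves credential resolution to boto3's default chain when no keys are set. A sketch (bucket name is a placeholder):

```python
# jupyter_notebook_config.py -- sketch assuming the host has an IAM instance role
from s3contents import S3ContentsManager

c = get_config()
c.NotebookApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.bucket = "my-notebooks-bucket"  # placeholder name
# No access_key_id / secret_access_key set here: boto3's default credential
# chain falls back to the instance role, so no secret lands in config or env.
```

The trade-off is as described above: every process on the host, including user terminals, can still obtain credentials from the instance metadata endpoint.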

rgbkrk avatar Oct 05 '18 21:10 rgbkrk

Yes, we thought about IAM roles as well. But, as you say, the correct approach would be to have a different AWS user for each user in your system, which is not feasible if you have a ton of users. That's why we discarded that option.

martinzugnoni avatar Oct 05 '18 22:10 martinzugnoni

@martinzugnoni Is this issue still a problem for you? I have a few different workarounds that I could type up if they would help you.

I have one block of JupyterHub config that drives AWS to dynamically create a new IAM User for each JupyterHub user. It creates them at first login if they don't already exist; either way, it pulls the IAM User keys at each login (which does make it really hard to use the keys for anything else). It should work for up to 4,999 users (one IAM User is consumed by JupyterHub itself to do the work).

I also have a block that starts out the same but then generates temporary keys from the IAM User keys, so the keys expire and become harmless, in exchange for a maximum duration on the user's session. Depending on the style chosen, the max time on the temporary keys is 12 or 36 hours (minimum 10 minutes).

The above could also be switched to federated users, which allows a practically unlimited number of users but forces a 12-hour max on the keys.
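As a rough sketch of the temporary-key variant using boto3 and STS (the bucket name, the users/&lt;name&gt;/ prefix layout, and the helper names are assumptions for illustration, not milutz's actual config):

```python
import json

def scoped_policy(bucket, username):
    """Session policy limiting the temporary keys to one user's prefix.
    The users/<name>/ layout follows the convention from earlier in this thread."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/users/{username}/*",
            ],
        }],
    }

def issue_temp_keys(username, bucket, duration=43200):
    """Exchange long-lived IAM user keys for expiring ones via GetFederationToken."""
    import boto3  # pip install boto3; caller needs sts:GetFederationToken permission
    sts = boto3.client("sts")
    resp = sts.get_federation_token(
        Name=username[:32],                       # federation name is capped at 32 chars
        Policy=json.dumps(scoped_policy(bucket, username)),
        DurationSeconds=duration,                 # 15 min up to 36 h for federation tokens
    )
    return resp["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken, Expiration
```

Temporary credentials come with a session token, so whatever S3 client the contents manager uses must be given the token alongside the key pair.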

Assuming you still have the issue, let me know which constraint is more important to you and I'll see if I can put together some sample code.

milutz avatar Jul 06 '19 03:07 milutz

@martinzugnoni We have a different approach that may be useful for you or other people dealing with this problem. Our whole installation is based on OpenShift, but it should work in any Kubernetes environment at least, and likely in other environments as well.

We have a central datalake based on Ceph. Each user has their own S3 credentials and a set of buckets they have access to (instead of one bucket with prefixes). We store all user information (credentials) in a HashiCorp Vault instance.

Users authenticate themselves on JupyterHub through OAuth using Keycloak. We store the access information, basically the JWT access_token and refresh_token (encrypted), in the JupyterHub database (tokens are refreshed periodically).

When a user launches a notebook, we use the pre_spawn_start function from JupyterHub to connect to Vault using the access_token and retrieve the user's aws_key and secret. We have a dynamic policy in Vault attached to the path where we store secrets (/some_path/user_id/secret) that allows each user to retrieve their secrets and only theirs (which is why we need a valid access token from Keycloak to enforce this policy).

Then we simply inject the secrets as env vars in the notebook and use S3Contents to connect. No problem if the user sees them, since they are their own! In fact it's a bit more convoluted than that: we use HybridContentsManager to also connect the local filesystem, and to mount each of the accessible buckets at a different path. Notebook code is here if you want to have a look: https://github.com/guimou/jupyter-notebooks-s3/blob/0e0979ae3fdb303e84a58ac85b6ec99357457ee6/minimal-notebook/jupyter_notebook_config.py#L28 but I will release a clean version in the coming days, along with the JupyterHub, Keycloak, and Vault configurations and an article explaining everything in detail. I'll update this comment when it's ready.
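A rough sketch of that pre_spawn_start flow follows; the Vault URL, role name, and secret field names are hypothetical, hvac is the Python Vault client, and this is a paraphrase of the pattern, not guimou's actual code:

```python
# jupyterhub_config.py -- sketch of the Vault pattern described above.

def vault_secret_path(user_id):
    # Mirrors the /some_path/<user_id>/secret layout from the comment.
    return f"some_path/{user_id}/secret"

async def fetch_and_inject_s3_keys(user, spawner, vault_url="https://vault.example.com"):
    """Intended to be called from an Authenticator's pre_spawn_start(user, spawner)."""
    import hvac  # pip install hvac; imported lazily for illustration
    auth_state = await user.get_auth_state()
    if not auth_state:
        return  # auth state must be enabled and persisted by the authenticator
    client = hvac.Client(url=vault_url)
    # Log in to Vault with the user's Keycloak JWT (Vault's JWT auth method),
    # so Vault's dynamic policy only exposes this user's own path.
    client.auth.jwt.jwt_login(role="jupyter", jwt=auth_state["access_token"])
    secret = client.read(vault_secret_path(user.name))["data"]
    # Safe to expose in the user's pod: these credentials belong to them alone.
    spawner.environment["AWS_ACCESS_KEY_ID"] = secret["aws_key"]
    spawner.environment["AWS_SECRET_ACCESS_KEY"] = secret["aws_secret"]
```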

guimou avatar Aug 12 '19 03:08 guimou

@martinzugnoni Here is the article I published with more details on our implementation. The repos with everything are now there for JupyterHub and there for the notebooks. A huge thanks to @danielfrg for his work; if we ever have the chance to meet, the beer's on me!

guimou avatar Sep 04 '19 21:09 guimou

Thanks @guimou for the Medium article. A question if you don't mind: in the "Directories mapping" section of jupyter_notebook_config.py, how do I specify the certificate file location so that S3ContentsManager can access S3 over HTTPS? I looked through the properties of c.S3ContentsManager and c.NotebookApp and don't have a clue.

chenglinzhang avatar Oct 30 '19 17:10 chenglinzhang

Is this still an issue? I've tried looking for jupyter_notebook_config.py in my bare-metal (two-node) k3s cluster from the view of a regular user, and I can't seem to find it in the ~/.jupyter dir.
Is this only an issue for DockerSpawner? Does it also apply to KubeSpawner?

nlhnt avatar Mar 23 '23 08:03 nlhnt