zero-to-jupyterhub-k8s
Possible Error in Docs: Incorrect Config suggested in docs related to integration of EFS storage and EKS IAM
Background Context:
- I've never used / installed jupyter notebooks before
- I ran into this on a 4-hour troubleshooting call while helping another engineer debug their environment.
- The details I'll post are my recollection of a troubleshooting call that ultimately resulted in fixing the issue. The notes will be incomplete, as I'm going off memory of bits and pieces seen over screen share, but I figured it'd be worth documenting the relevant notes I can recall while they're still fresh in my head, in case this helps anyone else in the future.
- Since this wasn't my environment, and it's a tool I'm not familiar with:
  - I don't know how to reproduce it
  - and I'm limited in the amount of detail I can share.
Bug description
The following docs seem to suggest a problematic configuration:
https://z2jh.jupyter.org/en/stable/kubernetes/amazon/efs_storage.html
The gist of the config problem is:
- If you follow the doc's recommended config, it does make EFS storage access work.
- BUT that config causes an integration issue: if you were leveraging EKS IAM from the Jupyter Notebook web terminal, that will stop working.
Here's a screen-snip of what the docs looked like at the time of the issue:
Proposed change of better suggested configuration:
I think the recommended config should look more like
uid: 0
fsGid: 100 (or blank/omitted entirely)
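As a minimal sketch of how that proposal might be applied, the snippet below writes the two values to a separate Helm values file. The keys singleuser.uid and singleuser.fsGid are the chart settings this issue refers to. The file names efs-iam-values.yaml and config.yaml and the <release> and <namespace> placeholders are assumptions for illustration only, and the remaining EFS settings from the docs are assumed to stay as they are.

```shell
# Write the proposed override to its own values file (hypothetical file name).
cat > efs-iam-values.yaml <<'EOF'
singleuser:
  uid: 0       # container starts as root; the image's start script then switches to jovyan
  fsGid: 100   # the "users" group, instead of the 0 suggested by the EFS doc
EOF

# The file would then be passed to Helm after the existing config, for example
# (placeholders, not real names):
#   helm upgrade <release> jupyterhub/jupyterhub --namespace <namespace> \
#     --values config.yaml --values efs-iam-values.yaml
```

Keeping the override in its own file makes it easy to roll back if it turns out not to suit a given image.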
How to reproduce
- I can't; see background context.
Your personal set up
- I'll be limited in details I can share, due to background context.
- Running on EKS with EFS
- Sometimes commands like
  aws sts get-caller-identity
  were run from the web terminal. So the web terminal was leveraging both:
  - AWS IAM integration
  - AWS EFS mount point integration.
Expected behavior
- If a user is using a workflow that leverages aws cli commands in Jupyter Notebook's web terminal (like aws sts get-caller-identity)
- and they then want to add EFS support, following the EFS support docs shouldn't make EFS work in exchange for breaking EKS IAM. Both should work.
Actual behavior
- If you have EKS IAM working and you go to add EFS support, following the docs results in EFS working at the expense of IAM breaking.
- Notes from a post-call brain dump (hopefully this helps others running into a similar issue debug faster in the future).
Notes:
- Adding uid: 0, fsGid: 0, and CHOWN_HOME: "yes", which were mentioned in their docs: https://z2jh.jupyter.org/en/stable/kubernetes/amazon/efs_storage.html
- Adding the above fixed efs, but resulted in IAM breaking.
- By IAM breaking I mean:
  Running
  aws sts get-caller-identity
  in the Jupyter web interface's terminal would fail with a filesystem permission error:
  [Errno 13] Permission denied: '/var/run/secrets/eks.amazonaws.com/serviceaccount/token'
- Important detail: The docker image had logic in start.sh & start-singleuser.sh to do a live change of the user from root to jovyan upon startup. The temp root access was likely used to change the ownership of home (efs) to the jovyan user.
  - https://github.com/Paperspace/jupyter-docker-stacks/blob/master/base-notebook/start.sh
  - https://github.com/Paperspace/jupyter-docker-stacks/blob/master/base-notebook/start-singleuser.sh
- The reason it was broken seemed to be that the script changed the active shell user & ownership of most files on the container's file system to jovyan, BUT there was a key file related to IAM that was still owned by root as a result of the container starting off as the root user.
  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
  was owned by the root user and group (root:root, per ls -lah /var/run/secrets/eks.amazonaws.com/serviceaccount/token), so the jovyan user didn't have access.
  (We played around a bit and found that even if you override the kube yaml defaults which list that as read only, it stays read only due to its nature, so its permissions can't be updated at run time, only established at container creation time.)
- We did discover a hacky workaround that allowed both (efs and iam) to work at the same time using the specs recommended in the doc (uid: 0, fsGid: 0).
  The workaround involved:
  - updating 2 settings to enable sudo to work in the container
    https://z2jh.jupyter.org/en/stable/resources/reference.html#singleuser-storage-static
    was used as a point of reference
    singleuser.allowPrivilegeEscalation was set to true
    and we had to enable some setting in another spot that I can't recall off the top of my head.
  - That allowed the following commands to work:
    sudo cp /var/run/secrets/eks.amazonaws.com/serviceaccount/token /home/jovyan/token
    sudo chown $USER:$USER /home/jovyan/token
    export AWS_WEB_IDENTITY_TOKEN_FILE=/home/jovyan/token
    aws sts get-caller-identity
    (before, aws sts get-caller-identity was throwing a file system permission error when whoami returned jovyan)
  - The above allowed both efs & IAM to work. It was a hacky manual workaround, but it at least proved it was possible for both to work at the same time.
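To tell which of these states a pod is in, a small check like the one below could be run from the web terminal. This is only a sketch: check_token is a hypothetical helper name, it assumes a Linux image whose stat supports -c (GNU coreutils or busybox), and it only reports ownership and readability rather than calling AWS.

```shell
# Report who owns the IAM token file and whether the current user can read it.
check_token() {
    # Use the argument if given, else the env var the AWS CLI reads,
    # else the path quoted in this issue.
    token="${1:-${AWS_WEB_IDENTITY_TOKEN_FILE:-/var/run/secrets/eks.amazonaws.com/serviceaccount/token}}"
    echo "current user uid:gid = $(id -u):$(id -g), groups = $(id -G)"
    # stat -L follows symlinks; '%u:%g %a' prints owner uid, group gid, octal mode.
    echo "token owner and mode = $(stat -L -c '%u:%g %a' "$token" 2>/dev/null || echo unknown)"
    if [ -r "$token" ]; then
        echo "readable: yes"
        return 0
    fi
    echo "readable: no"
    return 1
}
```

Running check_token with no arguments inspects whichever token file the AWS CLI would use, so it covers both the default path and a copied token such as /home/jovyan/token.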
- Removing those newly added configurations (uid: 0, fsGid: 0), which the docs (https://z2jh.jupyter.org/en/stable/kubernetes/amazon/efs_storage.html) suggested adding to make efs work, brought efs back into a broken state but fixed IAM (basically rolled back to the previous config).
  - When IAM was working,
    ls -lah /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    showed jovyan had access to it.
    (I think the file's ownership was set to user_id:group_id 1000:100, which would correspond to jovyan:users.)
- Through trial and error, we discovered a solution that allowed both (EKS IAM and EFS storage mount) to work at the same time.
  - We went against what the docs recommended and set uid: 0 with a blank fsGid (which I think has an explicit default of fsGid: 100, representing a file system group named users).
  - I think we also left enable root or singleuser.allowPrivilegeEscalation enabled from our testing, but I don't recall if that was actually needed or not.
  - That allowed both (AWS IAM calls and EFS file system access) to work. I'll try to recall some observations of the setup.
    - Even though uid: 0 was set, there were some startup scripts built into the container that made it so the user you got when you requested an interactive web terminal via the web GUI would be jovyan when checked with the whoami command.
    - In that setup of the ideal config,
      ls -lah /var/run/secrets/eks.amazonaws.com/serviceaccount/token
      showed 0:100 (owned by "root" (0), and the group named "users" (100) had access), which allowed aws cli commands from user jovyan to continue working.
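The ownership observations above can be illustrated with a small model of the standard owner/group/other read check. This is a sketch under assumptions: can_read is a hypothetical helper, and the modes 640 and 600 are guesses consistent with the behaviour described (group readable, not world readable), since the actual mode bits were not recorded in this issue.

```shell
# can_read UID "GIDS" OWNER_UID OWNER_GID MODE
# Prints yes/no for whether a process with UID and the space-separated GIDS
# may read a file with the given owner, group, and octal mode.
can_read() {
    uid=$1; gids=$2; owner=$3; group=$4
    mode=$((0$5))                 # leading 0 makes the shell parse the mode as octal
    mask=$((0004))                # default: the "other" read bit
    if [ "$uid" -eq 0 ]; then
        echo yes; return 0        # root bypasses the permission bits
    elif [ "$uid" -eq "$owner" ]; then
        mask=$((0400))            # owner read bit
    else
        for g in $gids; do
            [ "$g" -eq "$group" ] && mask=$((0040))   # group read bit
        done
    fi
    if [ $((mode & mask)) -ne 0 ]; then echo yes; else echo no; fi
}

# jovyan is uid 1000 in group users (100), per the notes above.
can_read 1000 "100" 0 0 640       # uid: 0, fsGid: 0 (token root:root)  -> no
can_read 1000 "100" 0 100 640     # uid: 0, fsGid: 100 (token 0:100)    -> yes
can_read 1000 "100" 1000 100 600  # original config (token 1000:100)    -> yes
```

Under these assumptions the only failing case is the root:root token, which matches the Permission denied error seen with the config from the docs.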
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:
Thanks for the notes. I think the EFS doc is aimed at a relative newcomer to AWS. If you've got more experience of AWS and you've got some time it might be worth checking the latest offerings from AWS. For example, it looks like there's an EKS CSI driver for EFS: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html If you have a chance to look at this please let us know if it's a better replacement for the current instructions!