docker-stacks
docker-stacks copied to clipboard
Install nbgitpuller in the base image
Describe your changes
nbgitpuller is a very popular way to distribute content to users, both on mybinder.org as well as on JupyterHubs. On Binder, this allows for fast launches where the environment is separated from the content (see https://discourse.jupyter.org/t/tip-speed-up-binder-launches-by-pulling-github-content-in-a-binder-link-with-nbgitpuller/922). So by putting nbgitpuller in these images, users can simply pick an image to launch and then just pull in the content they need, instead of having to rebuild their own image. It's also extremely popular when used with JupyterHub, and is included by default in the TLJH distribution.
Checklist (especially for first-time contributors)
- [x] I have performed a self-review of my code
- [ ] If it is a core feature, I have added thorough tests
- [x] I will try not to use force-push to make the review process easier for reviewers
- [ ] I have updated the documentation for significant changes
FYI a previous discussion https://github.com/jupyter/docker-stacks/issues/751
Ah I see! I would really love to see it included in the base image though. As evidence of its usefulness, I can present:
- Guide to using it with TLJH: https://tljh.jupyter.org/en/latest/howto/content/nbgitpuller.html
- Mention on how to use it with z2jh in the docs: https://z2jh.jupyter.org/en/latest/jupyterhub/customizing/user-environment.html#using-nbgitpuller-to-synchronize-a-folder
- Popularity in the mybinder use case: https://discourse.jupyter.org/t/tip-speed-up-binder-launches-by-pulling-github-content-in-a-binder-link-with-nbgitpuller/922
- Personal experience in helping run a lot of hubs, and trying to move people to use images directly from docker stacks to the extent possible! In many educational contexts, the base images are more than enough - but they do need nbgitpuller to distribute content.
Since I think this project doesn't yet have clear acceptance criterias for introducing new software in images, I'd like to make a thorough effort to establish some precedence on what is worth considering.
- Software impact for users of docker-stacks images I think @yuvipanda has covered this quite well above already, but I'd like to add that I think nbgitpuller can help reduce the need to build new images. Sometimes its enough to be able to start an image and download something little extra, and if nbgitpuller can make that happen, it can offload users of building a custom image and/or requesting that docker-stacks add something to an image.
- Software health, details, and maintenance status
I consider
nbgitpullerto be quite healthy based on:- it maintains documentation
- a changelog is actively maintained
- semver2 versioning is adopted, and its beyond version 1.0.0
- has passing tests for various python versions (including py311) and git versions, and tests are passing
- a release procedure with helpful automation is established
- multiple people are involved in its maintenance
- has been around for several years
- provides a conda-forge package besides a pypi package, where both are kept up to date
- is a noarch package, it won't be troublesome to build on arm64 etc
- Installation consequences
- nbgitpuller relies on
gitto be available on the system, jupyter-server, and tornado - all of these are available in minimal-notebook already, so adding nbgitpuller there doesn't add something extra
- it adds 1.5MB to a 1.6GB image according to
diveinspecting it - it took my computer 15 second to run the additional
mamba installstep in minimal-notebook, but I expect that most of this time stemmed from doingmamba installat all
- nbgitpuller relies on
- docker-stacks "image fit"
nbgitpullerdoesn't fit in base notebook, because it doesn't havegit, but minimal-notebook does- nbgitpuller userfulness is independent of the additions done beyond minimal-notebook, its a jupyter server extension plausible to provide value for any docker-stacks image based on minimal-notebook.
- minimal-notebook doesn't yet install a python package with mamba, only
apt, its described to "jupyter/minimal-notebook adds command-line tools useful when working in Jupyter applications.". I think its reasonable to broaden this to include a jupyter server extension generally useful across the derived images.
- Why it shouldn't just be a documented recipie
- If its not part of the image, it won't need a recipie as its simple to install
- This package makes sense to have pre-installed or not at all pretty much. For example, a teacher could with it pre-installed ask users to click a link to get started, but without having it pre-installed the teacher wouldn't beneift from having nbgitpuller at all - the point can often be to have students just click a link to get started with curriculum material and not need to ask them to do misc other things first.
My conclusion
Initially I wasn't confident what I thought even though I value nbgitpuller, but after reflecting on the things above, I think its a great idea to add it to minimal-notebook.
Action points
- [ ] Reach agreement to install this or not in some image @mathbunnyru @manics what do you think about introducing nbgitpuller?
- Make at least these updates to this PR:
- [ ] Update
nbgitpullerto install in the agreed image - [ ] Update docs about minimal-notebook to reflect it goes beyond
aptpackages, but also a general purpose jupyter server extension - [ ] Update tests suite to also verify nbgitpuller can be imported like done for most other packages I think
- [ ] Update
References
- The Selecting an image docs provides a useful overview to keep in mind considering where
nbgitpullercould fit - The nbgitpuller's dependencies
Since I think this project doesn't yet have clear acceptance criterias for introducing new software in images, I'd like to make a thorough effort to establish some precedence on what is worth considering.
I'd add a 6th point to those criteria: Impact on security: E.g. Does the package open additional ports, or add new web endpoints, that could be exploited?
- IIRC (haven't looked at nbgitpuller for ages): nbgitpuller allows arbitrary git repositories to be automatically pulled into the filesystem via a URL. A malicious actor could send an nbgitpuller URL to a user who would launch their singleuser server and clone the malicious repo. Although automatic execution is not possible a naive user could easily load the malicious content.
Based on that I'd be reluctant to automatically push it out to every JupyterHub.
I will take a look at this next week. I'm not familiar with nbgitpuller, and I will try it.
Thanks @mathbunnyru! Let me know if I can help in any way :)
I agree with almost all the points made by @consideRatio (and thank you for sharing these thoughts, they are extremely useful) and I do agree that impact on security is important as well.
But I have some doubts about this one.
This package makes sense to have pre-installed or not at all pretty much. For example, a teacher could with it pre-installed ask users to click a link to get started, but without having it pre-installed the teacher wouldn't beneift from having nbgitpuller at all - the point can often be to have students just click a link to get started with curriculum material and not need to ask them to do misc other things first.
This assumes that the teacher doesn't want/can't change the single-user image, otherwise, the teacher could have had an image derived from one of our images and installed this package on top (and then shared a link).
I think "our images work perfectly in 95% of use cases" is not feasible. There will always be things/features missing and you only need this one more package. What is feasible (in my opinion) is "we cover 99% of use cases, but you may need to install a couple of things on top of them, and we're going to make this as easy for you as we can".
That being said, I believe our images should move forward and when new things become widespread (or some things get deprecated), we should adapt our images accordingly. Still, drawing a line between what to include and what not is difficult.
Maybe we should have some kind of vote on what to include/exclude. (and the same for the new images like pytorch-notebook).
I might have some bias when I don't want to add new things to make my life as a maintainer easier, and I definitely can't tell what our community needs (because I'm not always as engaged as other Jupyter members), so I shouldn't a final decision on this question.
Voting should allow us to add new things when there is a certain need in the community.
It takes some time to set up, but it should work better than dealing with these things on an individual basis each time.
I will still have the power to close PRs/feature requests which do not meet some eligible criteria (I think the @consideRatio checklist is a nice example of what these criteria might be), so we only will have this vote happening a few times per year (based on the PRs I've seen for the past few years).
Obviously, people will be able to share their ideas about why this package should/shouldn't be included, so people might change their opinions and will be open for discussion.
At the same time, we will have some deadlines for the vote (like 3-4 weeks), to give everyone a chance to become familiar with the change and vote accordingly.
I would also appreciate it if someone could propose a list of voters. I think @consideRatio, @manics and @yuvipanda might have ideas about who should vote for the changes.
We should also be able to introduce new voters, so sometimes we will have votes for introducing new voters.
If we implement this, obviously https://github.com/jupyter/docker-stacks/issues/1958 will be solved, as it has exactly the same problem.
Please, tell me what you think. Sorry, I haven't just merged the PR 😆
@yuvipanda could you please do things suggested in Action points above, and then I will start the voting process?
We need to think what should we do, when one of the voters suggests a new feature though.
@mathbunnyru I don't think we should ask for the PR to finalize before voting, as I don't think it gives us more information of relevance to the decision on the intent to approve a PR to add ngitpuller. It can be the other way around though!
I suggest that we try to decide if nbgitpuller should be installed to minimal-notebook (currently the PR adds to base-notebook that doesn't have git installed which it needs).
Ok, makes sense to me, thanks. How do you think, what should we do in the case one of the voters suggests a new feature?
My take is that nbgitpuller is very reasonable to install in minimal-notebook based on all the criterias 1-5 in the policy, but I'm hesitant about criteria 6 about security.
With nbgitpuller, a new web server endpoint is added that can have files be downloaded into a folder. It can also be used to after download redirect the user to some path, for example to have jupyterlab open a file - or anything else that can be prompted by browsing to some URL.
I think the redirection part isn't a notable concern, that can be done without nbgitpuller by directly providing a link that does something first. I think the download is a concern though. I'm thinking about an exploit could be to craft a link that downloads a folder that will be read by other software, such as configuration folders .config, .local, .aws, .jupyter etc. If they are downloaded into the home folder, they may be unseen for the user not always showing hidden folders, and they may probably lead to malicious code execution by the person providing the malicious nbgitpuller link. nbgitpuller wouldn't create a new folder if it already exists, but could create new folders.
Any package installed in the image being malicious would be trouble, so installing anyting is a risk. Installing nbgitpuller though is introducing another kind of risk though, about clicking links that are maliciously crafted.
I think I'll abstain from voting favorably for now, but think a resolution to https://github.com/jupyterhub/nbgitpuller/issues/330 could make me rethink it.
Ok, I think your point is valid and the upstream issue is relevant to solving it, nice work 👍
I'm currently not convinced people aren't able to install this package to benefit from it, so I'm not in favor of merging this package currently.
Since I think this project doesn't yet have clear acceptance criterias for introducing new software in images, I'd like to make a thorough effort to establish some precedence on what is worth considering.
I'd add a 6th point to those criteria: Impact on security: E.g. Does the package open additional ports, or add new web endpoints, that could be exploited?
- IIRC (haven't looked at nbgitpuller for ages): nbgitpuller allows arbitrary git repositories to be automatically pulled into the filesystem via a URL. A malicious actor could send an nbgitpuller URL to a user who would launch their singleuser server and clone the malicious repo. Although automatic execution is not possible a naive user could easily load the malicious content.
Based on that I'd be reluctant to automatically push it out to every JupyterHub.
I think @manics has expressed his opinion here. Please, correct me if your opinion changed.
After considering https://github.com/jupyterhub/nbgitpuller/issues/330 a bit more I'm even more against it.
A potential mitigation (in addition to the suggestions in https://github.com/jupyterhub/nbgitpuller/issues/330) is to install nbgitpuller but make it's activation conditional on an environment variable. That way anyone running a JupyterHub can include something like NBGITPULLER_ENABLED=1 in their DockerSpawner/KubeSpawner to enable it for all their users.
Another option- repositories cloned by nbgitpuller most likely need other libraries than what's available in the base/minimal images. The https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#jupyter-datascience-notebook image incorporates a lot of libraries, so a datascience-nbgitpuller image is an option, but it means maintaining yet another image.
And one more thought (prob for a different issue?), what can we do to make it easier for people to publish their own images with extensions? For example, could we have a template repository with a GitHub workflow or action where the workflow/action takes the template parameters:
- image-name
- docker-stacks image to use as base
- list of additional conda packages to install
- optionally list of other custom scripts/config
The workflow could be written to push to ghcr.io/${{ github.repository }} (not sure of the exact var names) which means minimal editing is required when someone uses the template.
Another option- repositories cloned by nbgitpuller most likely need other libraries than what's available in the base/minimal images. The https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#jupyter-datascience-notebook image incorporates a lot of libraries, so a
datascience-nbgitpullerimage is an option, but it means maintaining yet another image.
I definitely don't want to add another image just for nbgitpuller, in my opinion it's a bit too much for this package and it doesn't just increase the maintenance, but from user perspective they will have to spend more time choosing an image in this case.
And one more thought (prob for a different issue?), what can we do to make it easier for people to publish their own images with extensions? For example, could we have a template repository with a GitHub workflow or action where the workflow/action takes the template parameters:
- image-name
- docker-stacks image to use as base
- list of additional conda packages to install
- optionally list of other custom scripts/config
The workflow could be written to push to
ghcr.io/${{ github.repository }}(not sure of the exact var names) which means minimal editing is required when someone uses the template.
We actually already have this (in a bit different implementations though). We have a cookiecutter template repo and docs how to use it.
I also made an attempt to not to use cookiecutter and just create a simple example repo (which can be forked and changed), but it didn't get enough attention, so I didn't push it. https://github.com/mathbunnyru/custom-notebook
+1 for not adding another image.
My drive comes from these two facts:
- Literally every single JupyterHub used for education that I have encountered uses and requires nbgitpuller
- I'd love for people to just use jupyter/docker-stacks unmodified for those use cases, as I believe it currently satisfies 95% of the use case, and nbgitpuller is part of the last 4%.
That said, I agree it's important that people not get docker images that are insecure because of features they don't use at all, and non-educational cases often don't have any need for nbgitpuller! So how about we implement the following changes in nbgitpuller:
- [ ] Default to not allowing creation of repos starting with a
.https://github.com/jupyterhub/nbgitpuller/issues/330 - [ ] Disallow setting
targetPathas part of the URL https://github.com/jupyterhub/nbgitpuller/issues/330#issuecomment-1783594862
With these, nbgitpuller links would only create directories (it already fails if the directory is not empty) based on the name of the repo being cloned.
If that sounds ok, I'm happy to work on implementing these in nbgitpuller.
I started https://github.com/jupyterhub/nbgitpuller/pull/332 as a draft to show how we could disable targetPath by default.
Thanks @yuvipanda. I am not as optimistic about 95% without external packages as you are.
Still, I see your point about education and JupyterHub + these docker images are an essential part of many educational programs. And I'm pretty sure you're right about the use of nbgitpuller there.
That being said, if you fix 2 action points above, I think I would be ok with merging nbgitpuller to our images.
It would be also great if you considered the comment above:
A potential mitigation (in addition to the suggestions in https://github.com/jupyterhub/nbgitpuller/issues/330) is to install nbgitpuller but make it's activation conditional on an environment variable. That way anyone running a JupyterHub can include something like NBGITPULLER_ENABLED=1 in their DockerSpawner/KubeSpawner to enable it for all their users.
I have no idea if it's easy, fits Jupyter Hub well, or something like that, but I think it's a nice option to have, and this way we'll be a bit more secure. But for me, 2 action points above are a must, and this one is nice to have.
Marked this as a draft, since we won't be merging this until nbgitpuller issues are fixed.