[WIP] Add support for repo2docker.version
WIP for #490.
Problem
As discussed in #490 and #170, it currently isn't possible for users to pin their repository to a specific version of repo2docker. For example, the move from v0.5 to v0.6 included a change of the base image from Ubuntu artful (17.10) to Ubuntu bionic (18.04), which might cause unexpected problems for some users. The "reproducibility-minded" user may want to have some guarantee that their repo will continue to work when repo2docker is upgraded.
Implementation
This PR introduces two new methods in `Repo2Docker`, both called during `build()`:
- `get_r2d_version`: After fetching the repo, reads the `repo2docker.version` file and performs a sanity check that an image with the specified version (tag) exists on Docker Hub.
- `run_r2d_version`: Runs `repo2docker` for the specified version via Docker to build the image, with some special handling for local repos.
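For illustration, a minimal sketch of what `get_r2d_version` could look like; the function body, return convention, and use of the Docker Hub tags API are my assumptions, not necessarily the PR's actual code:

```python
# Sketch of the version lookup described above; the exact code in the PR
# may differ. The Docker Hub tags endpoint is a real API, but using it
# for the sanity check here is an assumption.
import os
import requests

def get_r2d_version(repo_dir):
    """Read repo2docker.version from a fetched repo and validate the tag."""
    version_file = os.path.join(repo_dir, "repo2docker.version")
    if not os.path.exists(version_file):
        return None  # no pin requested; use the installed version
    with open(version_file) as f:
        # the file holds a single version string, e.g. "0.5.0"
        version = f.read().strip()
    # Sanity check: does an image with this tag exist on Docker Hub?
    resp = requests.get(
        f"https://hub.docker.com/v2/repositories/jupyter/repo2docker/tags/{version}/"
    )
    if resp.status_code != 200:
        raise ValueError(f"No jupyter/repo2docker image tagged {version}")
    return version
```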
This PR also adds:
- Unit (integration?) tests for valid and invalid versions, local and remote repositories (`test_versions.py`)
- Basic documentation describing the `repo2docker.version` file.
Thanks for working on this, @craig-willis! Exciting progress.
My preliminary thoughts are:
- repo2docker itself should verify that it is compatible with the version specified, and if not, quit with an error (see the sketch after this list).
- We should have a helper script that can read the version and figure out the appropriate way to run the correct version of repo2docker.
- BinderHub should somehow find a way to respect this as well.
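For concreteness, the check in the first bullet might look something like the sketch below; the function name, message wording, and placement are my assumptions, only the idea comes from the bullet:

```python
# Sketch of the self-check from the first bullet above. The message text
# and placement are assumptions; only the idea comes from the discussion.
import sys
from repo2docker import __version__

def ensure_compatible_version(requested):
    """Quit with an error if the repo pins a different repo2docker version."""
    if requested is not None and requested != __version__:
        sys.exit(
            f"Error: Repo requires version {requested} but current "
            f"version is {__version__}, exiting."
        )
```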
I don't think we should embed version checking functionality into repo2docker itself.
Thanks for the very prompt feedback @Xarthisius and @yuvipanda.
@yuvipanda I may be misunderstanding your comment, but it sounds like you don't like the idea of running jupyter/repo2docker (via Docker) from within jupyter-repo2docker (not just the version check). Let me know if otherwise.
One of my initial inclinations was to try to do the check outside of the Python package in order to select the right version to run. For example, if I could assume that repo2docker is itself running via Docker, then it would be straightforward to `pip install` the correct version at runtime. However, it seemed like the preferred path (discussed in #490) was to allow this to happen directly from the Python package.
I don't think we should embed version checking functionality into repo2docker itself.
Do you mean that figuring out the version should not be part of the Python package that gets installed with `pip install jupyter-repo2docker`, not part of what you get when you `import repo2docker`, not part of the script called `repo2docker` (created via the entrypoint), or something else?
What I had in mind was to (for the moment?) add a new script called (say) `repo2docker-pinned` that fetches the repo, figures out the version, concocts a command like `docker run <dockersocketmountingstuff> jupyter/repo2docker:<tag> repo2docker ...`, and then executes that. (We could also merge these two scripts into one, but for discussion having different names seems helpful, as there are already many things called repo2docker.)
The `repo2docker-pinned` script should use the `Repo2Docker` class to fetch the repo and inspect the version (via a new method), and then `subprocess.call()` the `docker run` command it constructed.
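Roughly, the command construction could look like the following sketch; the `--rm` flag, the socket mount, and the argument passthrough are my assumptions standing in for the `<dockersocketmountingstuff>` placeholder:

```python
# Sketch of the command repo2docker-pinned might construct and execute;
# the flags shown are assumptions, not settled design.
import subprocess

def run_pinned(version, repo, extra_args=()):
    cmd = [
        "docker", "run", "--rm",
        # mount the host's Docker socket so the inner repo2docker can build
        "-v", "/var/run/docker.sock:/var/run/docker.sock",
        f"jupyter/repo2docker:{version}",
        "repo2docker", *extra_args, repo,
    ]
    return subprocess.call(cmd)
```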
Good idea that r2d should check on launch that it is the version requested by the repo.
Thanks, @betatim. It sounds like I missed the mark on this PR (it's been a great experience for me to engage with the codebase, regardless).
It sounds like this is the preferred behavior?
```console
$ pip install jupyter-repo2docker==0.7.0
$ repo2docker https://gitlab.com/org/repo-with-v0.5-specified
Error: Repo requires version 0.5.0 but current version is 0.7.0, exiting.
Please run repo2docker-pinned.

$ repo2docker-pinned https://gitlab.com/org/repo-with-v0.5-specified
Running repo2docker v0.5
...
Picked local content provider.
...
```

(works as expected)
The `repo2docker-pinned` CLI would handle the initial fetch and, to avoid fetching twice, run via the local content provider (modifying the path/repo passed to the subsequent call). It would need to accept the full set of command-line arguments, but would only actually care about the repo/path.
The `repo2docker-pinned` script should use the `Repo2Docker` class to fetch the repo and inspect the version (via a new method), and then `subprocess.call()` the `docker run` command it constructed.
Curious -- is there a benefit to using `subprocess.call()` instead of `docker.containers.run()`?
edit by betatim: added a missing $ in the code above
What I had in mind was to (for the moment?) add a new script called (say) `repo2docker-pinned` that fetches the repo, figures out the version, concocts a command like `docker run <dockersocketmountingstuff> jupyter/repo2docker:<tag> repo2docker ...`, and then executes that.
This sounds great to me, @betatim. Ideally, I'd like them to be two different packages to begin with, with `repo2docker-pinned` depending on the CLI interface only rather than the Python interface.
Thanks for taking the time to review this and for the additional feedback. I'll plan on closing this PR and trying again.
I think this PR is pretty much what I described. The reason I spelt it out was that, after asking Yuvi to be more precise, I thought I should be explicit about what I was thinking as well.
The `repo2docker-pinned` script isn't required in the final thing, but it might help us get going because we can defer whatever logic we need to detect that we are in some kind of infinite loop.
Curious -- is there a benefit to using `subprocess.call()` instead of `docker.containers.run()`?
I think the latter is a much better idea. I hadn't thought that far!
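For comparison, a docker-py version of the earlier `subprocess.call()` sketch might look like this; the image tag and volume mounts are illustrative assumptions, not code from the PR:

```python
# Hypothetical docker-py equivalent of the subprocess.call() approach.
import docker

def run_pinned(version, repo, extra_args=()):
    client = docker.from_env()
    # run the pinned repo2docker image and capture its build output
    logs = client.containers.run(
        f"jupyter/repo2docker:{version}",
        ["repo2docker", *extra_args, repo],
        volumes={
            "/var/run/docker.sock": {"bind": "/var/run/docker.sock", "mode": "rw"}
        },
        remove=True,  # clean up the helper container when done
    )
    print(logs.decode())
```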
I'm not sure we should make the optimisation of not fetching twice; I wonder what kind of bugs it might introduce from the content providers not being exactly the same. Should we start with fetching twice and see how bad it is performance-wise? What do you think?
Thanks, @betatim.
I assumed creating `repo2docker-pinned` as a separate package would need to be done in a separate GitHub repo. If so, I could re-purpose this PR for documentation and the version check only. Let me know if otherwise.
A point of confirmation -- if I'm not mistaken, the `repo2docker-pinned` package will need to replicate some of the fetch, subdir, and `binder_path` handling if I can't use the Python package directly.
Should we start with fetching twice and see how bad it is performance-wise? What do you think?
I'm wondering if I could use the CLI to do the fetch via `--no-build --no-run --no-clean`, possibly exposing the `git_workdir` as a CLI argument. This would at least ensure the same fetch logic is used in both places, as sketched below.
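Under those assumptions, the two-step flow might look like the following sketch; `--git-workdir` is the hypothetical flag proposed above, while `--no-build`, `--no-run`, and `--no-clean` come from the comment itself:

```python
# Sketch of fetching once via the CLI, then building with the pinned
# version against the local checkout. --git-workdir is hypothetical.
import subprocess

workdir = "/tmp/checkout"
repo = "https://gitlab.com/org/repo-with-v0.5-specified"

# Step 1: fetch only; keep the checkout around for the next step
subprocess.check_call([
    "repo2docker", "--no-build", "--no-run", "--no-clean",
    "--git-workdir", workdir, repo,
])

# Step 2: run the pinned repo2docker against the local checkout
subprocess.check_call([
    "docker", "run", "--rm",
    "-v", "/var/run/docker.sock:/var/run/docker.sock",
    "-v", f"{workdir}:/src",
    "jupyter/repo2docker:0.5.0",
    "repo2docker", "/src",
])
```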
@craig-willis I'd like to contribute here, if I can, because it nicely combines with #778
@minrk @betatim Can you add your current thoughts on this? IIRC we said that cloning twice would be fine for now. The aspect of a separate script did not come up though.
@nuest Sorry to admit that I've dropped the ball on this. In the Whole Tale system, we addressed version pinning a different way and I never completed what was discussed in this PR. If you'd like to take this work over, this PR can be closed.
(Did you mean to link to a different PR, #557 seems out of place?)
This is a very old PR at this point, but I've grown wary of the idea of letting repo2docker start up another version of itself, for security reasons, and I also think it's a very complicated feature that would be hard to maintain.
Since its last activity was in 2019, I'll go for a close here. We can absolutely reconsider that decision!