[WIP] Add saving image to tarball and a buildpack to load it

nuest opened this issue 4 years ago • 12 comments

This adds two features:

  • a --save-image option that saves the image to a file image.tar in the binder directory
  • a TarballBuildPack that will load and run that image if a file image.tar is found

This can be useful when a specific workflow should be preserved exactly. It currently prints a warning if the r2d version used to create the image does not match the one used to load it, but with #550 it could also switch to that version (pending changes discussed in #490).
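For reference, a minimal sketch of how that version check can be reproduced by hand, assuming the version is stored in an image label named repo2docker.version (the exact label name used by this PR may differ):

docker image inspect r2d-2e1567783481 --format '{{ index .Config.Labels "repo2docker.version" }}'

If the printed value differs from the output of repo2docker --version, the TarballBuildPack prints the warning shown in the logs below.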

I have not tested this with --subdir yet.

Try out locally with tarball from Zenodo:

repo2docker https://sandbox.zenodo.org/record/367144

Here is an example interaction.

(binderhubsprint) daniel@nuest:~/git/elife-sprint/repo2docker/tests/conda/binder-dir$ repo2docker --save-image --no-run .
Picked Local content provider.
Using local repo ..
Using CondaBuildPack builder
Step 1/51 : FROM buildpack-deps:bionic
 ---> 536a38f87e4b
Step 2/51 : ENV DEBIAN_FRONTEND=noninteractive
 ---> Using cache
 ---> d520bc3e9203
[...]
Step 51/51 : CMD ["jupyter", "notebook", "--ip", "0.0.0.0"]
 ---> Running in a3ffe7bb8bf8
Removing intermediate container a3ffe7bb8bf8
 ---> 5311e39c82be
{"aux": {"ID": "sha256:5311e39c82be10a6a0a0940c3e08fc904d66fc53f0b3e7d5df2163181668cc10"}}Successfully built 5311e39c82be
Successfully tagged r2d-2e1567783481:latest
Saving image to file binder/image.tar
Successfully saved image
(binderhubsprint) daniel@nuest:~/git/elife-sprint/repo2docker/tests/conda/binder-dir$ tree .
.
├── binder
│   ├── environment.yml
│   └── image.tar
├── Dockerfile
├── environment.yml
└── verify

1 directory, 5 files
(binderhubsprint) daniel@nuest:~/git/elife-sprint/repo2docker/tests/conda/binder-dir$ repo2docker .
Picked Local content provider.
Using local repo ..
Using TarballBuildPack builder
[I 15:33:33.679 NotebookApp] Writing notebook server cookie secret to /home/daniel/.local/share/jupyter/runtime/notebook_cookie_secret
[I 15:33:33.973 NotebookApp] JupyterLab extension loaded from /srv/conda/envs/notebook/lib/python3.5/site-packages/jupyterlab
[I 15:33:33.973 NotebookApp] JupyterLab application directory is /srv/conda/envs/notebook/share/jupyter/lab
[I 15:33:33.979 NotebookApp] nteract extension loaded from /srv/conda/envs/notebook/lib/python3.5/site-packages/nteract_on_jupyter
[I 15:33:33.981 NotebookApp] Serving notebooks from local directory: /home/daniel
[I 15:33:33.981 NotebookApp] The Jupyter Notebook is running at:
[I 15:33:33.981 NotebookApp] http://127.0.0.1:33661/?token=016d6f08e11e5e22f896d7ae48401726b650770fa4954079
[I 15:33:33.981 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 15:33:33.981 NotebookApp] No web browser found: could not locate runnable browser.
[C 15:33:33.982 NotebookApp] 
    
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://127.0.0.1:33661/?token=016d6f08e11e5e22f896d7ae48401726b650770fa4954079

nuest avatar Sep 06 '19 16:09 nuest

I'm curious, what made you prefer this workflow over pushing to a docker registry?

manics avatar Sep 06 '19 17:09 manics

@manics A tarball in a scientific data repository, alongside the data that was analysed, is (hopefully) available longer term; it is also more likely to be accepted in a scholarly context than relying on a scholarly publisher to run a container registry long term.

Does that make sense to you?

nuest avatar Sep 06 '19 20:09 nuest

As a further illustration, here is the log I just got from running the example from the Zenodo Sandbox after merging master:

$ repo2docker https://sandbox.zenodo.org/record/367144
Picked Zenodo content provider.
Fetching Zenodo record 367144.
Fetching image.tar
Using TarballBuildPack builder
repo2docker version missmatch: image label has '0.10.0+14.gb20eb6a.dirty' but running '0.10.0+55.g371b925'
[I 14:03:07.617 NotebookApp] Writing notebook server cookie secret to /home/daniel/.local/share/jupyter/runtime/notebook_cookie_secret
[I 14:03:07.900 NotebookApp] JupyterLab extension loaded from /srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterlab
[I 14:03:07.900 NotebookApp] JupyterLab application directory is /srv/conda/envs/notebook/share/jupyter/lab
[I 14:03:07.905 NotebookApp] nteract extension loaded from /srv/conda/envs/notebook/lib/python3.7/site-packages/nteract_on_jupyter
[I 14:03:07.906 NotebookApp] Serving notebooks from local directory: /home/daniel
[I 14:03:07.906 NotebookApp] The Jupyter Notebook is running at:
[I 14:03:07.906 NotebookApp] http://127.0.0.1:54013/?token=bc2ae6e7fd87a9447cb0eba74e1eee44111042179788d767
[I 14:03:07.906 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 14:03:07.911 NotebookApp] No web browser found: could not locate runnable browser.
[C 14:03:07.911 NotebookApp] 
    
    To access the notebook, open this file in a browser:
        file:///home/daniel/.local/share/jupyter/runtime/nbserver-1-open.html
    Or copy and paste one of these URLs:
        http://127.0.0.1:54013/?token=bc2ae6e7fd87a9447cb0eba74e1eee44111042179788d767

Note the "version mismatch" log in line 5, which #490 could .. solve.

nuest avatar Sep 11 '19 14:09 nuest

@nuest I see what you're getting at. Would using an established Docker registry such as Docker Hub or quay.io work? One issue with tar-files is you still need to publish them, and ideally make them discoverable which means adding metadata to wherever they're hosted.

manics avatar Sep 11 '19 14:09 manics

@manics I admit I haven't considered container registries so far. AFAIK it's not easy to download a tarball from a registry without a docker client (see https://devops.stackexchange.com/questions/2731/downloading-docker-images-from-docker-hub-without-using-docker), so a ContainerRegistryBuildPack would make more sense to me in that case.
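(As a sketch of that direction: a tool like skopeo can copy an image from a registry into a docker-save-compatible tarball without a Docker daemon; the image reference here is just a placeholder:

skopeo copy docker://quay.io/someorg/someimage:latest docker-archive:image.tar

This is not part of this PR, just an illustration of what a ContainerRegistryBuildPack could build on.)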

I'd like to cover the case where users of BinderHub intentionally create a snapshot of their Binder and publish it in a data repository, with metadata to enable discovery.

nuest avatar Sep 11 '19 15:09 nuest

@betatim Any idea why the tests might fail on Travis but not locally?

Also note that the tests do fail on Azure, but the job says "successful": https://dev.azure.com/jupyter/repo2docker/_build/results?buildId=25&view=logs&jobId=7ff9283a-ab30-5e9f-8967-b9fdc546360c

nuest avatar Sep 11 '19 15:09 nuest

@betatim @choldgraf It's been a while since I worked on this, but I think this would be quite a useful feature and really powerful for scientific use cases in combination/variation with #550.

I'm happy to rebase this on the current development version and get tests working, if you agree this is a useful addition.

nuest avatar May 29 '20 07:05 nuest

It sounds to me like it would be a useful addition for archiving purposes. Correct me if I'm wrong, but I find that archivists and online services that do archiving usually want "the whole bundle" to be a part of the archive if at all possible, rather than just pointers to some other service (like a container registry).

I think my concern is that it feels a little bit counter to one of the guiding principles of Binder / repo2docker, which is that people shouldn't have to know anything about Docker in order to make their work reproducible. I can see why this would be useful for sure, but I wonder when it would be preferred over, say, telling an author that they need to do a better job of pinning their versions in their dependency files. Do you have thoughts there @nuest ?

choldgraf avatar May 29 '20 15:05 choldgraf

@choldgraf if it's not suitable for the mainline, could this be done as a plugin/extension that overrides push_image to save to a file instead?

manics avatar May 29 '20 20:05 manics

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/dataverse-community-meeting-short-talk/4723/8

meeseeksmachine avatar Jun 09 '20 14:06 meeseeksmachine

Apologies for not reacting sooner, @choldgraf! This issue has just come up in my mind again because of the recent announcement by Docker to delete images that are not regularly pulled. I never thought that Docker Hub was a good place for scientific environments, but some might have.

Anyway, to your questions:

  • I agree with you that Binder users should not need to know anything about the underlying containerisation technology. I think a tarball buildpack would only be one half; the other half would have to be an easy way to get the tarball, which I have only briefly thought about. Ideally, users could just click "export container archive" in the Jupyter UI on MyBinder.org :-).
  • Telling the author to pin the dependencies better would require them to have full knowledge of all dependencies. This is not trivial, especially when language packages (R, Python) require system packages, and even less so if Binder is to be broadly usable by scientists.
  • To me, the image would be another layer to fall back on when worrying about reproducibility. [Excursion: I'd like to have an "archive Binder to Zenodo" button that stores not only the repo's files and the image tarball, but also the Dockerfile - just in case. That won't work, though, until Zenodo supports some mechanism for a platform to act on a user's behalf.]

@manics I haven't followed r2d closely over the last year - could you point me to an example of such a plugin mechanism?

nuest avatar Aug 27 '20 07:08 nuest

@nuest It's still work in progress: https://github.com/jupyter/repo2docker/pull/848. My thinking was that you could override the default Docker engine to "push" to a tarfile instead of a registry. It wouldn't help with the loading side, though.

Is there a specification for the exported tarfile and metadata? I believe Podman can import tar files from Docker, but it would be nice if there was something more than "whatever Docker exports". I had a quick look through the OCI website but couldn't find anything, though I didn't look too hard.
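(For what it's worth, the closest match seems to be the OCI Image Layout from the OCI image spec; skopeo, for example, can convert a docker save archive into that format - a sketch, not something this PR does:

skopeo copy docker-archive:image.tar oci-archive:image-oci.tar
)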

manics avatar Aug 28 '20 13:08 manics

I'm triaging the PRs in this repo. It has been two years since this PR received any activity, and we are low on maintenance capacity, so I'm leaning towards conservative decisions to ensure we can manage maintenance going forward.

Due to this, I'll close this PR. It can absolutely be reopened - this is not a final call!

> It's been a while since I worked on this, but I think this would be quite a useful feature and really powerful for scientific use cases in combination/variation with https://github.com/jupyterhub/repo2docker/pull/550.

It seems like a powerful feature, but it also adds quite a bit of complexity that could be offloaded so repo2docker doesn't have to maintain it. A container registry that the built image is pushed to, and from which it is run straight via docker or similar, would be one option. Alternatively, without a container registry: run repo2docker --no-push ... first, then use docker save to export the image to a .tar file, and later docker load to get it back into the local docker daemon after retrieving the file from wherever it was kept.
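A sketch of that manual workflow, with a placeholder image name:

repo2docker --no-run --no-push --image-name myanalysis:1.0 .
docker save myanalysis:1.0 -o image.tar
# archive image.tar anywhere, retrieve it later, then:
docker load -i image.tar
docker run -p 8888:8888 myanalysis:1.0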

consideRatio avatar Oct 30 '22 23:10 consideRatio