repo2docker

[Discussion] reproducibility

Open minrk opened this issue 7 years ago • 5 comments

Discussion issue for general topics of reproducibility and what's in and out of scope for repo2docker (and Binder).

We currently have a tension between our scientific goal of reproducibility and the maintenance goal of keeping everything up to date. We have the same issue that everyone who pursues reproducibility has, which is specifying the environment as strictly as necessary (so it's correct), but no stricter (so it stays useful). Conservative approaches are to use overly-specified environments (e.g. pip freeze / conda env export), which we should make sure to support well and document for the more reproducibility-minded users.

A user who wants to ensure a truly reproducible build must:

  • use a pip freeze or conda env export-produced environment specification
  • pin the Python version (needed for pip; conda env export already records it)
  • pin the distro/base image
  • probably pin repo2docker itself (easy for manual use cases, not available on Binder)
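As a rough illustration of the first bullet, here is a minimal sketch that flags requirement lines not pinned with `==`, which is what `pip freeze` output normally looks like. The function name `unpinned_requirements` and the regular expression are assumptions made for this example; they are not part of repo2docker or pip, and the check ignores URL and editable requirements.

```python
import re

# Matches "name==version" with optional extras, e.g. "requests[security]==2.18.4".
# Anything else (no specifier, ">=", "~=", ...) is treated as unpinned.
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s;]+")


def unpinned_requirements(text):
    """Return the requirement lines in `text` that are not pinned with '=='."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


example = """\
numpy==1.13.3
pandas>=0.20
# a comment
matplotlib
scipy==1.0.0
"""
print(unpinned_requirements(example))  # ['pandas>=0.20', 'matplotlib']
```

An empty result only says the library layer is pinned; the Python version, base image, and repo2docker version from the other bullets still need their own pins.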

Right now, the only truly reproducible builds available on Binder are custom Dockerfiles, which is something I want fewer people to use, not more. But we currently have no answer for reproducibility with any other builders, as there is no way for users to be sufficiently strict about the environment.

minrk, Dec 18 '17 15:12

I think this is a super important topic, especially when it comes to the publishing world. This is related to #93, though that's a more specific topic.

choldgraf, Dec 18 '17 22:12

I really like the idea of pinning repo2docker versions, which seems like the easiest (and maybe only?) solution to this problem. If we can guarantee that a properly prepared repo will always produce the same Dockerfile (rather than image, since we cannot guarantee that) for any given version of repo2docker, I think that's good enough, no?

We might have to write version shims to maintain binderhub <-> repo2docker compatibility, but that doesn't seem too difficult. We could also switch from passing in command-line arguments to using something more complex and versionable if we want.
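One way to test the "same Dockerfile for a given repo2docker version" guarantee would be to compare digests of the generated Dockerfile text across runs. The sketch below assumes the Dockerfile text has already been obtained somehow; `dockerfile_digest` and `same_dockerfile` are hypothetical helper names, not repo2docker functions, and the only normalization applied is line endings.

```python
import hashlib


def dockerfile_digest(text):
    """Return a SHA-256 hex digest of Dockerfile text, ignoring CRLF vs LF."""
    normalized = text.replace("\r\n", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def same_dockerfile(first, second):
    """True if two generated Dockerfiles are byte-identical after normalization."""
    return dockerfile_digest(first) == dockerfile_digest(second)


# Two hypothetical renders of the same repo with the same repo2docker version.
run_a = "FROM ubuntu:17.04\nRUN pip install numpy==1.13.3\n"
run_b = "FROM ubuntu:17.04\r\nRUN pip install numpy==1.13.3\r\n"
print(same_dockerfile(run_a, run_b))  # True
```

A matching digest says nothing about the resulting image, which is the distinction drawn above: the Dockerfile can be identical while the packages it pulls in change underneath it.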

yuvipanda, Jan 10 '18 02:01

Thinking more on this, there are three things we should try to allow users to pin:

  1. Versions of languages (Python, R, Julia, etc)
  2. Versions of libraries for the language installed by the language specific package manager (conda, pip, whatever R uses, etc)
  3. Versions of packages installed by the system package manager (apt)

We could / should use runtime.txt for (1), recommend pinning for (2), and make apt.yaml for (3). That's a good start I think, and gives us lots of low-hanging fruit to work with...
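The three layers can be sketched as a simple coverage check over the files in a repo. This is only an illustration under assumptions: `apt.yaml` is just the name proposed above and does not exist as a supported file, `requirements.txt` and `environment.yml` stand in for "whatever the language package manager uses", and the function names are made up for this example. It checks only that a file for each layer is present, not that its contents are actually pinned.

```python
# Candidate files per layer; "apt.yaml" is only the name proposed in this thread.
LAYERS = {
    "language": ["runtime.txt"],
    "libraries": ["requirements.txt", "environment.yml"],
    "system": ["apt.yaml"],
}


def pinning_coverage(filenames):
    """Map each layer to the candidate files that are present in the repo."""
    present = set(filenames)
    return {
        layer: [name for name in candidates if name in present]
        for layer, candidates in LAYERS.items()
    }


def missing_layers(filenames):
    """Return the layers for which the repo has no pinning file at all."""
    return [layer for layer, found in pinning_coverage(filenames).items() if not found]


print(missing_layers(["requirements.txt", "README.md"]))  # ['language', 'system']
```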

yuvipanda, Jan 10 '18 11:01

More thoughts on reproducibility: should we freeze conda build numbers as well, or not?

betatim, Jan 12 '18 08:01

Definitely an important discussion, but probably something we'll need to engage with the community on at https://discourse.jupyter.org/, especially if at some point we need to make major upgrades to R2D (e.g. the base image?)

manics, May 17 '21 21:05