repo2docker icon indicating copy to clipboard operation
repo2docker copied to clipboard

Ignore missing submodules?

Open mdeff opened this issue 4 years ago • 5 comments

Proposed change

Don't fail the build if a submodule is missing.

Example of failure (used to work, broke once a student removed their personal repo): https://mybinder.org/v2/gh/mdeff/ntds_2018/outputs?urlpath=lab.

Alternative options

Update the repository to remove the now missing submodules. But that has a maintenance cost and breaks the intent of preserving the original state of a repository for reproducibility.

Who would use this feature?

People who freeze (archive) repositories for the sake of reproducibility. An old repo might depend on submodules that are not available anymore. This shouldn't completely prevent people from building a container and running the code.

Downside: this is kind of allowing a build with missing dependencies. The problem is however more severe as github repositories are deleted more often than pypi or conda packages. I would actually even proceed with missing pypi or conda packages after emitting a warning (which should ideally be made more visible than in the build log).

(Out-of-topic, but a way to be notified of binder build failures would be great. Checking manually that it still works is sub-optimal.)

How much effort will adding it take?

Easy. Check the return value of git submodule update --init --recursive, emit a warning if non-zero, and move on with the build.

Who can do this work?

Anybody with a shallow understanding of the codebase.

mdeff avatar May 28 '20 16:05 mdeff

Downside: this is kind of allowing a build with missing dependencies.

Seems like we already allow that for unsupported python versions (see #552).

A consistent policy would be great. My preference is to emit warnings (on unsupported runtime, missing dependency or submodule etc.) rather than fail, but make the warnings very visible (what about a banner in the launched jupyterlab?).

mdeff avatar Jun 15 '20 17:06 mdeff

Unfortunately, with the way Binder works, at least, warnings are essentially invisible since they only occur at build-time and build only occurs once, not for each launch. Only local repo2docker users really get a benefit from warnings.

It's hard to say whether a submodule is 'optional' or not. I can only recall using repos (including ipython/ipython for quite some time) where a failure to clone a submodule should be treated as a failure, since it's a fundamental component of the origin repo, which doesn't work at all without it. Based on my biased experience, I'm inclined to keep the error on failure to clone recursively, but it sounds like your experience is different. Do you have some real-world example repos where partial builds are more desirable than an error indicating the repo cannot be fully cloned as-is?

this is kind of allowing a build with missing dependencies.

In general, we try to fail on missing dependencies (requirements.txt, etc.). As discussed in the linked #552, the issue appears to be that we should be raising on an explicitly requested but unsupported Python and it's a bug that we don't, so failure on missing submodules seems like the most consistent behavior here.

minrk avatar Aug 10 '20 11:08 minrk

Unfortunately, with the way Binder works, at least, warnings are essentially invisible since they only occur at build-time and build only occurs once, not for each launch. Only local repo2docker users really get a benefit from warnings.

What about storing the build logs in the docker image, and showing the warnings at launch-time?

It's hard to say whether a submodule is 'optional' or not.

Agreed. Similarly, it's hard to say whether a python version or package (pypi, conda) is necessary or optional. As letting users specify optional deps seems overkill, I propose to warn instead of fail on missing deps (be it python, packages, or submodules).

Do you have some real-world example repos where partial builds are more desirable than an error indicating the repo cannot be fully cloned as-is?

  • Teaching repos, where some deps (packages or sub-modules) are only needed for parts of the course.
  • Optional deps for analysis pipelines (e.g., only need cartopy / matplotlib if you'll plot the results). Basically the equivalent of optional deps in setup.py, while we can't specify optional deps in requirements.txt, environments.yml, or .gitmodules.

most consistent behavior here

Yep that sounds like the most important. I'd prefer to warn on missing deps, but if the project decides to error, it should do so for every kinds of deps.

mdeff avatar Aug 13 '20 14:08 mdeff

Perhaps another more clear way would be to allow repo2docker to not initialise certain submodules. I.e. a configuration file to determine which sub-modules should be initialised, and optionally whether that initialisation would be allowed to fail, say in yaml:

submodules:
  submodules/a:
    - error: pass|error|warn

I have a repo where I don't want to download the submodule.

zerothi avatar Sep 09 '20 06:09 zerothi

One more data point in favor of being less strict about "missing" submodules. DataLad (http://handbook.datalad.org/) uses the submodule mechanism to specific subdataset component/dependencies. Scientific datasets, e.g. https://github.com/psychoinformatics-de/studyforrest-data use this to link all components in a single toplevel repository (that is the most useful entrypoint for demos). However, not all dataset components can have the same level of access (think personal data in a neuroimaging study), hence some dataset components will be inaccessible to a public binder instance. However, they are not missing or invalid either.

mih avatar May 04 '21 08:05 mih