Ability to get new containers while having all prior or used containers frozen to specific version
Currently it would not be possible to update from the original containers dataset with the purpose of only getting new containers, while keeping current ones (possibly already used in some analysis) at their current version -- "merging" of `.datalad/config` with the remote version would update all `image` configs to the new version.
Possible ways:
- provide some `scripts/freeze_containers` script which would make a duplicate section for the container after its original section in `.datalad/config`, e.g.:

  ```
  [datalad "containers.bids-validator"]
      updateurl = shub://ReproNim/containers:bids-validator--1.2.3
      image = images/bids/bids-validator--1.2.3.sing
      cmdexec = {img_dspath}/scripts/singularity_cmd run {img} {cmd}
  ...
  ### FROZEN CONTAINERS
  [datalad "containers.bids-validator"]
      image = images/bids/bids-validator--1.2.3.sing
  ```

  so whenever a new version is to be merged, a conflict would most likely occur at the end of the file, but at least it would be easy to troubleshoot, and the original "full" record would get its new `image` entry without affecting the effective value of `image` for the container
- enhancement to the above: we can prepopulate that trailing section within this dataset:

  ```
  ### FROZEN CONTAINERS
  [datalad "containers.bids-validator"]
  # end of datalad "containers.bids-validator"
  [datalad "containers.bids-fmriprep"]
  # end of datalad "containers.bids-fmriprep"
  ...
  ### END OF FROZEN CONTAINERS
  ```

  and make sure that for every container we add (which is to stay above `### FROZEN CONTAINERS`) we also add such a blank section. Then merges should proceed fine, and users would be able to freeze the containers they need. So the only things needed within this repo are to make sure that new container entries are added correctly, and to provide that script, which would also need to understand this format to add new `image` entries.
- provide some `scripts/freeze_containers` script which would adjust `image` entries within `.datalad/config` for specified/all containers, so it would cause a conflict upon merge and require conscious conflict resolution (or just `git merge -s ours`, but I am afraid the trailing hunk could then swallow newly added container configs) to decide whether or not to upgrade a specific image version to the new one. There could even be some custom merge helper to perform the merge by simply adopting only new sections of the config.
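As a rough sketch of the first option above, such a `scripts/freeze_containers` helper could use `git config --file` to read the current value and append a frozen duplicate (the function name is illustrative, and a real script would need to insert into the marker section rather than blindly appending at the end):

```shell
# Illustrative sketch only: duplicate a container's image entry below a
# "### FROZEN CONTAINERS" marker.  git config returns the LAST occurrence
# of a key, so the frozen value wins on read.
freeze_container () {
    cfg=$1; name=$2
    # look up the currently configured image for this container
    img=$(git config --file "$cfg" --get "datalad.containers.$name.image")
    {
        printf '### FROZEN CONTAINERS\n'
        printf '[datalad "containers.%s"]\n' "$name"
        printf '\timage = %s\n' "$img"
    } >>"$cfg"
}

# usage with a throwaway config file standing in for .datalad/config
cfg=$(mktemp)
printf '[datalad "containers.demo"]\n\timage = images/demo--1.0.sing\n' >"$cfg"
freeze_container "$cfg" demo
git config --file "$cfg" --get datalad.containers.demo.image
```

After this, a merge that changes the original `image` line would leave the frozen trailing entry untouched, which is exactly the conflict-surfacing behavior described above.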
Any other ways @kyleam @mih @bpoldrack which might come to your mind?
I was not sure whether this would be anything to tackle at the datalad-containers level, since it is more relevant to such a "datalad containers" distribution, so I decided to file it here first.
Hm. To me this looks somewhat dirty at first glance. After all, we have a version-controlled dataset and are building some kind of hackish "version control" on top of it here. Any old version would still have a reference in earlier commits - technically it's not lost on such an update. We should take advantage of what's there, I think. Now, it depends on what you mean by "still needed". References to an old version should simply reference the commit instead of just a path, and thereby you'd still have all you need.
If it is about having two versions available in the worktree, then I think you should just use two container (sub-)datasets. You could have two subdatasets where one has the newer version available in HEAD and the other one is not updated with respect to that image. Referencing names for containers in subdatasets would distinguish both. I think that's the way to go if you need multiple versions.
Clarification: Of course, it doesn't need subdatasets. You can also simply reference two images within a single dataset.
But maybe I misunderstood your aim. If it's just about not updating them (not keeping two versions, but only the old one), then `update` should be used with path arguments to specify what to update and what not. If it's about some kind of configuration/automation of exactly that, then I don't see why containers would be special in any way. This should be a configuration for `update` itself, like you can configure `git push`, for example, by specifying patterns of what to push where. It should look similar for `update`, in my opinion.
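For comparison, the `git push` configuration mentioned here is declared as refspec patterns under `remote.<name>.push`; a selective-`update` configuration could plausibly take a similar declarative shape (a throwaway config file is used here purely for illustration):

```shell
# git push selectivity lives in config as refspecs: with this setting,
# "git push origin" would only push main, regardless of other branches.
# A throwaway config file stands in for a real repository's .git/config.
cfg=$(mktemp)
git config --file "$cfg" remote.origin.push 'refs/heads/main:refs/heads/main'
git config --file "$cfg" --get remote.origin.push
```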
In case that was just another misinterpretation of what you want to achieve, I need further explanation of the goal ;-)
I think this situation is no different from a general need to have multiple versions of a file simultaneously accessible. The low-tech solution is to encode the version into the file name. If there is a container dataset that aims to provide multiple versions simultaneously, I see no reason not to use this approach.
If it is just about a dataset that used containers from a specific commit of a container dataset, this information is already encoded in a past commit. I do not see why this has to be maintained in the worktree. Or how it could be maintained in the worktree, as any update/merge/whatever needs to look into any such file and make sure that previous content gets preserved.
I'd stay clear of file content manipulation and add additional (config) files, if needed.
Thank you @bpoldrack and @mih -- I will digest it better and reply in greater detail.
Re maintaining multiple versions in the same tree -- although YODA should (eventually) rule the world, I see this "distribution" dataset as also valuable for folks with centralized deployment, where they could reuse these containers as-is even if they don't embed them into their datasets (since they might not use datasets yet). Also, having these multiple versions allows for the use case I am targeting with this issue - the ability to gain access to newer versions of the containers while still easily being able to use previous ones (e.g. for consistency of operation in an ongoing study). It is like using Debian unstable but not upgrading all packages at once, only selected ones, "until ready" for the full upgrade. So the situation here is a bit different from a "dataset versioning" case where you do want to upgrade the entire dataset. Here you might want to "upgrade" in order to execute only some new containers.
[ I'm still not sure how I feel about this repository's approach of storing each version in the working tree, but assuming that setup and focusing on the possible approaches for selective freezing of a subset of containers... ]
Talking to you in person today, I suggested using git config's `include.path` to point to a separate config file for the frozen overrides, but then you pointed out that `git config --file ...` ignores `include` statements by default (and thus `dataset.config` does, because it passes `.datalad/config` as the file). We also mentioned using the untracked `.git/config`, which would work in terms of the initial run and provenance information but wouldn't be ideal in the sense that the frozen config wouldn't persist across clones. It just occurred to me that a hybrid setup might work well:
- have a tracked config file under `.datalad/` that contains the overrides for the frozen containers
- enable frozen containers in a local repository with an `include.path` in `.git/config` that points to the tracked config file
That would allow you to manage the custom config in a tracked file without worrying about conflicts. And testing this out quickly, DataLad's config system and datalad-containers seem fine with it.
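Concretely, that hybrid setup might look like the following sketch (the file name `frozen.cfg` is made up here; note that a relative `include.path` in `.git/config` is resolved relative to `.git/config` itself, hence the leading `../`):

```shell
# Sketch of the hybrid freeze setup in a fresh throwaway repository.
repo=$(mktemp -d)
cd "$repo"
git init -q

# 1. tracked config file holding the frozen overrides
mkdir -p .datalad
printf '[datalad "containers.bids-validator"]\n\timage = images/bids/bids-validator--1.2.3.sing\n' \
    >.datalad/frozen.cfg
git add .datalad/frozen.cfg

# 2. local (untracked, clone-specific) activation of the overrides;
#    a relative include.path is resolved against .git/config
git config --local include.path ../.datalad/frozen.cfg

# reading repository config (without --file) now sees the frozen value
git config --get datalad.containers.bids-validator.image
```

Clones would receive `frozen.cfg` but stay unfrozen until they opt in with the one local `include.path` setting, which matches the "persist across clones, activate locally" split described above.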