Pipeline download: purpose of cache-utilization parameters after refactoring
Note: This was originally commented on the main downloads functionality pull request https://github.com/nf-core/tools/pull/3634
To comment on https://github.com/nf-core/tools/pull/3634#discussion_r2195332147 and https://github.com/nf-core/tools/pull/3634#discussion_r2223736610, here are my observations from testing combinations of cache/library parameters and options.
- The proposed implementation/purpose of `-u copy` is to create a standalone copy of the pipeline. Thus, its local `singularity-images/` directory must be complete, and `nextflow.config` must be modified to point at it. The pipeline can then run without a cache/library directory externally set. Rightly, the command takes images from the pre-existing cache or library if possible, rather than downloading absolutely everything. The current implementation also copies images that are downloaded to the cache directory. This has no runtime effect but honours the purpose of the cache/library: after the command is run, they together hold all images as well, thus helping future / other invocations of Nextflow. Note that the cache may not hold all images itself, as some may rather be in the library only.
- The proposed implementation/purpose of `-u amend` is to create a copy of the pipeline that entirely relies on the external cache alone. That's why the local `singularity-images/` directory is left empty and all images are rather deposited into the cache. `nextflow.config` is not modified. Here also, the command only downloads the images that are not present in the cache/library. Contrary to `-u copy`, images are copied to the cache to make it complete on its own, rather than allowing some images to be in the library only.
- However, I don't understand the implementation/purpose of skipping the `-u` option. When I run `nf-core pipelines download` without `-u`, the local `singularity-images/` directory is made complete but `nextflow.config` isn't updated, meaning `singularity-images/` is not used. Fortunately, like `-u copy`, the cache is updated, so the pipeline can still run offline provided the cache (and the library) are set externally. It's essentially something in between `-u copy` and `-u amend`, but I can't define the rationale.
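To make the comparison easier to follow, here is a small Python model of the three end states listed above. It only encodes what was observed in testing; the `Outcome` class, its field names, and the helper functions are hypothetical and do not come from the nf-core/tools code base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outcome:
    local_dir_complete: bool        # singularity-images/ holds every image
    config_rewritten: bool          # nextflow.config points at singularity-images/
    downloads_added_to_cache: bool  # newly downloaded images also land in the cache
    library_copied_to_cache: bool   # library images are duplicated into the cache


# Observed behaviour per value of -u (None means -u was not given).
OBSERVED = {
    "copy": Outcome(True, True, True, False),
    "amend": Outcome(False, False, True, True),
    None: Outcome(True, False, True, False),
}


def runs_without_external_dirs(mode: Optional[str]) -> bool:
    """True if the download can run with neither cache nor library set."""
    outcome = OBSERVED[mode]
    return outcome.local_dir_complete and outcome.config_rewritten


def local_copy_is_unused(mode: Optional[str]) -> bool:
    """True if singularity-images/ is filled but nextflow.config ignores it."""
    outcome = OBSERVED[mode]
    return outcome.local_dir_complete and not outcome.config_rewritten
```

In this model, only the unset case fills `singularity-images/` without ever using it, which is the inconsistency raised in the last bullet.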
My thinking is that:
- In `-u amend` mode, no need to duplicate images from the library to the cache. In other words, assume that the pipeline will run with both the cache and library set the same way as in `nf-core pipelines download`, rather than only the cache.
- In `-u amend` mode, no need to create a local `singularity-images/` directory if it's left empty.
- Don't allow `-u` to be unset?
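The first point can be sketched as a planning step over sets of image file names. This is only an illustration under stated assumptions: `plan_amend` and its `duplicate_library` switch are hypothetical names, with `duplicate_library=True` standing for the behaviour observed today and `False` for the behaviour proposed here.

```python
from typing import Iterable, List, Tuple


def plan_amend(
    required: Iterable[str],
    library: Iterable[str],
    cache: Iterable[str],
    duplicate_library: bool = False,
) -> Tuple[List[str], List[str]]:
    """Return (images to download, images to copy from library to cache)."""
    required_set, library_set, cache_set = set(required), set(library), set(cache)
    # Only images found in neither directory need to be downloaded.
    to_download = required_set - library_set - cache_set
    if duplicate_library:
        # Observed behaviour: make the cache complete on its own.
        to_copy = (required_set & library_set) - cache_set
    else:
        # Proposed behaviour: trust the library, copy nothing.
        to_copy = set()
    return sorted(to_download), sorted(to_copy)
```

For example, with `required={"a.img", "b.img", "c.img"}`, `library={"a.img"}` and `cache={"b.img"}`, both variants download only `c.img`; the observed variant additionally copies `a.img` into the cache.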
Originally posted by @muffato in https://github.com/nf-core/tools/issues/3634#issuecomment-3193885334
Hi @muffato thanks for your comments!
There are currently three options: ["amend", "copy", "remote"]
> The proposed implementation/purpose of -u amend is to create a copy of the pipeline that entirely relies on the external cache alone. That's why the local singularity-images/ directory is left empty and all images are rather deposited into the cache. nextflow.config is not modified. Here also, the command only downloads the images that are not present in the cache/library. Contrary to -u copy, images are copied to the cache to make it complete on its own, rather than allowing some images to be in the library only.
This should be handled by the env variable NXF_SINGULARITY_CACHEDIR, no? If you set it before running your pipeline, then singularity should check there for images and find them cached 🤔
> In -u amend mode, no need to duplicate images from the library to the cache. In other words, assume that the pipeline will run with both the cache and library set the same way as in nf-core pipelines download, rather than only the cache.
Not sure that is being done; it should only use the NXF_SINGULARITY_CACHEDIR.
> In -u amend mode, no need to create a local singularity-images/ directory if it's left empty.
Sure, that can be skipped 👍
> Don't allow -u to be unset?
Better yet, set one of them as default, I would suggest "copy" as it would be the most expected behavior imho.
What are your thoughts on this @MatthiasZepper ? Are we missing something?
> Better yet, set one of them as default, I would suggest "copy" as it would be the most expected behavior imho.
Yes, that is a good idea.
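As a rough illustration of the agreed change, this is how a defaulted choice option looks with Python's standard `argparse`. It is a sketch only: the parser, the program name, and the `cache_utilisation` destination are hypothetical and are not the actual nf-core/tools command-line definition.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="download-sketch")
    parser.add_argument(
        "-u",
        dest="cache_utilisation",
        choices=["amend", "copy", "remote"],
        default="copy",  # omitting -u now behaves exactly like -u copy
        help="how downloaded images relate to the external cache",
    )
    return parser
```

With a default in place, the undefined "no `-u`" state discussed above can no longer occur: `build_parser().parse_args([])` yields `cache_utilisation == "copy"`.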
Ultimately, cache utilization is an old parameter that was initially a Boolean flag to switch between False (copy) and True (amend). Because I needed remote as a third option, it was changed into the current form. Therefore, it may be worth considering whether we should replace it entirely with a new and possibly clearer parameter.
Just a quick draft, but something along the lines of --output-style / --output-spec with options like self-reliant, local-cache, remote-cache, local-library-and-cache etc. could work?
> > The proposed implementation/purpose of -u amend is to create a copy of the pipeline that entirely relies on the external cache alone. That's why the local singularity-images/ directory is left empty and all images are rather deposited into the cache. nextflow.config is not modified. Here also, the command only downloads the images that are not present in the cache/library. Contrary to -u copy, images are copied to the cache to make it complete on its own, rather than allowing some images to be in the library only.
> This should be handled by the env variable `NXF_SINGULARITY_CACHEDIR`, no? If you set it before running your pipeline, then singularity should check there for images and find them cached 🤔
That's not the problem. The problem is that with -u amend, unnecessary copies of the images are being made.
- `-u copy`. The end result is `singularity-images/` being complete. That's perfect for standalone use and copying to other systems.
- `-u amend`. The end result is `NXF_SINGULARITY_CACHEDIR` being complete, even if `NXF_SINGULARITY_LIBRARYDIR` is set and already contains images. This duplicates images unnecessarily and it is at odds with Nextflow's own behaviour (Nextflow does not copy images from `NXF_SINGULARITY_LIBRARYDIR` to `NXF_SINGULARITY_CACHEDIR`).
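To illustrate why the duplication is unnecessary, here is a minimal sketch of a run-time lookup that consults both directories and never copies between them. It assumes, as I understand Nextflow's documentation, that the library directory is checked before the cache directory; `find_image` is a hypothetical helper, not Nextflow's or nf-core's code.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_image(filename: str, env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the first existing copy of an image, or None if it must be pulled."""
    env = os.environ if env is None else env
    # Assumed order: read-only library first, then the writable cache.
    for variable in ("NXF_SINGULARITY_LIBRARYDIR", "NXF_SINGULARITY_CACHEDIR"):
        directory = env.get(variable)
        if not directory:
            continue  # variable unset or empty: skip this location
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate  # used in place, nothing is copied
    return None
```

An image that exists only in the library is resolved directly from there, so a run with both variables set finds every image even when the cache alone is incomplete.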
> > In -u amend mode, no need to duplicate images from the library to the cache. In other words, assume that the pipeline will run with both the cache and library set the same way as in nf-core pipelines download, rather than only the cache.
> Not sure that is being done; it should only use the `NXF_SINGULARITY_CACHEDIR`.
Why would we assume that the pipeline will run in an environment where NXF_SINGULARITY_CACHEDIR is set the same way as in nf-core pipelines download but NXF_SINGULARITY_LIBRARYDIR is not?
> Therefore, it may be worth considering, if we should replace it entirely with a new and possibly clearer parameter?
> Just a quick draft, but something along the lines of `--output-style` / `--output-spec` with options like `self-reliant`, `local-cache`, `remote-cache`, `local-library-and-cache` etc. could work?
Renaming the parameter may help, but I don't think there are that many options. I think the only two options are:
- Make a complete local copy (what `-u copy` does)
  - `-u remote` is a special case where the user states that some images are already downloaded and can be skipped.
- Populate `NXF_SINGULARITY_CACHEDIR` (`-u amend`) so that Nextflow's library+cache are complete.
> > Don't allow -u to be unset?
> Better yet, set one of them as default, I would suggest "copy" as it would be the most expected behavior imho.
Perfect suggestion 👍🏼