WIP: Refactor pipeline downloads command
This is a draft PR for refactoring nf-core pipelines download for readability and to use the nextflow inspect command for container detection. It builds upon the excellent work of @MatthiasZepper, @muffato (#3509), @JulianFlesch (refactor/pipeline-download), and @nikhil (#3517).
Integrated changes from contributors:
- Base container fetching primarily on
nextflow inspect - Add
SingularityFetcherclass for fetching singularity images - Add
DockerFetcherclass for fetching docker images
Added and planned changes to the download codebase
- [ ] Refactor
download.py:- [x] Split out the
WorkflowRepoclass - [ ] Support several container systems (
docker/singularity) - [x] Move regex logic to
download/utils.py. (Or remove entirely. We should decide if we want to have a transitory period where we default to regex ifnextflow inspectfails) - [x] Add checks and nice reporting for required Nextflow version for
nextflow inspect
- [x] Split out the
- [ ] Refactor
SingularityFetcherandDockerFetcher. There is a lot of duplication so we should probably add a superclass "ContainerFetcher" with shared logic and abstract methods. The same is true for the Error classes.
Tests
The nextflow inspect command is required to run on a full pipeline: the command uses main.nf as the entry point by default and then checks what modules are imported in any workflows or subworkflows that are used.
This means that out current tests which are based on stripped modules do not work.
I therefore plan to remove the old test data and replace it with a pipeline skeleton in mock_pipeline_containers (might be renamed) with the following specs
- The container URI that
nextflow inspectare located inmodules/mock-modules/-- I have structured it so that it mirrors the typical structure of local modules in an nf-core pipeline. - The tested URIs are taken from the containers present in
mock_module_containers: all container strings are captured correctly except for themock_dsl2_variable.nf, presumably sincenextflow inspectdoes not run any code in thescriptsection, meaning thatcontainer_idis resolved tonull. Since variable container id seems to be legacy, I do not think we need to support this test. - I have left much of the Nextflow code added by the
nf-core createcommand -- we could definitely make this folder leaner if that is desired. - Correct behaviour is tested in
tests/download/DownloadTest.test_containers_pipeline_<container>where<container>is eithersingularityordocker. In these tests the output fromnextflow inspect . -profile <container>is compared to the true list of containers which is kept inmock_pipeline_containers/per_profile_outputas json file.
Discussion points related to this:
- Should we support container definitions in configs? While
nextflow inspectwill capture them if they are configured correctly I am not sure if it is desired behavior.- Since we will likely make the container definitions more strict (Move to seqera containers 1 and Move to seqera containers 2), I would presume this is the case, but then we need to decide on backwards compatibility.
- If we want to support it, then we need to port the tests to ensure that containers are capture correctly
PR checklist
- [ ] This comment contains a description of changes (with reason)
- [ ]
CHANGELOG.mdis updated - [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] Documentation in
docsis updated
Codecov Report
:x: Patch coverage is 67.67276% with 407 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 77.69%. Comparing base (76b5635) to head (384a6ea).
:warning: Report is 124 commits behind head on dev.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
Current failing tests are not dure to this PR, I am fixing then in a separate one https://github.com/nf-core/tools/actions/runs/16113859276/job/45463246665?pr=3634
@mirpedrol Thanks for letting me know!
Thanks for all the feedback @MatthiasZepper ! I'll try to summarize your suggestions here to open up for further discussions and to set a path for finishing this PR:
- [x] Check logic of
amend_cachedirflag.- Check comment below
- [x] Check backwards compatibility of new
--parallelflag.- Should be fine since the minor flag is kept. But we need to check if the pipeline repos use the
devbranch for download testing.
- Should be fine since the minor flag is kept. But we need to check if the pipeline repos use the
- [x] Check prefixing libraries for
DockerFetcher.- Support for libraries will not be supported for docker initially. We should consider this after the move to Seqera containers.
- [x] Switch
os.pathtopathlib - [x] Check if
get_adressshould be abstract or just inSingularityFetcher- Should just be in
SingularityFetcher:nextflow inspecthandles this for docker images.
- Should just be in
- [x] Run
docker pullin parallel - [x] Fix blocking due to bad concurrency handling with
save_futures - [x] Check handling of absolute URIs for
DockerFetcher- They are always absolute from what I can tell.
- [x] Do not select
Singularitysilently - [x] Figure out how we should handle different registries, and how the move to Seqera containers affects this
- I have not made any change to how Singularity handles different registries. For Docker there is no support for different registries at all, and if we want to add it I think it should be in a later PR.
- [x]
Supportremotecontainer option for docker- This is postponed
- [x] Make regex on line 564 not singularity specific
- I changed the name of the function to
read_remote_singularity_containerssince we do not support remote containers for Docker at the moment.
- I changed the name of the function to
- [x] Check benefit of compression of docker images (check change in filesize).
- This seems beneficial still. I got a 3x size reduction when compressing the
docker-imagesdirectory totar.gzfor a (fairly small) pipeline (phyloplace 2.0.0).
- This seems beneficial still. I got a 3x size reduction when compressing the
- [x] Run
nextflow inspectwithtest|test_fullprofiles as well to capture all containersnextflow inspectis now invoked withprofile <container-system>,test,test_full. I am not sure if this is reliable enough
- [x] Make
download.pyless singularity specific. - [x]
Remove.sifextension- This seems to have been around for a while, should it really be removed?
- [x] Fix order in
download/utils.py - [x] Move generic code from
download/utils.pytoutils.py - [x] Check use of
DownloadProgressbetween singularity and docker - [x] Remove error message and test for variable in container string
Update on the logic of amend cachedir @MatthiasZepper:
From reading the documentation more closely it seems to me that if amend is specified then we should not copy the files to the output directory ((docker|singularity)-images) since in this case we only want the files in the directory specified by $NXF_SINGULARITY_CACHEDIR, and this explains the logic you commented on above. Since I've made some changes here are the lines at the current revision.
Legacy code is removed as of c898730.
bb84582 adds ignores to the directories null/ and .nextflow/ in tests/pipelines. I have decided to not clean this up from the tools side since they are created by Nextflow not our own code, and might not have predicable names. Perhaps we should raise an issue on the Nextflow repo regarding these created files?
Follow up issues:
amendoption for Docker: should only pull containers- Container libraries for Docker:
nextflow inspectwill return the absolute URI regardless of whether the URI in container directive is absolute or not. Perhaps this is better addressed when switching to Seqera containers? - Move
gather_registriestoContainerFetcherclass or subclasses. EDIT: WIP pr at #3696
I've checked the template for the the gh workflow for downloading the pipeline and it seems that it does depend on the dev branch @mashehu. However it apparently does not use the --parallel-downloads flag so it should not break. However, I figure that merging this would mean that the new downloads code is tested on all pipelines as soon as someone creates an event that triggers the workflows, which is a bit scary. What do you think is the best plan of action?
Container libraries for Docker:
nextflow inspectwill return the absolute URI regardless of whether the URI in container directive is absolute or not. Perhaps this is better addressed when switching to Seqera containers?
I didn't know that. This makes the --container-library parameter essentially useless, because the libraries/registries are no longer handled by tools, but by Nextflow.
The only thing these would still be required are the symlinks, which are anyway not my favorite solution. So once the Seqera Containers are adopted, we might not need that anymore either.
Docs updated in: https://github.com/nf-core/website/pull/3463
Hello ! That's an impressive PR, congrats !
I'm testing the feature on a few pipelines and noticed that there are some leftover files from running nextflow
$ env NXF_SINGULARITY_CACHEDIR='' nf-core pipelines download -r 1.11.0 -o fngs -x none -c yes -s singularity -d 4 nf-core/fetchngs
(...)
$ find .nextflow/ work/ null/
.nextflow/
.nextflow/cache
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db/000003.log
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db/LOCK
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db/MANIFEST-000002
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db/CURRENT
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/index.kickass_leibniz
.nextflow/plr
work/
null/
null/pipeline_info
null/pipeline_info/execution_timeline_2025-08-16_14-29-22.html
null/pipeline_info/pipeline_dag_2025-08-16_14-29-22.html
null/pipeline_info/execution_report_2025-08-16_14-29-22.html
null/pipeline_info/execution_trace_2025-08-16_14-29-22.txt
(null is the default value of outdir for this pipeline. For other pipelines it could be something else)
Have you noticed that ?
Also, images are created with permissions rw------- (and it's not a umask thing from my environment).
$ ls -l fngs/singularity-images/ | grep -v '^l'
total 523852
-rw------- 1 mm49 team328 154169344 Aug 16 14:32 mulled-v2-5f89fe0cd045cb1d615630b9261a1d17943a9b6a-2f4a4c900edd6801ff0068c2b3048b4459d119eb-0.img
-rw------- 1 mm49 team328 62779423 Aug 16 14:32 python-3.9--1.img
-rw------- 1 mm49 team328 137523200 Aug 16 14:32 sra-tools-2.11.0--pl5321ha49a11a_3.img
-rw------- 1 mm49 team328 154169344 Aug 16 14:32 sra-tools-3.0.8--h9f5acd7_0.img
-rw------- 1 mm49 team328 27750400 Aug 16 14:32 ubuntu-20.04.img
$ umask
0022
$ touch test
$ ls -l test
-rw-r--r-- 1 mm49 team328 0 Aug 16 14:34 test
Finally, to comment on https://github.com/nf-core/tools/pull/3634#discussion_r2195332147 and https://github.com/nf-core/tools/pull/3634#discussion_r2223736610 , here are my observations from testing combinations of cache/library parameters and options.
- The proposed implementation/purpose of
-u copyis to create a standalone copy of the pipeline. Thus, its localsingularity-images/directory must be complete, andnextflow.configmust be modified to point at it. The pipeline can then run without a cache/library directory externally set. Rightly, the command takes images from the pre-existing cache or library if possible, rather than downloading absolutely everything. The current implementation also copies images that are downloaded to the cache directory. This has no runtime effect but honours the purpose of the cache/library: after the command is run, they together hold all images as well, thus helping future / other invocations of Nextflow. Note that the cache may not hold all images itself, as some may rather be in the library only. - The proposed implementation/purpose of
-u amendis to create a copy of the pipeline that entirely relies on the external cache alone. That's why the localsingularity-images/directory is left empty and all images are rather deposited into the cache.nextflow.configis not modified. Here also, the command only downloads the images that are not present in the cache/library. Contrary to-u copy, images are copied to the cache to make it complete on its own, rather than allowing some images to be in the library only. - However I don't understand the implementation/purpose of skipping the
-uoption. When I runnf-core pipelines downloadwithout-u, the localsingularity-images/directory is made complete butnextflow.configisn't updated, meaningsingularity-images/is not used. Fortunately, like-u copy, the cache is updated, so the pipeline can still run offline provided the cache (and the library) are set externally. It's essentially something in between-u copyand-u amend, but I can't define the rationale.
My thinking is that:
- In
-u amendmode, no need to duplicate images from the library to the cache. In other words, assume that the pipeline will run with both the cache and library set the same way as innf-core pipelines download, rather than only the cache. - In
-u amendmode, no need to create a localsingularity-images/directory if it's left empty. - Don't allow
-uto be unset ?
Matthieu
Hi @ErikDanielsson Does this support authenticated download from a private GitHub repo now? It would be a very easy fix, if instead of curling the repo tar.gz via githubs api url, one just uses the GitHubRepo class that nf-core tools uses everywhere else. I have the feeling this could be part of a nf-core download refactor as well. You can use my minimal PR: https://github.com/nf-core/tools/pull/3607
REf. issue: #2406
Hi folks! Thanks for the reviews and comments @muffato and @jpfeuffer. This work was done by @ErikDanielsson on a summer placement and he's now finished.
To avoid this PR going stale, @JulianFlesch will do the final tidy up and merge, and move your two comments into new issues to address post-merge. Hope that's ok!