tools icon indicating copy to clipboard operation
tools copied to clipboard

WIP: Refactor pipeline downloads command

Open ErikDanielsson opened this issue 6 months ago • 1 comments

This is a draft PR for refactoring nf-core pipelines download for readability and to use the nextflow inspect command for container detection. It builds upon the excellent work of @MatthiasZepper, @muffato (#3509), @JulianFlesch (refactor/pipeline-download), and @nikhil (#3517).

Integrated changes from contributors:

  • Base container fetching primarily on nextflow inspect
  • Add SingularityFetcher class for fetching singularity images
  • Add DockerFetcher class for fetching docker images

Added and planned changes to the download codebase

  • [ ] Refactor download.py:
    • [x] Split out the WorkflowRepo class
    • [ ] Support several container systems (docker/singularity)
    • [x] Move regex logic to download/utils.py. (Or remove entirely. We should decide if we want to have a transitory period where we default to regex if nextflow inspect fails)
    • [x] Add checks and nice reporting for required Nextflow version for nextflow inspect
  • [ ] Refactor SingularityFetcher and DockerFetcher. There is a lot of duplication so we should probably add a superclass "ContainerFetcher" with shared logic and abstract methods. The same is true for the Error classes.

Tests

The nextflow inspect command is required to run on a full pipeline: the command uses main.nf as the entry point by default and then checks what modules are imported in any workflows or subworkflows that are used. This means that out current tests which are based on stripped modules do not work. I therefore plan to remove the old test data and replace it with a pipeline skeleton in mock_pipeline_containers (might be renamed) with the following specs

  • The container URI that nextflow inspect are located in modules/mock-modules/ -- I have structured it so that it mirrors the typical structure of local modules in an nf-core pipeline.
  • The tested URIs are taken from the containers present in mock_module_containers: all container strings are captured correctly except for the mock_dsl2_variable.nf, presumably since nextflow inspect does not run any code in the script section, meaning that container_id is resolved to null. Since variable container id seems to be legacy, I do not think we need to support this test.
  • I have left much of the Nextflow code added by the nf-core create command -- we could definitely make this folder leaner if that is desired.
  • Correct behaviour is tested in tests/download/DownloadTest.test_containers_pipeline_<container> where <container> is either singularity or docker. In these tests the output from nextflow inspect . -profile <container> is compared to the true list of containers which is kept in mock_pipeline_containers/per_profile_output as json file.

Discussion points related to this:

  • Should we support container definitions in configs? While nextflow inspect will capture them if they are configured correctly I am not sure if it is desired behavior.
    • Since we will likely make the container definitions more strict (Move to seqera containers 1 and Move to seqera containers 2), I would presume this is the case, but then we need to decide on backwards compatibility.
    • If we want to support it, then we need to port the tests to ensure that containers are capture correctly

PR checklist

  • [ ] This comment contains a description of changes (with reason)
  • [ ] CHANGELOG.md is updated
  • [ ] If you've fixed a bug or added code that should be tested, add tests!
  • [ ] Documentation in docs is updated

ErikDanielsson avatar Jun 23 '25 09:06 ErikDanielsson

Codecov Report

:x: Patch coverage is 67.67276% with 407 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 77.69%. Comparing base (76b5635) to head (384a6ea). :warning: Report is 124 commits behind head on dev.

Files with missing lines Patch % Lines
nf_core/pipelines/download/singularity.py 51.38% 193 Missing :warning:
nf_core/pipelines/download/download.py 73.65% 93 Missing :warning:
nf_core/pipelines/download/workflow_repo.py 67.34% 48 Missing :warning:
nf_core/pipelines/download/docker.py 75.00% 41 Missing :warning:
nf_core/pipelines/download/container_fetcher.py 87.97% 19 Missing :warning:
nf_core/pipelines/download/utils.py 65.78% 13 Missing :warning:

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jun 25 '25 12:06 codecov[bot]

Current failing tests are not dure to this PR, I am fixing then in a separate one https://github.com/nf-core/tools/actions/runs/16113859276/job/45463246665?pr=3634

mirpedrol avatar Jul 07 '25 10:07 mirpedrol

@mirpedrol Thanks for letting me know!

ErikDanielsson avatar Jul 07 '25 10:07 ErikDanielsson

Thanks for all the feedback @MatthiasZepper ! I'll try to summarize your suggestions here to open up for further discussions and to set a path for finishing this PR:

  • [x] Check logic of amend_cachedir flag.
    • Check comment below
  • [x] Check backwards compatibility of new --parallel flag.
    • Should be fine since the minor flag is kept. But we need to check if the pipeline repos use the dev branch for download testing.
  • [x] Check prefixing libraries for DockerFetcher.
    • Support for libraries will not be supported for docker initially. We should consider this after the move to Seqera containers.
  • [x] Switch os.path to pathlib
  • [x] Check if get_adress should be abstract or just in SingularityFetcher
    • Should just be in SingularityFetcher: nextflow inspect handles this for docker images.
  • [x] Run docker pull in parallel
  • [x] Fix blocking due to bad concurrency handling with save_futures
  • [x] Check handling of absolute URIs for DockerFetcher
    • They are always absolute from what I can tell.
  • [x] Do not select Singularity silently
  • [x] Figure out how we should handle different registries, and how the move to Seqera containers affects this
    • I have not made any change to how Singularity handles different registries. For Docker there is no support for different registries at all, and if we want to add it I think it should be in a later PR.
  • [x] Support remote container option for docker
    • This is postponed
  • [x] Make regex on line 564 not singularity specific
    • I changed the name of the function to read_remote_singularity_containers since we do not support remote containers for Docker at the moment.
  • [x] Check benefit of compression of docker images (check change in filesize).
    • This seems beneficial still. I got a 3x size reduction when compressing the docker-images directory to tar.gz for a (fairly small) pipeline (phyloplace 2.0.0).
  • [x] Run nextflow inspect with test|test_full profiles as well to capture all containers
    • nextflow inspect is now invoked with profile <container-system>,test,test_full. I am not sure if this is reliable enough
  • [x] Make download.py less singularity specific.
  • [x] Remove .sif extension
  • [x] Fix order in download/utils.py
  • [x] Move generic code from download/utils.py to utils.py
  • [x] Check use of DownloadProgress between singularity and docker
  • [x] Remove error message and test for variable in container string

ErikDanielsson avatar Jul 14 '25 13:07 ErikDanielsson

Update on the logic of amend cachedir @MatthiasZepper: From reading the documentation more closely it seems to me that if amend is specified then we should not copy the files to the output directory ((docker|singularity)-images) since in this case we only want the files in the directory specified by $NXF_SINGULARITY_CACHEDIR, and this explains the logic you commented on above. Since I've made some changes here are the lines at the current revision.

ErikDanielsson avatar Jul 21 '25 11:07 ErikDanielsson

Legacy code is removed as of c898730.

ErikDanielsson avatar Jul 22 '25 11:07 ErikDanielsson

bb84582 adds ignores to the directories null/ and .nextflow/ in tests/pipelines. I have decided to not clean this up from the tools side since they are created by Nextflow not our own code, and might not have predicable names. Perhaps we should raise an issue on the Nextflow repo regarding these created files?

ErikDanielsson avatar Jul 22 '25 13:07 ErikDanielsson

Follow up issues:

  • amend option for Docker: should only pull containers
  • Container libraries for Docker: nextflow inspect will return the absolute URI regardless of whether the URI in container directive is absolute or not. Perhaps this is better addressed when switching to Seqera containers?
  • Move gather_registries to ContainerFetcher class or subclasses. EDIT: WIP pr at #3696

ErikDanielsson avatar Jul 23 '25 09:07 ErikDanielsson

I've checked the template for the the gh workflow for downloading the pipeline and it seems that it does depend on the dev branch @mashehu. However it apparently does not use the --parallel-downloads flag so it should not break. However, I figure that merging this would mean that the new downloads code is tested on all pipelines as soon as someone creates an event that triggers the workflows, which is a bit scary. What do you think is the best plan of action?

ErikDanielsson avatar Jul 23 '25 09:07 ErikDanielsson

Container libraries for Docker: nextflow inspect will return the absolute URI regardless of whether the URI in container directive is absolute or not. Perhaps this is better addressed when switching to Seqera containers?

I didn't know that. This makes the --container-library parameter essentially useless, because the libraries/registries are no longer handled by tools, but by Nextflow.

The only thing these would still be required are the symlinks, which are anyway not my favorite solution. So once the Seqera Containers are adopted, we might not need that anymore either.

MatthiasZepper avatar Jul 23 '25 13:07 MatthiasZepper

Docs updated in: https://github.com/nf-core/website/pull/3463

ErikDanielsson avatar Aug 15 '25 09:08 ErikDanielsson

Hello ! That's an impressive PR, congrats !

I'm testing the feature on a few pipelines and noticed that there are some leftover files from running nextflow

$ env NXF_SINGULARITY_CACHEDIR='' nf-core pipelines download -r 1.11.0 -o fngs -x none -c yes -s singularity -d 4 nf-core/fetchngs
(...)
$ find .nextflow/ work/ null/
.nextflow/
.nextflow/cache
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db/000003.log
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db/LOCK
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db/MANIFEST-000002
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db/CURRENT
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/index.kickass_leibniz
.nextflow/plr
work/
null/
null/pipeline_info
null/pipeline_info/execution_timeline_2025-08-16_14-29-22.html
null/pipeline_info/pipeline_dag_2025-08-16_14-29-22.html
null/pipeline_info/execution_report_2025-08-16_14-29-22.html
null/pipeline_info/execution_trace_2025-08-16_14-29-22.txt

(null is the default value of outdir for this pipeline. For other pipelines it could be something else)

Have you noticed that ?


Also, images are created with permissions rw------- (and it's not a umask thing from my environment).

$ ls -l fngs/singularity-images/ | grep -v '^l'
total 523852
-rw------- 1 mm49 team328 154169344 Aug 16 14:32 mulled-v2-5f89fe0cd045cb1d615630b9261a1d17943a9b6a-2f4a4c900edd6801ff0068c2b3048b4459d119eb-0.img
-rw------- 1 mm49 team328  62779423 Aug 16 14:32 python-3.9--1.img
-rw------- 1 mm49 team328 137523200 Aug 16 14:32 sra-tools-2.11.0--pl5321ha49a11a_3.img
-rw------- 1 mm49 team328 154169344 Aug 16 14:32 sra-tools-3.0.8--h9f5acd7_0.img
-rw------- 1 mm49 team328  27750400 Aug 16 14:32 ubuntu-20.04.img
$ umask
0022
$ touch test
$ ls -l test
-rw-r--r-- 1 mm49 team328 0 Aug 16 14:34 test

Finally, to comment on https://github.com/nf-core/tools/pull/3634#discussion_r2195332147 and https://github.com/nf-core/tools/pull/3634#discussion_r2223736610 , here are my observations from testing combinations of cache/library parameters and options.

  1. The proposed implementation/purpose of -u copy is to create a standalone copy of the pipeline. Thus, its local singularity-images/ directory must be complete, and nextflow.config must be modified to point at it. The pipeline can then run without a cache/library directory externally set. Rightly, the command takes images from the pre-existing cache or library if possible, rather than downloading absolutely everything. The current implementation also copies images that are downloaded to the cache directory. This has no runtime effect but honours the purpose of the cache/library: after the command is run, they together hold all images as well, thus helping future / other invocations of Nextflow. Note that the cache may not hold all images itself, as some may rather be in the library only.
  2. The proposed implementation/purpose of -u amend is to create a copy of the pipeline that entirely relies on the external cache alone. That's why the local singularity-images/ directory is left empty and all images are rather deposited into the cache. nextflow.config is not modified. Here also, the command only downloads the images that are not present in the cache/library. Contrary to -u copy, images are copied to the cache to make it complete on its own, rather than allowing some images to be in the library only.
  3. However I don't understand the implementation/purpose of skipping the -u option. When I run nf-core pipelines download without -u, the local singularity-images/ directory is made complete but nextflow.config isn't updated, meaning singularity-images/ is not used. Fortunately, like -u copy, the cache is updated, so the pipeline can still run offline provided the cache (and the library) are set externally. It's essentially something in between -u copy and -u amend, but I can't define the rationale.

My thinking is that:

  1. In -u amend mode, no need to duplicate images from the library to the cache. In other words, assume that the pipeline will run with both the cache and library set the same way as in nf-core pipelines download, rather than only the cache.
  2. In -u amend mode, no need to create a local singularity-images/ directory if it's left empty.
  3. Don't allow -u to be unset ?

Matthieu

muffato avatar Aug 16 '25 20:08 muffato

Hi @ErikDanielsson Does this support authenticated download from a private GitHub repo now? It would be a very easy fix, if instead of curling the repo tar.gz via githubs api url, one just uses the GitHubRepo class that nf-core tools uses everywhere else. I have the feeling this could be part of a nf-core download refactor as well. You can use my minimal PR: https://github.com/nf-core/tools/pull/3607

REf. issue: #2406

jpfeuffer avatar Aug 22 '25 15:08 jpfeuffer

Hi folks! Thanks for the reviews and comments @muffato and @jpfeuffer. This work was done by @ErikDanielsson on a summer placement and he's now finished.

To avoid this PR going stale, @JulianFlesch will do the final tidy up and merge, and move your two comments into new issues to address post-merge. Hope that's ok!

ewels avatar Aug 26 '25 12:08 ewels