scrnaseq icon indicating copy to clipboard operation
scrnaseq copied to clipboard

Adding cellrangermulti subworkflow

Open fmalmeida opened this issue 2 years ago • 18 comments

Close #247 Close #313

PR checklist

  • [ ] This comment contains a description of changes (with reason).
  • [ ] If you've fixed a bug or added code that should be tested, add tests!
  • [ ] If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • [ ] If necessary, also make a PR on the nf-core/scrnaseq branch on the nf-core/test-datasets repository.
  • [ ] Make sure your code lints (nf-core lint).
  • [ ] Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • [ ] Usage Documentation in docs/usage.md is updated.
  • [ ] Output Documentation in docs/output.md is updated.
  • [ ] CHANGELOG.md is updated.
  • [ ] README.md is updated (including new tool citations and authors/contributors).

Context

Hi guys,

Although not finished yet because it would still required updating the parameters schema, defaults and documentation, I am already opening the PR so we can all take a look at it and discuss any modifications required before merging and also, give it a round of tests and define how we want some parameters to be.

I used the templates provided by @klkeys

Usage context image

samplesheet To use it, samplesheet requires an additional parameter so that we can properly mix the different feature types given per sample.

sample,fastq_1,fastq_2,feature_type,protocol,expected_cells
PBMC_10K,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/10k_pbmc/fastqs/5gex/5gex/subsampled_sc5p_v2_hs_PBMC_10k_5gex_S1_L001_R1_001.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/10k_pbmc/fastqs/5gex/5gex/subsampled_sc5p_v2_hs_PBMC_10k_5gex_S1_L001_R2_001.fastq.gz,gex,SC5P-PE,1000
PBMC_10K,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/10k_pbmc/fastqs/bcell/subsampled_sc5p_v2_hs_PBMC_10k_b_S1_L001_R1_001.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/10k_pbmc/fastqs/bcell/subsampled_sc5p_v2_hs_PBMC_10k_b_S1_L001_R2_001.fastq.gz,vdj,SC5P-PE,1000
PBMC_10K,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/10k_pbmc/fastqs/5gex/5fb/subsampled_sc5p_v2_hs_PBMC_10k_5fb_S1_L001_R1_001.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/10k_pbmc/fastqs/5gex/5fb/subsampled_sc5p_v2_hs_PBMC_10k_5fb_S1_L001_R2_001.fastq.gz,ab,SC5P-PE,1000
PBMC_10K_CMO,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/10k_pbmc_cmo/fastqs/gex_1/subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_1_gex_S2_L001_R1_001.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/10k_pbmc_cmo/fastqs/gex_1/subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_1_gex_S2_L001_R2_001.fastq.gz,gex,SC3Pv3,1000
PBMC_10K_CMO,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/10k_pbmc_cmo/fastqs/cmo/subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_1_multiplexing_capture_S1_L001_R1_001.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/10k_pbmc_cmo/fastqs/cmo/subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_1_multiplexing_capture_S1_L001_R2_001.fastq.gz,cmo,SC3Pv3,1000
PBMC_10K_CMV,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/5k_cmvpos_tcells/fastqs/gex_1/subsampled_5k_human_antiCMV_T_TBNK_connect_GEX_1_S1_L001_R1_001.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/5k_cmvpos_tcells/fastqs/gex_1/subsampled_5k_human_antiCMV_T_TBNK_connect_GEX_1_S1_L001_R2_001.fastq.gz,gex,SC5P-R2,1000
PBMC_10K_CMV,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/5k_cmvpos_tcells/fastqs/ab/subsampled_5k_human_antiCMV_T_TBNK_connect_AB_S2_L004_R1_001.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/5k_cmvpos_tcells/fastqs/ab/subsampled_5k_human_antiCMV_T_TBNK_connect_AB_S2_L004_R2_001.fastq.gz,ab,SC5P-R2,1000
PBMC_10K_CMV,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/5k_cmvpos_tcells/fastqs/vdj/subsampled_5k_human_antiCMV_T_TBNK_connect_VDJ_S1_L001_R1_001.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/10xgenomics/cellranger/5k_cmvpos_tcells/fastqs/vdj/subsampled_5k_human_antiCMV_T_TBNK_connect_VDJ_S1_L001_R2_001.fastq.gz,vdj,SC5P-R2,1000

Supporting files

Right now, all the supporting files have been added as parameters, for example, cmo_barcode_csv, beam_antigen_csv, etc. .... which means they will work in a dataset manner, being the same for everything given in the samplesheet, instead of samplesheet base if they were added as columns in the samplesheet.

My main question here is, what should it be the desired approach?

Other stuff Of course there might still have things to clear or finish that I might have overlooked since there is quite a lot on it, so, I request your help on spotting it.

testing case required the full genomes from ensembl, otherwise, the analysis using VDJ was failling.

fmalmeida avatar Nov 23 '23 11:11 fmalmeida

Python linting (black) is failing

To keep the code consistent with lots of contributors, we run automated code consistency checks. To fix this CI test, please run:

  • Install black: pip install black
  • Fix formatting errors in your pipeline: black .

Once you push these changes the test should pass, and you can hide this comment :+1:

We highly recommend setting up Black in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!

Thanks again for your contribution!

github-actions[bot] avatar Nov 23 '23 11:11 github-actions[bot]

nf-core lint overall result: Passed :white_check_mark: :warning:

Posted for pipeline commit 4d9f17e

+| ✅ 203 tests passed       |+
#| ❔   4 tests were ignored |#
!| ❗   4 tests had warnings |!

:heavy_exclamation_mark: Test warnings:

  • nextflow_config - Config manifest.version should end in dev: 2.6.0
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!

:grey_question: Tests ignored:

:white_check_mark: Tests passed:

Run details

  • nf-core/tools version 2.14.1
  • Run at 2024-05-22 17:08:54

github-actions[bot] avatar Nov 23 '23 11:11 github-actions[bot]

Hi guys, @grst , @klkeys and @apeltzer , Do you guys have any other comments on this one? Specially about the following PR in nfcore/modules https://github.com/nf-core/modules/pull/4802 ?

As far as I see, this is just the one main blocker, so would be:

  • Solving this nfcore/modules PR, which relates to the comments done about the fastqs renaming module (here).
    • Top priority
  • Once solving this one, I would just solve the other comments made by Greg so that is finished. Unless you guys have other comments on it.

fmalmeida avatar Feb 12 '24 09:02 fmalmeida

@klkeys, @apeltzer for me an open discussion is whether the renaming should be a separate module or not - would appreciate additional opinions: https://github.com/nf-core/modules/pull/4802#pullrequestreview-1843492577

In the past I opted for including the renaming into the cellranger module for efficiency (avoiding heavy file transfers on cloud setups), but from a maintenance perspective it would be attractive to separate it, since it can be applied to all 10x workflows (i.e. spaceranger, cellranger, cellranger arc, cellranger multi).

grst avatar Feb 12 '24 09:02 grst

Hi @grst ; @klkeys and @apeltzer .

background changes

I have solved almost all the comments that were opened and updated the code. The workflow has been tested and it seems to be working.

There is a new testing profile that you can use for checking it out during review:

nextflow run ../main.nf -profile docker,test_cellrangermulti --outdir ./result

But, would also be properly recommended to trigger it manually for another dataset.

open questions

Although most was resolved, there are still some things open for resolution:

  • The cellranger_multi module (and sub-workflow) contain a set of additional CSVs that can be given as input. Currently they are treated as parameters, meaning is one for all the samples in the input samplesheet. A question here is, will this be the case, or will we have one per sample?
    • See here: https://github.com/nf-core/scrnaseq/pull/276#discussion_r1413510264
  • Are we going to add the testing cmo barcodes to test_datasets now?
    • See here: https://github.com/nf-core/scrnaseq/pull/276#discussion_r1413499008
  • What will the colum name we will keep?
    • See here: https://github.com/nf-core/scrnaseq/pull/276#discussion_r1413497114

fmalmeida avatar Mar 01 '24 08:03 fmalmeida

See here for a full upgrade of all cellranger modules. https://github.com/nf-core/modules/pull/5016

apeltzer avatar Mar 01 '24 10:03 apeltzer

Hi @grst ,

FYI.

I will include in this PR, the fix for cellranger/kallisto subworkflows to properly parse the raw/filtered matrices, splitting the channels, without the need for having multiple output channel directives in the modules.

This happen during this here https://github.com/nf-core/modules/pull/5238 which showed that this was not wanted. So we reverted the modules.

But, after some day on it, I could find a way of splitting so we could pass raw/filtered information to the downstream modules to integrate properly with the new emptydrops work.

I will include it in this PR, the fix because it is are a few lines at the end of the sub-workflow, which is basically the same being added in this multi-subworkflow. And was during this PR that I found the fix.

fmalmeida avatar Mar 21 '24 11:03 fmalmeida

Hi @grst , @klkeys and @apeltzer ,

I have finalized the update of the modules and making sure the workflow runs. The pipeline runs for all the main aligners. I tested:

  • alevin
  • star
  • kallisto standard
  • kallisto standard with filtering
  • kallisto nac
  • kallisti lamanno
  • cellranger
  • cellrangermulti

Since I cannot share the results because they are big for attaching in the comment, I am adding a file containing the tree structure of all the testing runs I generated.

runs_tree_structure.txt

The workflow is updated and running for all of them. Cellranger multi is running as described in the first comment of the PR. The only open points thus far are the ones highlighted here: https://github.com/nf-core/scrnaseq/pull/276#issuecomment-1972760047

So, since it is working, and a few points are open, I consider this PR ready for a proper review, thus I am promoting it from draft to ready.

Wait your review and feedback.

nf-test

Although the github action fail, it runs through if I execute in gitpod: image

fmalmeida avatar Apr 02 '24 10:04 fmalmeida

Hi @grst , I do not quite get it this one: https://github.com/nf-core/scrnaseq/pull/276#discussion_r1551764285

I checked the current input samplesheet, and these values are not available there. But, and I saw that you tagged the three input csvs that the module receives. Are you referring to one specifically, or to all?

I am not sure I quite understood. Also, I am not the expert on these csvs, so do not know what they should be. Currently the tests have only the cmo barcodes csv. So, maybe @klkeys would help more with that.

Maybe we can have a chat the three of us to discuss this one and how to design it better now that we have something that runs?

fmalmeida avatar Apr 05 '24 07:04 fmalmeida

Currently, emptydrops will not be performed for cellranger/multi as I am not sure if it is relevant. https://github.com/nf-core/scrnaseq/blob/247-support-for-10x-ffpe-scrna/workflows/scrnaseq.nf#L278

I think it's still relevant, at least when it runs in FFPE or CMO mode (probably not so much in CRISPR/BEAM/ANTIBODY mode, but I don't have experience with these data types). But this can also be added in a follow-up PR.

grst avatar Apr 15 '24 08:04 grst

Hi @grst and @klkeys ,

Summary of latest code.

  • [x] An FFPE dataset was added in the test dataset to allow proper parsing
  • [x] During FFPE analysis, a few bugs and unproper handling of optional files in the cellranger/multi module were observed, and, a new PR was added here for that: https://github.com/nf-core/modules/pull/5542
  • There are a few TODOs in the work, that we may resolve in this PR, or in another, they are:
    • [ ] https://github.com/nf-core/scrnaseq/blob/247-support-for-10x-ffpe-scrna/subworkflows/local/align_cellrangermulti.nf#L24-L26
    • [x] https://github.com/nf-core/scrnaseq/blob/247-support-for-10x-ffpe-scrna/subworkflows/local/align_cellrangermulti.nf#L228
    • [x] https://github.com/nf-core/scrnaseq/blob/247-support-for-10x-ffpe-scrna/workflows/scrnaseq.nf#L290
    • [x] https://github.com/nf-core/scrnaseq/blob/247-support-for-10x-ffpe-scrna/nextflow_schema.json#L328-L355 where I need help in providing description to files.
  • [x] Finally, it also needs the nf-tests to be made ( I am currently working on them )
  • [x] Once all above is solved, to add documentation.

fmalmeida avatar Apr 30 '24 11:04 fmalmeida

Hi @grst, Following up the previous message. nf-test has now been added and is working with github actions (discussion here).

One question that comes up from that, is: whether should we add the ensembl references used in the nf-core/test-datasets or not.

fmalmeida avatar May 01 '24 10:05 fmalmeida

@nf-core-bot fix linting

fmalmeida avatar May 01 '24 14:05 fmalmeida

One question that comes up from that, is: whether should we add the ensembl references used in the nf-core/test-datasets or not.

Ideally, yes, but I don't really care tbh ;)

grst avatar May 02 '24 07:05 grst

HI @fmalmeida, @grst

I successfully completed the pipeline (5' Inmune profiling), which ran for approximately 7 hours. Notably, the CELLRANGER:MULTI process took the longest, spanning 5 hours and 45 minutes.

I had to restrict the run to half of my samples. Working with the ab data of 13.7GB and gex data of 59GB. When working with the total amount of data, the pipeline broke under the error "no disk space in scracht folder" despite having 200T available. Is this expected behavior?

Additionally, I observed that the specific process is allocated a time of 16 hours. In future runs, I aim to process 48 samples. Is it possible to adjust this setting for the extended sample size?

NNazeu avatar May 06 '24 10:05 NNazeu

@nf-core-bot fix linting

fmalmeida avatar May 06 '24 13:05 fmalmeida

HI @fmalmeida, @grst

I successfully completed the pipeline (5' Inmune profiling), which ran for approximately 7 hours. Notably, the CELLRANGER:MULTI process took the longest, spanning 5 hours and 45 minutes.

I had to restrict the run to half of my samples. Working with the ab data of 13.7GB and gex data of 59GB. When working with the total amount of data, the pipeline broke under the error "no disk space in scracht folder" despite having 200T available. Is this expected behavior?

Additionally, I observed that the specific process is allocated a time of 16 hours. In future runs, I aim to process 48 samples. Is it possible to adjust this setting for the extended sample size?

Hi @grst , You said you were able to run on one of your datasets. Do you know if this is normal? Did you also observe something like this, with so gigantic data generation?

I am not used to run this tool, so I do not know if this is expected or not.

Finally, @NNazeu , for you latter inquiry, you can modify anything from the configuration with a pattern like this one in a custom config file:

process {
    withName: CELLRANGER_MULTI {
        time = 24.h
    }
} 

Thanks for testing the PR.

fmalmeida avatar May 07 '24 12:05 fmalmeida

I did observe the long runtime and maybe we should consider requesting more CPUs for cellranger multi. I didn't observe any excessive disk usage, at least it stayed well below a terabyte.

@NNazeu, are you sure you aren't hitting any disk quotas?

grst avatar May 07 '24 12:05 grst

Ok. I think I'm done with the documentation.

grst avatar May 15 '24 09:05 grst

I don't think lint will pass before the template update is merged. However, since this is such massive PR, I'd do that after this one gets merged to dev to avoid further disruptions.

grst avatar May 17 '24 11:05 grst

Are we there then? I don't think there's anything pending for now.

Hi @grst,

I think so. The last things were the documentation and the things you had brought. There are a few TODOs but can be taken later. As the BEAM data.

This was the comment summarising it: https://github.com/nf-core/scrnaseq/pull/276#issuecomment-2084997587

We can have a last look on that, and open follow-ups, otherwise, it is.

fmalmeida avatar May 17 '24 12:05 fmalmeida

I have added a follow-up ticket for the open points: https://github.com/nf-core/scrnaseq/issues/332

fmalmeida avatar May 23 '24 07:05 fmalmeida