scrnaseq
Output 10x counts
This PR adds support to generate 10x count files as output (features.tsv, barcodes.tsv, and matrix.mtx) as part of the pipeline.
Issue: https://github.com/nf-core/scrnaseq/issues/66
PR checklist
- [x] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] If you've added a new tool - have you followed the pipeline conventions in the contribution docs
- [ ] If necessary, also make a PR on the nf-core/scrnaseq branch on the nf-core/test-datasets repository.
- [x] Make sure your code lints (`nf-core lint`).
- [x] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [x] `CHANGELOG.md` is updated.
- [x] `README.md` is updated (including new tool citations and authors/contributors).
nf-core lint
overall result: Passed :white_check_mark:
Posted for pipeline commit aae4f5e
+| ✅ 158 tests passed |+
:white_check_mark: Tests passed:
-
files_exist - File found:
.gitattributes
-
files_exist - File found:
.gitignore
-
files_exist - File found:
.nf-core.yml
-
files_exist - File found:
.editorconfig
-
files_exist - File found:
.prettierignore
-
files_exist - File found:
.prettierrc.yml
-
files_exist - File found:
CHANGELOG.md
-
files_exist - File found:
CITATIONS.md
-
files_exist - File found:
CODE_OF_CONDUCT.md
-
files_exist - File found:
CODE_OF_CONDUCT.md
- files_exist - File found: `LICENSE` or `LICENSE.md` or `LICENCE` or `LICENCE.md`
-
files_exist - File found:
nextflow_schema.json
-
files_exist - File found:
nextflow.config
-
files_exist - File found:
README.md
-
files_exist - File found:
.github/.dockstore.yml
-
files_exist - File found:
.github/CONTRIBUTING.md
-
files_exist - File found:
.github/ISSUE_TEMPLATE/bug_report.yml
-
files_exist - File found:
.github/ISSUE_TEMPLATE/config.yml
-
files_exist - File found:
.github/ISSUE_TEMPLATE/feature_request.yml
-
files_exist - File found:
.github/PULL_REQUEST_TEMPLATE.md
-
files_exist - File found:
.github/workflows/branch.yml
-
files_exist - File found:
.github/workflows/ci.yml
-
files_exist - File found:
.github/workflows/linting_comment.yml
-
files_exist - File found:
.github/workflows/linting.yml
-
files_exist - File found:
assets/email_template.html
-
files_exist - File found:
assets/email_template.txt
-
files_exist - File found:
assets/sendmail_template.txt
-
files_exist - File found:
assets/nf-core-scrnaseq_logo_light.png
-
files_exist - File found:
conf/modules.config
-
files_exist - File found:
conf/test.config
-
files_exist - File found:
conf/test_full.config
-
files_exist - File found:
docs/images/nf-core-scrnaseq_logo_light.png
-
files_exist - File found:
docs/images/nf-core-scrnaseq_logo_dark.png
-
files_exist - File found:
docs/output.md
-
files_exist - File found:
docs/README.md
-
files_exist - File found:
docs/README.md
-
files_exist - File found:
docs/usage.md
-
files_exist - File found:
lib/nfcore_external_java_deps.jar
-
files_exist - File found:
lib/NfcoreSchema.groovy
-
files_exist - File found:
lib/NfcoreTemplate.groovy
-
files_exist - File found:
lib/Utils.groovy
-
files_exist - File found:
lib/WorkflowMain.groovy
-
files_exist - File found:
main.nf
-
files_exist - File found:
assets/multiqc_config.yml
-
files_exist - File found:
conf/base.config
-
files_exist - File found:
conf/igenomes.config
-
files_exist - File found:
.github/workflows/awstest.yml
-
files_exist - File found:
.github/workflows/awsfulltest.yml
-
files_exist - File found:
lib/WorkflowScrnaseq.groovy
-
files_exist - File found:
modules.json
-
files_exist - File found:
pyproject.toml
-
files_exist - File not found check:
Singularity
-
files_exist - File not found check:
parameters.settings.json
-
files_exist - File not found check:
.nf-core.yaml
-
files_exist - File not found check:
bin/markdown_to_html.r
-
files_exist - File not found check:
conf/aws.config
-
files_exist - File not found check:
.github/workflows/push_dockerhub.yml
-
files_exist - File not found check:
.github/ISSUE_TEMPLATE/bug_report.md
-
files_exist - File not found check:
.github/ISSUE_TEMPLATE/feature_request.md
-
files_exist - File not found check:
docs/images/nf-core-scrnaseq_logo.png
-
files_exist - File not found check:
.markdownlint.yml
-
files_exist - File not found check:
.yamllint.yml
-
files_exist - File not found check:
lib/Checks.groovy
-
files_exist - File not found check:
lib/Completion.groovy
-
files_exist - File not found check:
lib/Workflow.groovy
-
files_exist - File not found check:
.travis.yml
-
nextflow_config - Config variable found:
manifest.name
-
nextflow_config - Config variable found:
manifest.nextflowVersion
-
nextflow_config - Config variable found:
manifest.description
-
nextflow_config - Config variable found:
manifest.version
-
nextflow_config - Config variable found:
manifest.homePage
-
nextflow_config - Config variable found:
timeline.enabled
-
nextflow_config - Config variable found:
trace.enabled
-
nextflow_config - Config variable found:
report.enabled
-
nextflow_config - Config variable found:
dag.enabled
-
nextflow_config - Config variable found:
process.cpus
-
nextflow_config - Config variable found:
process.memory
-
nextflow_config - Config variable found:
process.time
-
nextflow_config - Config variable found:
params.outdir
-
nextflow_config - Config variable found:
params.input
-
nextflow_config - Config variable found:
params.show_hidden_params
-
nextflow_config - Config variable found:
params.schema_ignore_params
-
nextflow_config - Config variable found:
manifest.mainScript
-
nextflow_config - Config variable found:
timeline.file
-
nextflow_config - Config variable found:
trace.file
-
nextflow_config - Config variable found:
report.file
-
nextflow_config - Config variable found:
dag.file
-
nextflow_config - Config variable (correctly) not found:
params.version
-
nextflow_config - Config variable (correctly) not found:
params.nf_required_version
-
nextflow_config - Config variable (correctly) not found:
params.container
-
nextflow_config - Config variable (correctly) not found:
params.singleEnd
-
nextflow_config - Config variable (correctly) not found:
params.igenomesIgnore
-
nextflow_config - Config variable (correctly) not found:
params.name
- nextflow_config - Config `timeline.enabled` had correct value: `true`
- nextflow_config - Config `report.enabled` had correct value: `true`
- nextflow_config - Config `trace.enabled` had correct value: `true`
- nextflow_config - Config `dag.enabled` had correct value: `true`
- nextflow_config - Config `manifest.name` began with `nf-core/`
- nextflow_config - Config variable `manifest.homePage` began with https://github.com/nf-core/
- nextflow_config - Config `dag.file` ended with `.html`
- nextflow_config - Config variable `manifest.nextflowVersion` started with >= or !>=
- nextflow_config - Config `manifest.version` ends in `dev`: `'2.1.1dev'`
- nextflow_config - Config `params.custom_config_version` is set to `master`
- nextflow_config - Config `params.custom_config_base` is set to `https://raw.githubusercontent.com/nf-core/configs/master`
- nextflow_config - Lines for loading custom profiles found
- files_unchanged - `.gitattributes` matches the template
- files_unchanged - `.prettierrc.yml` matches the template
- files_unchanged - `CODE_OF_CONDUCT.md` matches the template
- files_unchanged - `LICENSE` matches the template
- files_unchanged - `.github/.dockstore.yml` matches the template
- files_unchanged - `.github/CONTRIBUTING.md` matches the template
- files_unchanged - `.github/ISSUE_TEMPLATE/bug_report.yml` matches the template
- files_unchanged - `.github/ISSUE_TEMPLATE/config.yml` matches the template
- files_unchanged - `.github/ISSUE_TEMPLATE/feature_request.yml` matches the template
- files_unchanged - `.github/PULL_REQUEST_TEMPLATE.md` matches the template
- files_unchanged - `.github/workflows/branch.yml` matches the template
- files_unchanged - `.github/workflows/linting_comment.yml` matches the template
- files_unchanged - `.github/workflows/linting.yml` matches the template
- files_unchanged - `assets/email_template.html` matches the template
- files_unchanged - `assets/email_template.txt` matches the template
- files_unchanged - `assets/sendmail_template.txt` matches the template
- files_unchanged - `assets/nf-core-scrnaseq_logo_light.png` matches the template
- files_unchanged - `docs/images/nf-core-scrnaseq_logo_light.png` matches the template
- files_unchanged - `docs/images/nf-core-scrnaseq_logo_dark.png` matches the template
- files_unchanged - `docs/README.md` matches the template
- files_unchanged - `lib/nfcore_external_java_deps.jar` matches the template
- files_unchanged - `lib/NfcoreSchema.groovy` matches the template
- files_unchanged - `lib/NfcoreTemplate.groovy` matches the template
- files_unchanged - `.gitignore` matches the template
- files_unchanged - `.prettierignore` matches the template
- files_unchanged - `pyproject.toml` matches the template
- actions_ci - '.github/workflows/ci.yml' is triggered on expected events
- actions_ci - '.github/workflows/ci.yml' checks minimum NF version
- actions_awstest - '.github/workflows/awstest.yml' is triggered correctly
- actions_awsfulltest - `.github/workflows/awsfulltest.yml` is triggered correctly
- actions_awsfulltest - `.github/workflows/awsfulltest.yml` does not use `-profile test`
- readme - README Nextflow minimum version badge matched config. Badge: `21.10.3`, Config: `21.10.3`
- readme - README Nextflow minimum version in Quick Start section matched config. README: `21.10.3`, Config: `21.10.3`
- pipeline_todos - No TODO strings found
- pipeline_name_conventions - Name adheres to nf-core convention
- template_strings - Did not find any Jinja template strings (118 files)
- schema_lint - Schema lint passed
- schema_lint - Schema title + description lint passed
- schema_lint - Input mimetype lint passed: 'text/csv'
- schema_params - Schema matched params returned from nextflow config
- actions_schema_validation - Workflow validation passed: awsfulltest.yml
- actions_schema_validation - Workflow validation passed: ci.yml
- actions_schema_validation - Workflow validation passed: linting.yml
- actions_schema_validation - Workflow validation passed: linting_comment.yml
- actions_schema_validation - Workflow validation passed: fix-linting.yml
- actions_schema_validation - Workflow validation passed: branch.yml
- actions_schema_validation - Workflow validation passed: awstest.yml
- merge_markers - No merge markers found in pipeline files
- modules_json - Only installed modules found in `modules.json`
- multiqc_config - 'assets/multiqc_config.yml' follows the ordering scheme of the minimally required plugins.
- multiqc_config - 'assets/multiqc_config.yml' contains a matching 'report_comment'.
- multiqc_config - 'assets/multiqc_config.yml' contains 'export_plots: true'.
- modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'
Run details
- nf-core/tools version 2.6
- Run at `2022-11-17 14:11:35`
Two questions:
- Isn't this essentially throwing away sample information? I see you are only storing the cell index in the `barcodes.tsv`.
- I'm afraid this will be quite slow when the merged dataset gets large (i.e. > 100k cells). This is one of the reasons to prefer binary formats like `h5` over text representation. Maybe not run this by default, or at least make it easy to switch off.
Regarding 1, the barcodes file contains the actual barcodes list, it looks like this:
AAACCTGAGCTAGTGG
AAACCTGCATCGACGC
...
Concerning the second point, do you know which part is the one making it slow? I agree to not run it by default. I'll add a new parameter.
> Regarding 1, the barcodes file contains the actual barcodes list, it looks like this

yes, so how will you actually know which barcode comes from which biological sample?

> Concerning the second point, do you know which part is the one making it slow? I agree to not run it by default. I'll add a new parameter.

Just writing a text file vs. writing a binary file. Tbh, I don't really have experience with writing these files, but reading feat/barcode/mtx is way slower than reading the h5 files produced by cellranger.

> > Regarding 1, the barcodes file contains the actual barcodes list, it looks like this
>
> yes, so how will you actually know which barcode comes from which biological sample?

Because those 3 files will be generated on a per-sample basis. This is the folder structure:
├── Sample_X_matrix
│   ├── barcodes.tsv
│   ├── features.tsv
│   └── matrix.mtx
└── Sample_Y_matrix
    ├── barcodes.tsv
    ├── features.tsv
    └── matrix.mtx
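
Since each `barcodes.tsv` lives in its own sample folder, the sample can be recovered from the directory name whenever per-sample files are combined later. The following is only a minimal sketch of that idea, assuming the folders end in `_matrix` as in the tree above; `sample_from_dir`, `label_barcodes`, and `load_labelled_barcodes` are hypothetical helpers for illustration and are not part of the pipeline.

```python
from pathlib import Path


def sample_from_dir(matrix_dir):
    """Derive the sample name from a folder such as 'Sample_X_matrix'."""
    name = Path(matrix_dir).name
    suffix = "_matrix"  # assumption: folder naming follows the tree above
    return name[: -len(suffix)] if name.endswith(suffix) else name


def label_barcodes(matrix_dir, barcodes):
    """Prefix each barcode with its sample so merged data stays unambiguous."""
    sample = sample_from_dir(matrix_dir)
    return [f"{sample}:{bc.strip()}" for bc in barcodes if bc.strip()]


def load_labelled_barcodes(outdir):
    """Collect labelled barcodes from every '<sample>_matrix/barcodes.tsv'."""
    labelled = []
    for bc_file in sorted(Path(outdir).glob("*_matrix/barcodes.tsv")):
        with bc_file.open() as handle:
            labelled.extend(label_barcodes(bc_file.parent, handle))
    return labelled
```

The `sample:barcode` separator is an arbitrary choice here; any scheme works as long as the same barcode sequence from two samples ends up with distinct labels.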
> > Concerning the second point, do you know which part is the one making it slow? I agree to not run it by default. I'll add a new parameter.
>
> Just writing a text file vs. writing a binary file. Tbh, I don't really have experience with writing these files, but reading feat/barcode/mtx is way slower than reading the h5 files produced by cellranger.

Yeah, unfortunately for some integrations we need those files in text format.
I have added the parameter `--output_10x` and it's disabled by default.
As mentioned on Slack, no new functionality should be needed to produce these files. Maybe some `publishDir` options need to be tuned to get them into the right places in the output directory.
All the aligners produce mtx files already and we use that Python script to build h5ad files from them. In the example output of the workflow:
- cellranger: https://nf-co.re/scrnaseq/results#scrnaseq/results-c86646e4a818397f4bddfffd641b34240423f3[…]ger/count/sample-pbmc8k/outs/filtered_feature_bc_matrix/
- kallisto: https://nf-co.re/scrnaseq/results#scrnaseq/results-c86646e4a818397f4bddfffd641b34240423f3[…]ligner_kallisto/kallisto/pbmc8k.count/counts_unfiltered/
- starsolo: https://nf-co.re/scrnaseq/results#scrnaseq/results-c86646e4a818397f4bddfffd641b34240423f3ea/aligner_star/star/pbmc8k/pbmc8k.Solo.out/Gene/filtered/
- I didn't find alevin, but that's probably just a missing publishDir in the config.
I got your point. However, I have checked those files and there are two problems with them.
- The matrix is transposed, so loading it in R will fail.
- The genes file contains the Ensembl IDs, but we want that file to also include the gene names (which we are doing in our code).
My goal is to format those files correctly, independently of the aligner, so we can add another step to upload them automatically for downstream analysis. I am wary of adding those "R formatting" options into the actual aligners' steps to avoid polluting them with formatting. How would you do that?
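
To make the second point concrete, here is a rough sketch of how a gene-ID-only features file could be enriched with gene names. It assumes a tab-separated t2g-like file whose columns are transcript ID, gene ID, and gene name, and it assumes the three-column `features.tsv` layout (ID, name, feature type) used by recent Cell Ranger output. `parse_t2g` and `enrich_features` are hypothetical names, not the functions used in this PR.

```python
def parse_t2g(lines):
    """Build a gene ID -> gene name map from 'transcript<TAB>gene_id<TAB>gene_name' rows."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[1]:
            continue  # skip rows without a gene ID or without a name column
        gene_id, gene_name = fields[1], fields[2]
        # Several transcripts share one gene: keep the first entry seen,
        # and fall back to the ID when the name column is empty.
        mapping.setdefault(gene_id, gene_name or gene_id)
    return mapping


def enrich_features(gene_ids, id_to_name, feature_type="Gene Expression"):
    """Return features.tsv rows as 'gene_id<TAB>gene_name<TAB>feature_type'."""
    rows = []
    for gene_id in gene_ids:
        gene_id = gene_id.strip()
        if not gene_id:
            continue
        # Unknown IDs keep the ID as their name so no row is dropped.
        rows.append("\t".join([gene_id, id_to_name.get(gene_id, gene_id), feature_type]))
    return rows
```

Falling back to the gene ID keeps the row count identical to the matrix dimension, which matters because the row order of `features.tsv` must match `matrix.mtx`.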
Actually, maybe this could be simplified and at the same time used to solve https://github.com/nf-core/scrnaseq/issues/159:
- Add gene symbols already in anndata (https://github.com/nf-core/scrnaseq/issues/159)
- Add an additional flag `--export_mtx` to @fmalmeida's `mtx_to_h5ad.py` script, to enable mtx export on demand.
- Output mtx directly from that Python script. That way we can profit from the file already being loaded into memory and there is very little complexity added (no more processes).
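
As a rough illustration of the "export on demand from data already in memory" idea, the sketch below writes the three 10x-style files only when a flag is set. It is standard-library Python working on plain `(row, col, value)` triplets rather than the AnnData object the real script holds, and `format_mtx` and `export_10x` are hypothetical names; the actual `mtx_to_h5ad.py` may look quite different.

```python
from pathlib import Path


def format_mtx(n_rows, n_cols, entries):
    """Render 0-based (row, col, value) triplets as Matrix Market coordinate text."""
    lines = [
        "%%MatrixMarket matrix coordinate integer general",
        f"{n_rows} {n_cols} {len(entries)}",
    ]
    # Matrix Market indices are 1-based.
    lines += [f"{r + 1} {c + 1} {v}" for r, c, v in entries]
    return "\n".join(lines) + "\n"


def export_10x(outdir, genes, barcodes, entries, export_mtx=False):
    """Write features.tsv, barcodes.tsv and matrix.mtx if export_mtx is set.

    `entries` holds (gene_index, barcode_index, count) triplets, i.e. genes
    are rows and barcodes are columns, as in 10x output.
    Returns the list of written paths (empty when the export is disabled).
    """
    if not export_mtx:
        return []
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    files = {
        "features.tsv": "".join(f"{g}\n" for g in genes),
        "barcodes.tsv": "".join(f"{b}\n" for b in barcodes),
        "matrix.mtx": format_mtx(len(genes), len(barcodes), entries),
    }
    written = []
    for name, text in files.items():
        path = out / name
        path.write_text(text)
        written.append(path)
    return written
```

Because the counts are reused from memory, the only extra cost is the text serialisation itself, which is the trade-off discussed earlier in this thread.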
At the same time we should think about how we organize the output directory. We currently have the `mtx_conversions` output directory for each aligner with the `h5ad` and `rds` files in them. This is fine, but I think the name is not ideal and it needs some substructure, e.g.
aligner_cellranger/matrices
├── per_sample
│   ├── h5ad
│   ├── mtx
│   └── seurat
└── merged
    ├── h5ad
    └── seurat
@apeltzer, @fmalmeida, @kafkasl, what do you think?
Well, I agree that standardizing the files so they can be used in the same way afterwards is a good idea. I guess, for example, instead of adding this export to the Python script, we could actually add a step before it that standardizes the files and saves the result. The other conversions would then start from these standardised matrices, and we would reshape the conversion modules and scripts to use that new (standard) formatting.
Maybe this makes more sense, no?
> At the same time we should think about how we organize the output directory. We currently have the mtx_conversions output directory for each aligner with the h5ad and rds files in them. This is fine, but I think the name is not ideal and it needs some substructure, e.g.

On this second comment, I totally agree that we should reshape it and I have no comments on it. I liked the structure proposed.

> export on the pyScript, we can actually add a step before, which does this standardization on files, and "saves" it and then the other conversions come from these standardised matrices and we work on re-shaping the conversion modules and scripts to use that new (standard) formatting.

It depends a bit on what needs to be done. Reading/writing mtx files in a Python script is way slower than h5(ad) files. Not reading (i.e. using the data already in memory) is even better. So purely from a runtime perspective, it is beneficial to read whatever output files the aligners create once and then write all desired outputs.
That being said, transposing an mtx matrix could probably be done on the command line using awk, which would be very fast.
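
The reason a line-by-line tool such as awk is enough: in a coordinate-format Matrix Market file, both the size line (`rows cols nnz`) and every entry line (`row col value`) are transposed by swapping their first two fields. The same streaming idea can be sketched in Python for illustration; `transpose_mtx` is a hypothetical helper, and it assumes a `general` (non-symmetric) coordinate file.

```python
def transpose_mtx(lines):
    """Yield the lines of a coordinate Matrix Market file with rows and columns swapped.

    Works as a stream, so the matrix is never loaded into memory.
    Assumes 'general' symmetry; the banner and comment lines pass through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("%") or not line.strip():
            yield line  # banner, comments, and blank lines are kept as-is
            continue
        fields = line.split()
        # The size line is 'rows cols nnz' and entries are 'row col value':
        # in both cases transposing means swapping the first two fields.
        fields[0], fields[1] = fields[1], fields[0]
        yield " ".join(fields)
```

The entries come out in the original order rather than sorted by the new row index, which the Matrix Market format permits.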
I agree that we should not read often but once and output what is necessary / standardize this a bit. The conversion modules also need to add versions for example, so would be great to do all of the above to make sure we're also following best practices 👍
If that closes even more issues, even better 🥳
I think what you mentioned @grst sounds good. We have an idea of how to add the gene symbols and the `--export_mtx` parameter. We will have to look into how to change the output structure. I don't know what @apeltzer refers to with "adding versions".
The current `mtx_...` modules do not output the required `versions.yml`, which should be done too 👍🏻 That's something that was missed but is also highlighted by the `nf-core lint` tool, so we should adhere to the standard of generating this when the conversion is done :-)
@grst @apeltzer I've modified the PR to do the following:
- Standardize the `mtx_conversions` outputs. They now look like this:
├── mtx_conversions
│   ├── Sample_X
│   │   ├── Sample_X_matrix.h5ad
│   │   ├── Sample_X_matrix.rds
│   │   ├── barcodes.tsv
│   │   ├── features.tsv
│   │   └── matrix.mtx
│   ├── Sample_Y
│   │   ├── Sample_Y_matrix.h5ad
│   │   ├── Sample_Y_matrix.rds
│   │   ├── barcodes.tsv
│   │   ├── features.tsv
│   │   └── matrix.mtx
- Export the 10x counts format files, `barcodes.tsv`, `features.tsv`, and `matrix.mtx`. This has been done inside an existing module from data already in memory, so there shouldn't be any performance impact other than writing to disk.
- I've enriched the `features.tsv` files with the gene names; by default they only have the gene IDs. I extracted them from txp2gene for kallisto, and the `geneInfo.tab` file for star & cellranger. For alevin, we haven't managed to find where to get that translation info so far.
- I added the `--export_mtx` param for exporting the 10x counts that you suggested, but setting it to false breaks the downstream process `mtx_to_seurat`, which depends on these matrix counts being exported, so I think it should not be added.
As a side note, the `prettier` command and `nf-core schema build` have conflicting formatting for `nextflow_schema.json`.
@grst I've addressed all your comments. The most important ones:
- mtx_to_h5ad now exports a versions file
- `cellrange_mtx_to_h5ad` has been deleted; its step now also uses `mtx_to_h5ad`
- improved the parsing of t2g-like files with your suggestions