sarek
DRAFT: Concatenating vcfs
This is a DRAFT PR for #738.
Still lots more to be done and things that need to be discussed.
So far I am just concatenating the germline-vcfs from haplotypecaller and strelka, and placing the resulting vcf `<patient>.germline.vcf.gz` in the results-folder `results/variant_calling/concat/<patient>`.
@maxime doesn't want the concatenation to be optional.
I've set it up so that Sarek puts the concatenated `.vcf.gz`-file here: `results/variant_calling/concat/<patient>/<patient>.germline.vcf.gz`
Should there also be a `.tbi`-file for the vcf-file?
PR checklist
- [ ] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] If you've added a new tool - have you followed the pipeline conventions in the contribution docs
- [ ] If necessary, also make a PR on the nf-core/sarek branch on the nf-core/test-datasets repository.
- [ ] Make sure your code lints (`nf-core lint`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).
`nf-core lint` overall result: Passed :white_check_mark: :warning:
Posted for pipeline commit b32b4cf
- ✅ 151 tests passed
- ❔ 8 tests were ignored
- ❗ 2 tests had warnings
:heavy_exclamation_mark: Test warnings:
- pipeline_todos - TODO string in `methods_description_template.yml`: #Update the HTML below to your prefered methods description, e.g. add publication citation for this pipeline
- schema_description - No description provided in schema for parameter: `cnvkit_reference`
:grey_question: Tests ignored:
- files_exist - File is ignored: `conf/modules.config`
- files_exist - File is ignored: `conf/test.config`
- files_exist - File is ignored: `conf/test_full.config`
- files_unchanged - File ignored due to lint config: `assets/nf-core-sarek_logo_light.png`
- files_unchanged - File ignored due to lint config: `docs/images/nf-core-sarek_logo_light.png`
- files_unchanged - File ignored due to lint config: `docs/images/nf-core-sarek_logo_dark.png`
- files_unchanged - File ignored due to lint config: `lib/NfcoreTemplate.groovy`
- template_strings - template_strings
:white_check_mark: Tests passed:
- files_exist - File found: `.gitattributes`
- files_exist - File found: `.gitignore`
- files_exist - File found: `.nf-core.yml`
- files_exist - File found: `.editorconfig`
- files_exist - File found: `.prettierignore`
- files_exist - File found: `.prettierrc.yml`
- files_exist - File found: `CHANGELOG.md`
- files_exist - File found: `CITATIONS.md`
- files_exist - File found: `CODE_OF_CONDUCT.md`
- files_exist - File found: `CODE_OF_CONDUCT.md`
- files_exist - File found: `LICENSE` or `LICENSE.md` or `LICENCE` or `LICENCE.md`
- files_exist - File found: `nextflow_schema.json`
- files_exist - File found: `nextflow.config`
- files_exist - File found: `README.md`
- files_exist - File found: `.github/.dockstore.yml`
- files_exist - File found: `.github/CONTRIBUTING.md`
- files_exist - File found: `.github/ISSUE_TEMPLATE/bug_report.yml`
- files_exist - File found: `.github/ISSUE_TEMPLATE/config.yml`
- files_exist - File found: `.github/ISSUE_TEMPLATE/feature_request.yml`
- files_exist - File found: `.github/PULL_REQUEST_TEMPLATE.md`
- files_exist - File found: `.github/workflows/branch.yml`
- files_exist - File found: `.github/workflows/ci.yml`
- files_exist - File found: `.github/workflows/linting_comment.yml`
- files_exist - File found: `.github/workflows/linting.yml`
- files_exist - File found: `assets/email_template.html`
- files_exist - File found: `assets/email_template.txt`
- files_exist - File found: `assets/sendmail_template.txt`
- files_exist - File found: `assets/nf-core-sarek_logo_light.png`
- files_exist - File found: `docs/images/nf-core-sarek_logo_light.png`
- files_exist - File found: `docs/images/nf-core-sarek_logo_dark.png`
- files_exist - File found: `docs/output.md`
- files_exist - File found: `docs/README.md`
- files_exist - File found: `docs/README.md`
- files_exist - File found: `docs/usage.md`
- files_exist - File found: `lib/nfcore_external_java_deps.jar`
- files_exist - File found: `lib/NfcoreSchema.groovy`
- files_exist - File found: `lib/NfcoreTemplate.groovy`
- files_exist - File found: `lib/Utils.groovy`
- files_exist - File found: `lib/WorkflowMain.groovy`
- files_exist - File found: `main.nf`
- files_exist - File found: `assets/multiqc_config.yml`
- files_exist - File found: `conf/base.config`
- files_exist - File found: `conf/igenomes.config`
- files_exist - File found: `.github/workflows/awstest.yml`
- files_exist - File found: `.github/workflows/awsfulltest.yml`
- files_exist - File found: `lib/WorkflowSarek.groovy`
- files_exist - File found: `modules.json`
- files_exist - File found: `pyproject.toml`
- files_exist - File not found check: `Singularity`
- files_exist - File not found check: `parameters.settings.json`
- files_exist - File not found check: `.nf-core.yaml`
- files_exist - File not found check: `bin/markdown_to_html.r`
- files_exist - File not found check: `conf/aws.config`
- files_exist - File not found check: `.github/workflows/push_dockerhub.yml`
- files_exist - File not found check: `.github/ISSUE_TEMPLATE/bug_report.md`
- files_exist - File not found check: `.github/ISSUE_TEMPLATE/feature_request.md`
- files_exist - File not found check: `docs/images/nf-core-sarek_logo.png`
- files_exist - File not found check: `.markdownlint.yml`
- files_exist - File not found check: `.yamllint.yml`
- files_exist - File not found check: `lib/Checks.groovy`
- files_exist - File not found check: `lib/Completion.groovy`
- files_exist - File not found check: `lib/Workflow.groovy`
- files_exist - File not found check: `.travis.yml`
- nextflow_config - Config variable found: `manifest.name`
- nextflow_config - Config variable found: `manifest.nextflowVersion`
- nextflow_config - Config variable found: `manifest.description`
- nextflow_config - Config variable found: `manifest.version`
- nextflow_config - Config variable found: `manifest.homePage`
- nextflow_config - Config variable found: `timeline.enabled`
- nextflow_config - Config variable found: `trace.enabled`
- nextflow_config - Config variable found: `report.enabled`
- nextflow_config - Config variable found: `dag.enabled`
- nextflow_config - Config variable found: `process.cpus`
- nextflow_config - Config variable found: `process.memory`
- nextflow_config - Config variable found: `process.time`
- nextflow_config - Config variable found: `params.outdir`
- nextflow_config - Config variable found: `params.input`
- nextflow_config - Config variable found: `params.show_hidden_params`
- nextflow_config - Config variable found: `params.schema_ignore_params`
- nextflow_config - Config variable found: `manifest.mainScript`
- nextflow_config - Config variable found: `timeline.file`
- nextflow_config - Config variable found: `trace.file`
- nextflow_config - Config variable found: `report.file`
- nextflow_config - Config variable found: `dag.file`
- nextflow_config - Config variable (correctly) not found: `params.version`
- nextflow_config - Config variable (correctly) not found: `params.nf_required_version`
- nextflow_config - Config variable (correctly) not found: `params.container`
- nextflow_config - Config variable (correctly) not found: `params.singleEnd`
- nextflow_config - Config variable (correctly) not found: `params.igenomesIgnore`
- nextflow_config - Config variable (correctly) not found: `params.name`
- nextflow_config - Config `timeline.enabled` had correct value: `true`
- nextflow_config - Config `report.enabled` had correct value: `true`
- nextflow_config - Config `trace.enabled` had correct value: `true`
- nextflow_config - Config `dag.enabled` had correct value: `true`
- nextflow_config - Config `manifest.name` began with `nf-core/`
- nextflow_config - Config variable `manifest.homePage` began with https://github.com/nf-core/
- nextflow_config - Config `dag.file` ended with `.html`
- nextflow_config - Config variable `manifest.nextflowVersion` started with >= or !>=
- nextflow_config - Config `manifest.version` ends in `dev`: `'3.2dev'`
- nextflow_config - Config `params.custom_config_version` is set to `master`
- nextflow_config - Config `params.custom_config_base` is set to `https://raw.githubusercontent.com/nf-core/configs/master`
- nextflow_config - Lines for loading custom profiles found
- files_unchanged - `.gitattributes` matches the template
- files_unchanged - `.prettierrc.yml` matches the template
- files_unchanged - `CODE_OF_CONDUCT.md` matches the template
- files_unchanged - `LICENSE` matches the template
- files_unchanged - `.github/.dockstore.yml` matches the template
- files_unchanged - `.github/CONTRIBUTING.md` matches the template
- files_unchanged - `.github/ISSUE_TEMPLATE/bug_report.yml` matches the template
- files_unchanged - `.github/ISSUE_TEMPLATE/config.yml` matches the template
- files_unchanged - `.github/ISSUE_TEMPLATE/feature_request.yml` matches the template
- files_unchanged - `.github/PULL_REQUEST_TEMPLATE.md` matches the template
- files_unchanged - `.github/workflows/branch.yml` matches the template
- files_unchanged - `.github/workflows/linting_comment.yml` matches the template
- files_unchanged - `.github/workflows/linting.yml` matches the template
- files_unchanged - `assets/email_template.html` matches the template
- files_unchanged - `assets/email_template.txt` matches the template
- files_unchanged - `assets/sendmail_template.txt` matches the template
- files_unchanged - `docs/README.md` matches the template
- files_unchanged - `lib/nfcore_external_java_deps.jar` matches the template
- files_unchanged - `lib/NfcoreSchema.groovy` matches the template
- files_unchanged - `.gitignore` matches the template
- files_unchanged - `.prettierignore` matches the template
- files_unchanged - `pyproject.toml` matches the template
- actions_ci - '.github/workflows/ci.yml' is triggered on expected events
- actions_ci - '.github/workflows/ci.yml' checks minimum NF version
- actions_awstest - '.github/workflows/awstest.yml' is triggered correctly
- actions_awsfulltest - `.github/workflows/awsfulltest.yml` is triggered correctly
- actions_awsfulltest - `.github/workflows/awsfulltest.yml` does not use `-profile test`
- readme - README Nextflow minimum version badge matched config. Badge: `21.10.3`, Config: `21.10.3`
- readme - README Nextflow minimum version in Quick Start section matched config. README: `21.10.3`, Config: `21.10.3`
- pipeline_name_conventions - Name adheres to nf-core convention
- schema_lint - Schema lint passed
- schema_lint - Schema title + description lint passed
- schema_lint - Input mimetype lint passed: 'text/csv'
- schema_params - Schema matched params returned from nextflow config
- actions_schema_validation - Workflow validation passed: pytest-workflow.yml
- actions_schema_validation - Workflow validation passed: ci.yml
- actions_schema_validation - Workflow validation passed: fix-linting.yml
- actions_schema_validation - Workflow validation passed: awsfulltest.yml
- actions_schema_validation - Workflow validation passed: awstest.yml
- actions_schema_validation - Workflow validation passed: awsfulltest_germline.yml
- actions_schema_validation - Workflow validation passed: branch.yml
- actions_schema_validation - Workflow validation passed: linting_comment.yml
- actions_schema_validation - Workflow validation passed: linting.yml
- merge_markers - No merge markers found in pipeline files
- modules_json - Only installed modules found in `modules.json`
- multiqc_config - 'assets/multiqc_config.yml' follows the ordering scheme of the minimally required plugins.
- multiqc_config - 'assets/multiqc_config.yml' contains a matching 'report_comment'.
- multiqc_config - 'assets/multiqc_config.yml' contains 'export_plots: true'.
- modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'
Run details
- nf-core/tools version 2.6
- Run at `2022-12-06 19:06:23`
> So far I am just concatenating the germline-vcfs from haplotypecaller and strelka, and placing the resulting vcf `<patient>.germline.vcf.gz` in the results-folder `results/variant_calling/concat/<patient>`.

I think it's best to start small, and just do germline snps/indels for now.

> @maxime doesn't want the concatenation to be optional.

I do think it's better to have that optional, people can have different usage downstream.

> I've set it up so that Sarek puts the concatenated `.vcf.gz`-file here: `results/variant_calling/concat/<patient>/<patient>.germline.vcf.gz`
> Should there also be a `.tbi`-file for the vcf-file?

Yes, in my opinion, as long as we produce a vcf.gz, we should have it tabix indexed.
Can we create a `results/variant_calling/concat/<patient>/<patient>.germline.txt` to list all vcf that were concatenated to produce this file, or do we have that info in the final vcf?
Thanks for the feedback, @maxulysse. Much appreciated. I'll make the concatenation optional somehow :-)
Concerning your idea about the text-file - the vcf-file produced by `bcftools concat` already contains information about which vcf-files were concatenated:
##bcftools_concatCommand=concat --output test1.germline.vcf.gz --threads 1 test1.strelka.variants.vcf.gz test1.manta.diploid_sv.vcf.gz test1.haplotypecaller.filtered.vcf.gz; Date=Thu Nov 10 21:40:33 2022
I'd say that makes the text-file redundant, right?
> I'd say that makes the text-file redundant, right?

Yes, that's enough for me indeed.
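For anyone who wants that list of inputs without reading the header by eye, here is a minimal sketch of pulling the input files out of a `##bcftools_concatCommand` line. The helper name `concat_inputs` is hypothetical, and it assumes the command only uses the two options that appear in this thread (`--output` and `--threads`), each followed by one value; it is not part of Sarek or bcftools.

```python
import shlex

# Options seen in this thread that take one value (an assumption, not the full bcftools option list).
OPTIONS_WITH_VALUE = {"--output", "--threads"}


def concat_inputs(header_line):
    """Return the input VCF names recorded in a ##bcftools_concatCommand header line."""
    prefix = "##bcftools_concatCommand="
    if not header_line.startswith(prefix):
        raise ValueError("not a bcftools_concatCommand header line")
    # Drop the prefix and the trailing '; Date=...' part.
    command = header_line[len(prefix):].split(";", 1)[0]
    tokens = shlex.split(command)[1:]  # skip the 'concat' subcommand
    inputs = []
    skip_next = False
    for token in tokens:
        if skip_next:
            skip_next = False  # this token is the value of the previous option
        elif token in OPTIONS_WITH_VALUE:
            skip_next = True
        elif not token.startswith("-"):
            inputs.append(token)
    return inputs


line = (
    "##bcftools_concatCommand=concat --output test1.germline.vcf.gz --threads 1 "
    "test1.strelka.variants.vcf.gz test1.manta.diploid_sv.vcf.gz "
    "test1.haplotypecaller.filtered.vcf.gz; Date=Thu Nov 10 21:40:33 2022"
)
print(concat_inputs(line))
```

Run on the header from this thread, it lists the strelka, manta and haplotypecaller files, in the order they were passed to `bcftools concat`.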
@FriederikeHanssen @maxulysse : Can I get you guys to do a preliminary review of this PR?
If this PR looks okay, then I'll update the corresponding modules in github.com/nf-core/modules.
I've tested this PR with the following cmd:
nextflow run main.nf -profile test,singularity --input mapped_joint_bam.fixed.csv -dump-channels -ansi-log false --step variant_calling --concatenate_vcfs --tools cnvkit,deepvariant,freebayes,haplotypecaller,manta,mpileup,strelka,tiddit
and it gives me a concatenated germline-vcf-file which was made by this `bcftools concat` command:
##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.bcftools.vcf.gz testN.tiddit.vcf.gz testN.deepvariant.vcf.gz testN.freebayes.vcf.gz testN.manta.diploid_sv.vcf.gz testN.strelka.variants.vcf.gz testN.haplotypecaller.filtered.vcf.gz; Date=Tue Nov 29 10:47:08 2022
(N.B. cnvkit doesn't produce a vcf-file, so there are no variants from cnvkit in the concatenated vcf-file.)
In fact, two concatenated vcf-files were produced, since the input-samplesheet contains two bam-files:
results/variant_calling/concat/testN/testN.germline.vcf.gz
results/variant_calling/concat/testT/testT.germline.vcf.gz
The vcf-files are sorted and have corresponding tbi-files.
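As a small sketch of the output layout described here, the two expected paths per patient can be built like this. The function name `concat_outputs` is hypothetical, and the index name `<vcf>.tbi` is an assumption based on the usual tabix naming; the thread only says that tbi-files exist.

```python
from pathlib import PurePosixPath


def concat_outputs(outdir, patient):
    """Return (vcf, tbi) paths for a patient's concatenated germline VCF."""
    vcf = (
        PurePosixPath(outdir)
        / "variant_calling"
        / "concat"
        / patient
        / f"{patient}.germline.vcf.gz"
    )
    # Assumption: the index sits next to the VCF with '.tbi' appended.
    tbi = vcf.with_name(vcf.name + ".tbi")
    return vcf, tbi


for patient in ("testN", "testT"):
    vcf, tbi = concat_outputs("results", patient)
    print(vcf)
    print(tbi)
```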
Warning: This PR contains some real clumsy code: https://github.com/asp8200/sarek/blob/f8edc0034b9f01e3644ae75d7eaf57449581659c/workflows/sarek.nf#L1048-L1060
Do we want to annotate these vcfs in addition to the regular ones, or instead of them?

> Do we want to annotate these vcfs in addition to the regular ones, or instead of them?

I don't know. Right now I'm concatenating the un-annotated, germline vcf-files, and the resulting vcf-file is not getting annotated. I guess ideally the users could decide for themselves whether they want the un-annotated and/or annotated vcf-files concatenated.
Let's annotate as we are doing for now, we'll add annotating concatenated vcfs in a future PR.
> Let's annotate as we are doing for now, we'll add annotating concatenated vcfs in a future PR.

Yeah, I'd also say let's just keep this as simple as possible for now, and then later on evaluate if we want to go further with this kind of post-processing of the vcf-files. It may be the case that this kind of post-processing of vcf-files is best left to the users themselves as different users may have very different requirements 🤔
made an issue to keep track of ideas: #878
Damn! All the hard work I did with getting the variant-callers to return index-files all the way back to `sarek.nf` seems to be redundant, as I'll have to compute new index files after adding the INFO-field ~~SET~~ SOURCE to the vcf-files. The SOURCE-field will contain the name of the file from which the variant came.
Anyways, this is how it will look:
chr22 3420 . C G 9.1759e-05 . AB=0.2;....;TYPE=snp;technology.illumina=1;SOURCE=testT.freebayes.vcf.gz GT:DP:AD:RO:...
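To make the effect on a single record concrete, here is a rough sketch of appending `SOURCE=<file>` to the INFO column of a tab-separated VCF line. This is only an illustration, not the code used in the PR: the function name `add_source` and the wording of the header description are assumptions. A matching `##INFO` header line is needed as well, so that the new key is declared.

```python
# Header line declaring the new INFO key (the description text is an assumption).
SOURCE_HEADER = (
    '##INFO=<ID=SOURCE,Number=1,Type=String,'
    'Description="Name of the VCF file the variant came from">'
)


def add_source(record, source):
    """Append SOURCE=<source> to the INFO column (8th field) of a VCF data line."""
    fields = record.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    info = fields[7]
    tag = f"SOURCE={source}"
    # An empty INFO column is written as '.', which the new tag replaces.
    fields[7] = tag if info in (".", "") else f"{info};{tag}"
    return "\t".join(fields)


record = "chr22\t3420\t.\tC\tG\t9.1759e-05\t.\tAB=0.2;TYPE=snp\tGT:DP\t0/1:10"
print(add_source(record, "testT.freebayes.vcf.gz"))
```

Since the records change, any existing index no longer matches the compressed file, which is why new index files have to be computed afterwards.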
> Damn! All the hard work I did with getting the variant-callers to return index-files all the way back to `sarek.nf` seems to be redundant, as I'll have to compute new index files after adding the INFO-field ~~SET~~ SOURCE to the vcf-files.

🙈 oh no
Ok, so I introduced a local module for adding the INFO-field `SOURCE=<name-of-input-vcf-file>`. Here is the concatenated vcf-file
With the CLI-option `--concatenate_vcfs`, germline-vcf-files from the following variant-callers will be concatenated:
- deepvariant
- freebayes
- haplotypecaller
- manta
- mpileup
- strelka
- tiddit
In the attached concatenated vcf-files, there are no variants from manta or tiddit.
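As a small illustration of the list above, the sketch below picks out, from a `--tools` string, the callers whose germline vcf-files would end up in the concatenated file. The names `CONCATENATED_CALLERS` and `callers_to_concatenate` are hypothetical and do not come from Sarek's code; the set of callers is simply the one listed in this comment.

```python
# Callers listed in this comment as contributing to the concatenated germline VCF.
CONCATENATED_CALLERS = (
    "deepvariant",
    "freebayes",
    "haplotypecaller",
    "manta",
    "mpileup",
    "strelka",
    "tiddit",
)


def callers_to_concatenate(tools):
    """Return the requested tools (comma-separated) that contribute a germline VCF."""
    requested = [tool.strip().lower() for tool in tools.split(",") if tool.strip()]
    return [tool for tool in requested if tool in CONCATENATED_CALLERS]


tools = "cnvkit,deepvariant,freebayes,haplotypecaller,manta,mpileup,strelka,tiddit"
print(callers_to_concatenate(tools))
```

With the `--tools` value from the test command earlier in this thread, everything except cnvkit is returned, which matches the note that cnvkit does not produce a vcf-file.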
What do you guys think about this solution? I'm still passing the index-files from the variant-caller-modules all the way back to `sarek.nf`; that is actually not necessary with the usage of the local module. Should I get rid of the code passing the index-files from the variant-caller-modules back to `sarek.nf`? 🤔

> I'm still passing the index-files from the variant-caller-modules all the way back to `sarek.nf`; that is actually not necessary with the usage of the local module. Should I get rid of the code passing the index-files from the variant-caller-modules back to `sarek.nf`?

The fastest and easiest solution would be just to get rid of the (new) code which is passing the index-files back to `sarek.nf`, since then I don't have to update anything in `nf-core/modules` 😁
@maxulysse asked me to get rid of the redundant code, and so I did.
I now - finally - have all CI-tests passing. Let's merge this thing!
uh nice 🥳 🚀