sarek
DRAFT: Concatenating vcfs
This is a DRAFT PR for #738.
There is still a lot more to be done, and there are things that need to be discussed.
So far I am just concatenating the germline-vcfs from haplotypecaller and strelka, and placing the resulting vcf <patient>.germline.vcf.gz in the results-folder results/variant_calling/concat/<patient>.
@maxime doesn't want the concatenation to be optional.
I've set it up so that Sarek puts the concatenated .vcf.gz-file here:
results/variant_calling/concat/<patient>/<patient>.germline.vcf.gz
Should there also be a .tbi-file for the vcf-file?
PR checklist
- [ ] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] If you've added a new tool - have you followed the pipeline conventions in the contribution docs
- [ ] If necessary, also make a PR on the nf-core/sarek branch on the nf-core/test-datasets repository.
- [ ] Make sure your code lints (`nf-core lint`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).
nf-core lint overall result: Passed :white_check_mark: :warning:
Posted for pipeline commit b32b4cf
+| ✅ 151 tests passed |+
#| ❔ 8 tests were ignored |#
!| ❗ 2 tests had warnings |!
:heavy_exclamation_mark: Test warnings:
- pipeline_todos - TODO string in `methods_description_template.yml`: #Update the HTML below to your prefered methods description, e.g. add publication citation for this pipeline
- schema_description - No description provided in schema for parameter: `cnvkit_reference`
:grey_question: Tests ignored:
- files_exist - File is ignored: `conf/modules.config`
- files_exist - File is ignored: `conf/test.config`
- files_exist - File is ignored: `conf/test_full.config`
- files_unchanged - File ignored due to lint config: `assets/nf-core-sarek_logo_light.png`
- files_unchanged - File ignored due to lint config: `docs/images/nf-core-sarek_logo_light.png`
- files_unchanged - File ignored due to lint config: `docs/images/nf-core-sarek_logo_dark.png`
- files_unchanged - File ignored due to lint config: `lib/NfcoreTemplate.groovy`
- template_strings - template_strings
:white_check_mark: Tests passed:
- files_exist - File found: `.gitattributes`
- files_exist - File found: `.gitignore`
- files_exist - File found: `.nf-core.yml`
- files_exist - File found: `.editorconfig`
- files_exist - File found: `.prettierignore`
- files_exist - File found: `.prettierrc.yml`
- files_exist - File found: `CHANGELOG.md`
- files_exist - File found: `CITATIONS.md`
- files_exist - File found: `CODE_OF_CONDUCT.md`
- files_exist - File found: `CODE_OF_CONDUCT.md`
- files_exist - File found: `LICENSE` or `LICENSE.md` or `LICENCE` or `LICENCE.md`
- files_exist - File found: `nextflow_schema.json`
- files_exist - File found: `nextflow.config`
- files_exist - File found: `README.md`
- files_exist - File found: `.github/.dockstore.yml`
- files_exist - File found: `.github/CONTRIBUTING.md`
- files_exist - File found: `.github/ISSUE_TEMPLATE/bug_report.yml`
- files_exist - File found: `.github/ISSUE_TEMPLATE/config.yml`
- files_exist - File found: `.github/ISSUE_TEMPLATE/feature_request.yml`
- files_exist - File found: `.github/PULL_REQUEST_TEMPLATE.md`
- files_exist - File found: `.github/workflows/branch.yml`
- files_exist - File found: `.github/workflows/ci.yml`
- files_exist - File found: `.github/workflows/linting_comment.yml`
- files_exist - File found: `.github/workflows/linting.yml`
- files_exist - File found: `assets/email_template.html`
- files_exist - File found: `assets/email_template.txt`
- files_exist - File found: `assets/sendmail_template.txt`
- files_exist - File found: `assets/nf-core-sarek_logo_light.png`
- files_exist - File found: `docs/images/nf-core-sarek_logo_light.png`
- files_exist - File found: `docs/images/nf-core-sarek_logo_dark.png`
- files_exist - File found: `docs/output.md`
- files_exist - File found: `docs/README.md`
- files_exist - File found: `docs/README.md`
- files_exist - File found: `docs/usage.md`
- files_exist - File found: `lib/nfcore_external_java_deps.jar`
- files_exist - File found: `lib/NfcoreSchema.groovy`
- files_exist - File found: `lib/NfcoreTemplate.groovy`
- files_exist - File found: `lib/Utils.groovy`
- files_exist - File found: `lib/WorkflowMain.groovy`
- files_exist - File found: `main.nf`
- files_exist - File found: `assets/multiqc_config.yml`
- files_exist - File found: `conf/base.config`
- files_exist - File found: `conf/igenomes.config`
- files_exist - File found: `.github/workflows/awstest.yml`
- files_exist - File found: `.github/workflows/awsfulltest.yml`
- files_exist - File found: `lib/WorkflowSarek.groovy`
- files_exist - File found: `modules.json`
- files_exist - File found: `pyproject.toml`
- files_exist - File not found check: `Singularity`
- files_exist - File not found check: `parameters.settings.json`
- files_exist - File not found check: `.nf-core.yaml`
- files_exist - File not found check: `bin/markdown_to_html.r`
- files_exist - File not found check: `conf/aws.config`
- files_exist - File not found check: `.github/workflows/push_dockerhub.yml`
- files_exist - File not found check: `.github/ISSUE_TEMPLATE/bug_report.md`
- files_exist - File not found check: `.github/ISSUE_TEMPLATE/feature_request.md`
- files_exist - File not found check: `docs/images/nf-core-sarek_logo.png`
- files_exist - File not found check: `.markdownlint.yml`
- files_exist - File not found check: `.yamllint.yml`
- files_exist - File not found check: `lib/Checks.groovy`
- files_exist - File not found check: `lib/Completion.groovy`
- files_exist - File not found check: `lib/Workflow.groovy`
- files_exist - File not found check: `.travis.yml`
- nextflow_config - Config variable found: `manifest.name`
- nextflow_config - Config variable found: `manifest.nextflowVersion`
- nextflow_config - Config variable found: `manifest.description`
- nextflow_config - Config variable found: `manifest.version`
- nextflow_config - Config variable found: `manifest.homePage`
- nextflow_config - Config variable found: `timeline.enabled`
- nextflow_config - Config variable found: `trace.enabled`
- nextflow_config - Config variable found: `report.enabled`
- nextflow_config - Config variable found: `dag.enabled`
- nextflow_config - Config variable found: `process.cpus`
- nextflow_config - Config variable found: `process.memory`
- nextflow_config - Config variable found: `process.time`
- nextflow_config - Config variable found: `params.outdir`
- nextflow_config - Config variable found: `params.input`
- nextflow_config - Config variable found: `params.show_hidden_params`
- nextflow_config - Config variable found: `params.schema_ignore_params`
- nextflow_config - Config variable found: `manifest.mainScript`
- nextflow_config - Config variable found: `timeline.file`
- nextflow_config - Config variable found: `trace.file`
- nextflow_config - Config variable found: `report.file`
- nextflow_config - Config variable found: `dag.file`
- nextflow_config - Config variable (correctly) not found: `params.version`
- nextflow_config - Config variable (correctly) not found: `params.nf_required_version`
- nextflow_config - Config variable (correctly) not found: `params.container`
- nextflow_config - Config variable (correctly) not found: `params.singleEnd`
- nextflow_config - Config variable (correctly) not found: `params.igenomesIgnore`
- nextflow_config - Config variable (correctly) not found: `params.name`
- nextflow_config - Config `timeline.enabled` had correct value: `true`
- nextflow_config - Config `report.enabled` had correct value: `true`
- nextflow_config - Config `trace.enabled` had correct value: `true`
- nextflow_config - Config `dag.enabled` had correct value: `true`
- nextflow_config - Config variable `manifest.name` began with `nf-core/`
- nextflow_config - Config variable `manifest.homePage` began with https://github.com/nf-core/
- nextflow_config - Config `dag.file` ended with `.html`
- nextflow_config - Config variable `manifest.nextflowVersion` started with >= or !>=
- nextflow_config - Config `manifest.version` ends in `dev`: `'3.2dev'`
- nextflow_config - Config `params.custom_config_version` is set to `master`
- nextflow_config - Config `params.custom_config_base` is set to `https://raw.githubusercontent.com/nf-core/configs/master`
- nextflow_config - Lines for loading custom profiles found
- files_unchanged - `.gitattributes` matches the template
- files_unchanged - `.prettierrc.yml` matches the template
- files_unchanged - `CODE_OF_CONDUCT.md` matches the template
- files_unchanged - `LICENSE` matches the template
- files_unchanged - `.github/.dockstore.yml` matches the template
- files_unchanged - `.github/CONTRIBUTING.md` matches the template
- files_unchanged - `.github/ISSUE_TEMPLATE/bug_report.yml` matches the template
- files_unchanged - `.github/ISSUE_TEMPLATE/config.yml` matches the template
- files_unchanged - `.github/ISSUE_TEMPLATE/feature_request.yml` matches the template
- files_unchanged - `.github/PULL_REQUEST_TEMPLATE.md` matches the template
- files_unchanged - `.github/workflows/branch.yml` matches the template
- files_unchanged - `.github/workflows/linting_comment.yml` matches the template
- files_unchanged - `.github/workflows/linting.yml` matches the template
- files_unchanged - `assets/email_template.html` matches the template
- files_unchanged - `assets/email_template.txt` matches the template
- files_unchanged - `assets/sendmail_template.txt` matches the template
- files_unchanged - `docs/README.md` matches the template
- files_unchanged - `lib/nfcore_external_java_deps.jar` matches the template
- files_unchanged - `lib/NfcoreSchema.groovy` matches the template
- files_unchanged - `.gitignore` matches the template
- files_unchanged - `.prettierignore` matches the template
- files_unchanged - `pyproject.toml` matches the template
- actions_ci - '.github/workflows/ci.yml' is triggered on expected events
- actions_ci - '.github/workflows/ci.yml' checks minimum NF version
- actions_awstest - '.github/workflows/awstest.yml' is triggered correctly
- actions_awsfulltest - `.github/workflows/awsfulltest.yml` is triggered correctly
- actions_awsfulltest - `.github/workflows/awsfulltest.yml` does not use `-profile test`
- readme - README Nextflow minimum version badge matched config. Badge: `21.10.3`, Config: `21.10.3`
- readme - README Nextflow minimum version in Quick Start section matched config. README: `21.10.3`, Config: `21.10.3`
- pipeline_name_conventions - Name adheres to nf-core convention
- schema_lint - Schema lint passed
- schema_lint - Schema title + description lint passed
- schema_lint - Input mimetype lint passed: 'text/csv'
- schema_params - Schema matched params returned from nextflow config
- actions_schema_validation - Workflow validation passed: pytest-workflow.yml
- actions_schema_validation - Workflow validation passed: ci.yml
- actions_schema_validation - Workflow validation passed: fix-linting.yml
- actions_schema_validation - Workflow validation passed: awsfulltest.yml
- actions_schema_validation - Workflow validation passed: awstest.yml
- actions_schema_validation - Workflow validation passed: awsfulltest_germline.yml
- actions_schema_validation - Workflow validation passed: branch.yml
- actions_schema_validation - Workflow validation passed: linting_comment.yml
- actions_schema_validation - Workflow validation passed: linting.yml
- merge_markers - No merge markers found in pipeline files
- modules_json - Only installed modules found in `modules.json`
- multiqc_config - 'assets/multiqc_config.yml' follows the ordering scheme of the minimally required plugins.
- multiqc_config - 'assets/multiqc_config.yml' contains a matching 'report_comment'.
- multiqc_config - 'assets/multiqc_config.yml' contains 'export_plots: true'.
- modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'
Run details
- nf-core/tools version 2.6
- Run at `2022-12-06 19:06:23`
> So far I am just concatenating the germline-vcfs from haplotypecaller and strelka, and placing the resulting vcf `<patient>.germline.vcf.gz` in the results-folder `results/variant_calling/concat/<patient>`.
I think it's best to start small, and just do germline snps/indels for now.
> @maxime doesn't want the concatenation to be optional.
I do think it's better to have that optional, people can have different usage downstream.
> I've set it up so that Sarek puts the concatenated `.vcf.gz`-file here: `results/variant_calling/concat/<patient>/<patient>.germline.vcf.gz` Should there also be a `.tbi`-file for the vcf-file?
Yes, in my opinion, as long as we produce a vcf.gz, we should have it tabix indexed.
Can we create a `results/variant_calling/concat/<patient>/<patient>.germline.txt` to list all the vcf-files that were concatenated to produce this file, or do we have that info in the final vcf?
Thanks for the feedback, @maxulysse. Much appreciated. I'll make the concatenation optional somehow :-)
Concerning your idea about the text-file: the vcf-file produced by bcftools concat already contains information about which vcf-files were concatenated:
##bcftools_concatCommand=concat --output test1.germline.vcf.gz --threads 1 test1.strelka.variants.vcf.gz test1.manta.diploid_sv.vcf.gz test1.haplotypecaller.filtered.vcf.gz; Date=Thu Nov 10 21:40:33 2022
I'd say that makes the text-file redundant, right?
> I'd say that makes the text-file redundant, right?
Yes, that's enough for me indeed.
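For anyone who still wants the list of inputs as plain text, here is a minimal Python sketch that reads it back out of the `##bcftools_concatCommand` header line quoted above. It is not part of the PR. The function name `concat_inputs` is made up for illustration, and the sketch only knows about the two options that appear in this thread (`--output` and `--threads`); any other option that takes a value would need to be added to `OPTIONS_WITH_VALUE`.

```python
import shlex
from typing import List

PREFIX = "##bcftools_concatCommand="

# Assumption: only the value-taking options seen in this thread are handled.
OPTIONS_WITH_VALUE = {"--output", "--threads"}


def concat_inputs(header_line: str) -> List[str]:
    """Return the input files named in a bcftools_concatCommand header line."""
    if not header_line.startswith(PREFIX):
        raise ValueError("not a bcftools_concatCommand header line")
    # Drop the prefix and the trailing "; Date=..." part.
    command = header_line[len(PREFIX):].split("; Date=", 1)[0]
    tokens = shlex.split(command)
    inputs = []
    skip_next = False
    for token in tokens[1:]:  # tokens[0] is the subcommand, "concat"
        if skip_next:
            skip_next = False  # this token is the value of the previous option
        elif token in OPTIONS_WITH_VALUE:
            skip_next = True
        elif not token.startswith("-"):
            inputs.append(token)
    return inputs


header = (
    "##bcftools_concatCommand=concat --output test1.germline.vcf.gz --threads 1 "
    "test1.strelka.variants.vcf.gz test1.manta.diploid_sv.vcf.gz "
    "test1.haplotypecaller.filtered.vcf.gz; Date=Thu Nov 10 21:40:33 2022"
)
for name in concat_inputs(header):
    print(name)
```

Since the header already carries this information, the sketch mainly shows that a separate `.txt` file can be regenerated on demand rather than published by the pipeline.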
@FriederikeHanssen @maxulysse : Can I get you guys to do a preliminary review of this PR?
If this PR looks okay, then I'll update the corresponding modules in github.com/nf-core/modules.
I've tested this PR with the following cmd:
nextflow run main.nf -profile test,singularity --input mapped_joint_bam.fixed.csv -dump-channels -ansi-log false --step variant_calling --concatenate_vcfs --tools cnvkit,deepvariant,freebayes,haplotypecaller,manta,mpileup,strelka,tiddit
and it gives me a concatenated germline-vcf-file which was made by this bcftools concat - command:
##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.bcftools.vcf.gz testN.tiddit.vcf.gz testN.deepvariant.vcf.gz testN.freebayes.vcf.gz testN.manta.diploid_sv.vcf.gz testN.strelka.variants.vcf.gz testN.haplotypecaller.filtered.vcf.gz; Date=Tue Nov 29 10:47:08 2022
(N.B. cnvkit doesn't produce a vcf-file, so the concatenated vcf-file contains no variants from cnvkit.)
In fact, two concatenated vcf-files were produced, since the input-samplesheet contains two bam-files:
results/variant_calling/concat/testN/testN.germline.vcf.gz
results/variant_calling/concat/testT/testT.germline.vcf.gz
The vcf-files are sorted and have corresponding tbi-files.
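One way to spot-check the "sorted" claim on the decompressed text of a vcf-file is sketched below. This is an illustration, not code from the PR: `is_coordinate_sorted` is a hypothetical helper, and it assumes "sorted" means that each contig appears as one contiguous block with non-decreasing positions inside it, which is the layout tabix indexing relies on.

```python
from typing import Iterable


def is_coordinate_sorted(lines: Iterable[str]) -> bool:
    """Check that VCF records are grouped by contig and ordered by position."""
    finished_or_current = set()  # contigs seen so far
    current = None
    last_pos = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos_text = line.split("\t", 2)[:2]
        pos = int(pos_text)
        if chrom != current:
            if chrom in finished_or_current:
                return False  # contig reappears after another contig
            finished_or_current.add(chrom)
            current, last_pos = chrom, pos
        elif pos < last_pos:
            return False  # position went backwards within a contig
        else:
            last_pos = pos
    return True
```

The input can be any iterable of lines, for example a file opened with `gzip.open(path, "rt")`.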
Warning: This PR contains some real clumsy code: https://github.com/asp8200/sarek/blob/f8edc0034b9f01e3644ae75d7eaf57449581659c/workflows/sarek.nf#L1048-L1060
Do we want to annotate these vcfs in addition to the regular ones, or instead of them?
> Do we want to annotate these vcfs in addition to the regular ones, or instead of them?
I don't know. Right now I'm concatenating the un-annotated germline vcf-files, and the resulting vcf-file is not getting annotated. I guess ideally users could decide for themselves whether they want the un-annotated and/or annotated vcf-files concatenated.
Let's annotate as we are doing for now, we'll add annotating concatenated vcfs in a future PR.
> Let's annotate as we are doing for now, we'll add annotating concatenated vcfs in a future PR.
Yeah, I'd also say let's just keep this as simple as possible for now, and then later on evaluate if we want to go further with this kind of post-processing of the vcf-files. It may be the case that this kind of post-processing of vcf-files is best left to the users themselves as different users may have very different requirements 🤔
made an issue to keep track of ideas: #878
Damn! All the hard work I did with getting the variant-callers to return index-files all the way back to sarek.nf seems to be redundant, as I'll have to compute new index files after adding the INFO-field ~~SET~~ SOURCE to the vcf-files. The SOURCE-field will contain the name of the file the variant came from.
Anyways, this is how it will look:
chr22 3420 . C G 9.1759e-05 . AB=0.2;....;TYPE=snp;technology.illumina=1;SOURCE=testT.freebayes.vcf.gz GT:DP:AD:RO:...
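To make the idea concrete, here is a minimal plain-Python sketch of tagging every record of one vcf-file with `SOURCE=<file name>`. It is an assumption-laden illustration and not the local module from this PR: the function name `add_source`, the header description text, and the `Number=1,Type=String` declaration are all choices made for this example.

```python
from typing import Iterable, Iterator

# Assumption: the wording and typing of this header line are illustrative only.
SOURCE_HEADER = (
    '##INFO=<ID=SOURCE,Number=1,Type=String,'
    'Description="Name of the vcf-file the variant came from">'
)


def add_source(lines: Iterable[str], source_name: str) -> Iterator[str]:
    """Yield VCF lines with SOURCE=<source_name> appended to every INFO column."""
    tag = f"SOURCE={source_name}"
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#CHROM"):
            # Declare the new INFO field just before the column header line.
            yield SOURCE_HEADER
            yield line
        elif line.startswith("#"):
            yield line
        elif line:
            fields = line.split("\t")
            info = fields[7]  # INFO is the eighth VCF column
            fields[7] = tag if info in (".", "") else f"{info};{tag}"
            yield "\t".join(fields)
```

Because the record content changes, any index computed on the original file no longer matches the output, which is why the tagged files need to be compressed and indexed again before concatenation.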
> Damn! All the hard work I did with getting the variant-callers to return index-files all the way back to `sarek.nf` seems to be redundant, as I'll have to compute new index files after adding the INFO-field ~~SET~~ SOURCE to the vcf-files.
🙈 oh no
Ok, so I introduced a local module for adding the INFO-field SOURCE=<name-of-input-vcf-file>. Here is the concatenated vcf-file
With the CLI option `--concatenate_vcfs`, the germline vcf-files from the following variant callers will be concatenated:
- deepvariant
- freebayes
- haplotypecaller
- manta
- mpileup
- strelka
- tiddit
In the attached concatenated vcf-files, there are no variants from manta or tiddit.
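As a rough illustration of how the list above relates to the `--tools` value, the following Python sketch filters a comma-separated tools string down to the callers named in this thread. The pipeline itself is written in Nextflow, so this is not the PR's code; `callers_to_concatenate` and `GERMLINE_VCF_CALLERS` are hypothetical names.

```python
from typing import List

# Callers whose germline vcf-files are concatenated, as listed in this thread.
GERMLINE_VCF_CALLERS = (
    "deepvariant",
    "freebayes",
    "haplotypecaller",
    "manta",
    "mpileup",
    "strelka",
    "tiddit",
)


def callers_to_concatenate(tools: str) -> List[str]:
    """Return the requested tools that contribute a vcf-file to the concatenation."""
    requested = {name.strip().lower() for name in tools.split(",") if name.strip()}
    return [caller for caller in GERMLINE_VCF_CALLERS if caller in requested]


# The --tools value from the test command earlier in the thread.
tools = "cnvkit,deepvariant,freebayes,haplotypecaller,manta,mpileup,strelka,tiddit"
print(callers_to_concatenate(tools))
```

cnvkit is dropped because it doesn't produce a vcf-file, as noted earlier in the thread.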
What do you guys think about this solution? I'm still passing the index-files from the variant-caller-modules all the way back to sarek.nf; that is actually not necessary with the usage of the local module. Should I get rid of the code passing the index-files from the variant-caller-modules back to sarek.nf? 🤔
> I'm still passing the index-files from the variant-caller-modules all the way back to `sarek.nf`; that is actually not necessary with the usage of the local module. Should I get rid of the code passing the index-files from the variant-caller-modules back to `sarek.nf`?
The fastest and easiest solution would be just to get rid of the (new) code which is passing the index-files back to sarek.nf, since then I don't have to update anything in nf-core/modules 😁
@maxulysse asked me to get rid of the redundant code, and so I did.
I now - finally - have all CI-tests passing. Let's merge this thing!
uh nice 🥳 🚀