sarek icon indicating copy to clipboard operation
sarek copied to clipboard

DRAFT: Concatenating vcfs

Open asp8200 opened this issue 2 years ago • 1 comments

This is a DRAFT PR for #738.

Still lots more to be done and things that needs to be discussed.

So far I am just concatenating the germline-vcfs from haplotypecaller and strelka, and placing the resulting vcf <patient>.germline.vcf.gz in the results-folder results/variant_calling/concat/<patient>.

@maxime doesn't want the concatenation to be optional.

I've set it up so that Sarek puts the concatenated .vcf.gz-file here:

results/variant_calling/concat/<patient>/<patient>.germline.vcf.gz

Should there also be a .tbi-file for the vcf-file?

PR checklist

  • [ ] This comment contains a description of changes (with reason).
  • [ ] If you've fixed a bug or added code that should be tested, add tests!
  • [ ] If you've added a new tool - have you followed the pipeline conventions in the contribution docs- [ ] If necessary, also make a PR on the nf-core/sarek branch on the nf-core/test-datasets repository.
  • [ ] Make sure your code lints (nf-core lint).
  • [ ] Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • [ ] Usage Documentation in docs/usage.md is updated.
  • [ ] Output Documentation in docs/output.md is updated.
  • [ ] CHANGELOG.md is updated.
  • [ ] README.md is updated (including new tool citations and authors/contributors).

asp8200 avatar Oct 11 '22 14:10 asp8200

nf-core lint overall result: Passed :white_check_mark: :warning:

Posted for pipeline commit b32b4cf

+| ✅ 151 tests passed       |+
#| ❔   8 tests were ignored |#
!| ❗   2 tests had warnings |!

:heavy_exclamation_mark: Test warnings:

  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your prefered methods description, e.g. add publication citation for this pipeline
  • schema_description - No description provided in schema for parameter: cnvkit_reference

:grey_question: Tests ignored:

:white_check_mark: Tests passed:

Run details

  • nf-core/tools version 2.6
  • Run at 2022-12-06 19:06:23

github-actions[bot] avatar Oct 11 '22 15:10 github-actions[bot]

So far I am just concatenating the germline-vcfs from haplotypecaller and strelka, and placing the resulting vcf <patient>.germline.vcf.gz in the results-folder results/variant_calling/concat/<patient>.

I think it's best to start small, and just do germline snps/indels for now.

@maxime doesn't want the concatenation to be optional.

I do think it's better to have that optional, people can have different usage downstream.

I've set it up so that Sarek puts the concatenated .vcf.gz-file here:

results/variant_calling/concat/<patient>/<patient>.germline.vcf.gz

Should there also be a .tbi-file for the vcf-file?

Yes, in my opinion, as long as we produce a vcf.gz, we should have it tabix indexed. Can we create a results/variant_calling/concat/<patient>/<patient>.germline.txt to list all vcf that were concatenated to produce this file, or do we have that info in the final vcf?

maxulysse avatar Nov 11 '22 10:11 maxulysse

So far I am just concatenating the germline-vcfs from haplotypecaller and strelka, and placing the resulting vcf <patient>.germline.vcf.gz in the results-folder results/variant_calling/concat/<patient>.

I think it's best to start small, and just do germline snps/indels for now.

@maxime doesn't want the concatenation to be optional.

I do think it's better to have that optional, people can have different usage downstream.

I've set it up so that Sarek puts the concatenated .vcf.gz-file here:

results/variant_calling/concat/<patient>/<patient>.germline.vcf.gz

Should there also be a .tbi-file for the vcf-file?

Yes, in my opinion, as long as we produce a vcf.gz, we should have it tabix indexed. Can we create a results/variant_calling/concat/<patient>/<patient>.germline.txt to list all vcf that were concatenated to produce this file, or do we have that info in the final vcf?

Thanks for the feedback, @maxulysse. Much appreciated. I'll make the concatenation optional somehow :-)

Concerning your idea about the text-file - the vcf-file produced by bcftools concat already contains information about which vcf-files where concatenated:

##bcftools_concatCommand=concat --output test1.germline.vcf.gz --threads 1 test1.strelka.variants.vcf.gz test1.manta.diploid_sv.vcf.gz test1.haplotypecaller.filtered.vcf.gz; Date=Thu Nov 10 21:40:33 2022

I'd say that makes the text-file redundant, right?

asp8200 avatar Nov 11 '22 10:11 asp8200

I'd say that makes the text-file redundant, right? yes, that's enough for me indeed

maxulysse avatar Nov 11 '22 10:11 maxulysse

@FriederikeHanssen @maxulysse : Can I get you guys to do a preliminary review of this PR?

If this PR looks okay, then I'll update the corresponding modules in github.com/nf-core/modules.

I've tested this PR with the following cmd:

nextflow run main.nf -profile test,singularity  --input mapped_joint_bam.fixed.csv -dump-channels -ansi-log false --step variant_calling --concatenate_vcfs --tools cnvkit,deepvariant,freebayes,haplotypecaller,manta,mpileup,strelka,tiddit

and it gives me a concatenated germline-vcf-file which was made by this bcftools concat - command:

##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.bcftools.vcf.gz testN.tiddit.vcf.gz testN.deepvariant.vcf.gz testN.freebayes.vcf.gz testN.manta.diploid_sv.vcf.gz testN.strelka.variants.vcf.gz testN.haplotypecaller.filtered.vcf.gz; Date=Tue Nov 29 10:47:08 2022

(N.B. The cnvkit doesn't produce a vcf-file, so no variants from cnvkit in the concatenated vcf-file.)

In fact, two concatenated vcf-files were produced, since the input-samplesheet contains to bam-files:

results/variant_calling/concat/testN/testN.germline.vcf.gz
results/variant_calling/concat/testT/testT.germline.vcf.gz

testN.germline.vcf.gz

The vcf-files are sorted and have corresponding tbi-files.

Warning: This PR contains some real clumsy code: https://github.com/asp8200/sarek/blob/f8edc0034b9f01e3644ae75d7eaf57449581659c/workflows/sarek.nf#L1048-L1060

asp8200 avatar Nov 29 '22 10:11 asp8200

Do we want to annotate these vcfs on plus the regular ones or instead?

maxulysse avatar Dec 01 '22 10:12 maxulysse

Do we want to annotate these vcfs on plus the regular ones or instead?

I don't know. Right now I'm concatenating the un-annotated, germline vcf-files, and the resulting vcf-file is not getting annotated. I guess ideally the user could decided for himself if he wants the un-annnotated and/or annotated vcf-files concatenated.

asp8200 avatar Dec 01 '22 10:12 asp8200

Let's annotate as we are doing for now, we'll add annotating concatenated vcfs in a future PR.

maxulysse avatar Dec 01 '22 10:12 maxulysse

Let's annotate as we are doing for now, we'll add annotating concatenated vcfs in a future PR.

Yeah, I'd also say let's just keep this as simple as possible for now, and then later on evaluate if we want to go further with this kind of post-processing of the vcf-files. It may be the case that this kind of post-processing of vcf-files is best left to the users themselves as different users may have very different requirements 🤔

asp8200 avatar Dec 01 '22 10:12 asp8200

Let's annotate as we are doing for now, we'll add annotating concatenated vcfs in a future PR.

Yeah, I'd also say let's just keep this as simple as possible for now, and then later on evaluate if we want to go further with this kind of post-processing of the vcf-files. It may be the case that this kind of post-processing of vcf-files is best left to the users themselves as different users may have very different requirements thinking

made an issue to keep track of ideas: #878

maxulysse avatar Dec 01 '22 10:12 maxulysse

Damn! All the hard work I did with getting the variant-callers to return index-files all the way back to sarek.nf seems to be redundant, as I'll have to compute new index files after adding the INFO-field ~~SET~~ SOURCE to the vcf-files. The SOURCE-field will contain the name of the file from whense the variant came.

Anyways, this is how it will look:

chr22   3420    .       C       G       9.1759e-05      .       AB=0.2;....;TYPE=snp;technology.illumina=1;SOURCE=testT.freebayes.vcf.gz     GT:DP:AD:RO:...

asp8200 avatar Dec 01 '22 13:12 asp8200

Damn! All the hard work I did with getting the variant-callers to return index-files all the way back to sarek.nf seems to be redundant, as I'll have to compute new index files after adding the INFO-field ~SET~ SOURCE to the vcf-files.

🙈 oh no

FriederikeHanssen avatar Dec 01 '22 21:12 FriederikeHanssen

Ok, so I introduced a local module for adding the INFO-field SOURCE=<name-of-input-vcf-file>. Here is the concatenated vcf-file

testN.germline.vcf.gz

With the CLI-options --concatenate_vcfs germline-vcf-files from the following variant-callers will be concatenated:

deepvariant
freebayes
haplotypecaller
manta
mpileup
strelka
tiddit

In the attached concatenated vcf-files, there are no variant from manta or tiddit.

What do you guys think about this solution? I'm still passing the index-files from the variant-caller-modules all the way back to sarek.nf; that is actually not necessary with the usage of the local module. Should I get rid of the code passing the index-files from the variant-caller-modules back to sarek.nf? 🤔

asp8200 avatar Dec 01 '22 22:12 asp8200

I'm still passing the index-files from the variant-caller-modules all the way back to sarek.nf; that is actually not necessary with the usage of the local module. Should I get rid of the code passing the index-files from the variant-caller-modules back to sarek.nf?

The fastest and easiest solution would be just to get rid of the (new) code which is passing the index-files back to sarek.nf, since then I don't have to update anything in nf-core/modules 😁

asp8200 avatar Dec 06 '22 09:12 asp8200

I'm still passing the index-files from the variant-caller-modules all the way back to sarek.nf; that is actually not necessary with the usage of the local module. Should I get rid of the code passing the index-files from the variant-caller-modules back to sarek.nf?

The fastest and easiest solution would be just to get rid of the (new) code which is passing the index-files back to sarek.nf, since then I don't have to update anything in nf-core/modules 😁

@maxulysse asked me to get rid of the redundant code, and so I did.

I now - finally - have all CI-tests passing. Let's merge this thing!

asp8200 avatar Dec 06 '22 20:12 asp8200

uh nice 🥳 🚀

FriederikeHanssen avatar Dec 07 '22 12:12 FriederikeHanssen