sarek
sarek copied to clipboard
Combine all VCFs across different variants caller, such as mutect2, strelka2
Description of feature
We usually combine multiple software results in tumor variants calling, so I think it is necessary to use appropriate methods to combine them, such as union and consistency. I am looking forward to this feature
So do I <3
What is the right tool for this? bcftools concat
? We already have a module for that:
https://github.com/nf-core/modules/tree/master/modules/nf-core/bcftools/concat
@Githubguanxudong could you elaborate a bit more? We are trying to tackle this during the hackathon. Which tools did you have in mind? Do you have some papers/examples on which level should be merged?
As far as I could tell, we need to run bcftools concat
with the option --allow-overlaps
?
https://samtools.github.io/bcftools/bcftools.html
If one doesn't use --allow-overlaps
, the results from bcftools concat a.vcf b.vcf
and bcftools concat b.vcf a.vcf
may be different.
When running without --allow-overlaps
, one may get a warning like
The chromosome block chr1 is not contiguous, consider running with -a.
but the tools still produces an output-file (which is seemingly missing some variants from the input-files!?).
It is not clear to me what the option --allow-overlaps
actually does! 😬
@FriederikeHanssen and @maxulysse : The vcf-file resulting from the concatenation may contain the same variant several times:
chr22 2123 . C G 28 PASS SNVHPOL=3;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 0/1:30:12:3:2:1,2:0,1:1,1:-6.9:PASS:63,0,27
chr22 3140 . A G 42 LowGQX;LowDepth;NoPassedVariantGTs SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1/1:5:3:2:0:0,2:0,1:0,1:-8.9:LowGQX;LowDepth:78,6,0
[...]
chr22 2123 . C G 33.97 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.967;CNN_1D=3.270;DP=3;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=11.32;ReadPosRankSum=0.967;SOR=0.223 GT:AD:DP:GQ:PL 0/1:1,2:3:1:41,0,1
Such duplicates could perhaps be removed by the option --remove-duplicates
for bcftools concat
, but is that what we want?
Also, the vcf-file is not sorted. Should it perhaps be sorted?
Hi @Githubguanxudong ! How do you usually deal with the above? :)
@FriederikeHanssen
I was tagged by @asp8200 to look at this. We implemented a version of combining VCFs from structural variant callers before. We wanted to concatenate SV-calls from Manta, Cnvnator, Lumpy and Delly. So four different structural variant callers. The way our interpreters wanted it was that no variant calls were merged. One reason being that it can be hard to come to an agreement on which variants to merge. E.g. how large the overlap should be to merge. And for translocations how close the break-points should be. They also wanted to know from which variant caller the data came. We solved the concatenation using svdb --merge
[https://github.com/J35P312/SVDB]
and tweaked it in a way that it did not merge any calls :). Not even the duplicated calls were merged as they wanted to keep the calls separate, being able to filter in VarSeq on specific callers.
The vcf-files were sorted bcftools sort -O z
before and after merging and indexed using tabix -p vcf
.
Hope this is of any help. Just ping me if you have any questions.
We've decided that - for now - we just do concatenation of germline vcf-files.
@maxulysse @FriederikeHanssen @amasplund : I'll run bcftools sort z
followed by tabix -p vcf
on the concatenated vcf-file, ok?
So here I ran deepvariant, mpileup, strelka and freebayes, and I get the same variant called three times:
##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.deepvariant.vcf.gz testN.bcftools.vcf.gz testN.strelka.variants.vcf.gz testN.freebayes.vcf.gz; Date=Mon Nov 28 15:50:15
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT testN
chr22 1982 . A G 0.1 RefCall . GT:GQ:DP:AD:VAF:PL ./.:16:286:238,48:0.167832:0,15,26
chr22 1982 . A G 220.18 . DP=251;VDB=9.63142e-13;SGB=-0.692831;RPBZ=0.633339;MQBZ=0;MQSBZ=0;BQBZ=0.628639;NMBZ=-1.07033;SCBZ=-0.471405;FS=0;MQ0F=0;AC=1;AN=2;DP4=61,47,11,13;MQ=60 GT:PL 0/1:255,0,255
chr22 1982 . A G 459.724 . AB=0.167832;ABP=277.102;AC=1;AF=0.5;AN=2;AO=48;CIGAR=1X;DP=286;DPB=286;DPRA=0;EPP=3.0103;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=105.855;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=2017;QR=9863;RO=238;RPL=0;RPP=107.241;RPPR=519.821;RPR=48;RUN=1;SAF=24;SAP=3.0103;SAR=24;SRF=119;SRP=3.0103;SRR=119;TYPE=snp;technology.illumina=1 GT:DP:AD:RO:QR:AO:QA:GL 0/1:286:238,48:238:9863:48:2017:-95.4031,0,-799.898
I wonder if that is what the users would want? 🤔
So here I ran deepvariant, mpileup, strelka and freebayes, and I get the same variant called three times:
##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.deepvariant.vcf.gz testN.bcftools.vcf.gz testN.strelka.variants.vcf.gz testN.freebayes.vcf.gz; Date=Mon Nov 28 15:50:15 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT testN chr22 1982 . A G 0.1 RefCall . GT:GQ:DP:AD:VAF:PL ./.:16:286:238,48:0.167832:0,15,26 chr22 1982 . A G 220.18 . DP=251;VDB=9.63142e-13;SGB=-0.692831;RPBZ=0.633339;MQBZ=0;MQSBZ=0;BQBZ=0.628639;NMBZ=-1.07033;SCBZ=-0.471405;FS=0;MQ0F=0;AC=1;AN=2;DP4=61,47,11,13;MQ=60 GT:PL 0/1:255,0,255 chr22 1982 . A G 459.724 . AB=0.167832;ABP=277.102;AC=1;AF=0.5;AN=2;AO=48;CIGAR=1X;DP=286;DPB=286;DPRA=0;EPP=3.0103;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=105.855;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=2017;QR=9863;RO=238;RPL=0;RPP=107.241;RPPR=519.821;RPR=48;RUN=1;SAF=24;SAP=3.0103;SAR=24;SRF=119;SRP=3.0103;SRR=119;TYPE=snp;technology.illumina=1 GT:DP:AD:RO:QR:AO:QA:GL 0/1:286:238,48:238:9863:48:2017:-95.4031,0,-799.898
I wonder if that is what the users would want? thinking
So this is just the germline indels? If I understand well, in that case they're all concatenated together, so that's what I would expect. But I'm thinking this is just the first step, and people might also want the variants to be merged, so in your case, having just a final one variant instead of 3
yeah, is the meta info also the same across all?
yeah, is the meta info also the same across all?
Not sure I understand what you mean 🤔 The different variant-callers produces vcf-files with different INFO- and FORMAT-fields.
yeah maybe this info is interesting to keep? Not sure if we want to just throw it out to only keep one line per variant or if there is a way to merge these fields
Here is the concatenated (and sorted) vcf-file resulting from the following cmd:
nextflow run main.nf -profile test,singularity --input mapped_joint_bam.fixed.csv -dump-channels -ansi-log false --step variant_calling --concatenate_vcfs --tools deepvariant,freebayes,strelka,mpileup
It is basically the result of running bcftools concat
. I suggest that we just go with that for now.
yes sounds good. can always add more later
Just for the record: the cnvkit doesn't produce any vcf-files, so the concatenated vcf-file doesn't contain variants from the cnvkit.
Some comments:
- My experience is that CNVkit works really bad without a reference dataset as it uses coverage distributions (log2ratios I think) to determine if something is out of the ordinary.
- As the programs have different strengths (in regards to calling specific types of structural variants) and different ways of annotating the variants with meta data (e.g. quality), our users were interested in keeping all the info.
- I can't find info about the origin of the variants (which variant caller) in the example Anders sent https://github.com/nf-core/sarek/files/10105274/testN.germline.vcf.gz. The experienced user can probably tell from the INFO field, but our users wanted this info. Is there a way of adding this info using bcftools concat.
Some comments:
- My experience is that CNVkit works really bad without a reference dataset as it uses coverage distributions (log2ratios I think) to determine if something is out of the ordinary.
- As the programs have different strengths (in regards to calling specific types of structural variants) and different ways of annotating the variants with meta data (e.g. quality), our users were interested in keeping all the info.
- I can't find info about the origin of the variants (which variant caller) in the example Anders sent https://github.com/nf-core/sarek/files/10105274/testN.germline.vcf.gz. The experienced user can probably tell from the INFO field, but our users wanted this info. Is there a way of adding this info using bcftools concat.
@amasplund : Thanks for your feedback. It is much appreciated.
The vcf-file testN.germline.vcf.gz
just contains the following line which shows which input-vcf-files were used:
##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.deepvariant.vcf.gz testN.strelka.variants.vcf.gz testN.bcftools.vcf.gz testN.freebayes.vcf.gz; Date=Mon Nov 28 16:05:37 2022
I could imagine that it would be useful to have an INFO-field for each variant stating which variant-caller (or perhaps more realistically which vcf-file) found the given variant. Something like, for instance, VC=strelka
. However, as far as I can see, bcftools concat
doesn't support something like that.
@asp8200 Can you ad "VC=strelka" manually before the concat? Or is that to hacky?? :) Using svdb e.g. "set=delly" was added to the INFO field.
@asp8200 Maybe the best is to send the example to some "real" interpreters and see what they think :). With the risk of getting divergent replies :).
@asp8200 you could also ping the sarek community on slack to get some input. But as @amasplund mentioned quite possible we get some divergent opinions there, but that is always the case 😆 I like the idea of keeping all this INFO. Better a little more meta info than to little and to see how all of them compare this would be quite useful.
A suggestion for CNVkit (our users like it a lot), which is more work demanding though, is to have a CNVkit option where you add a reference .cnn file. https://cnvkit.readthedocs.io/en/stable/pipeline.html I haven't done this myself so I am not into the details. Users could then ad this for getting nice variant calls from CNVkit. I don't know if there is a NF-core pipeline for creating a CNVkit reference dataset?? Maybe for a future release :).
@amasplund a bit of topic here: but with the newest sarek release you can submit your own reference.cnn. In addition, Maxime has just proposed a new workflow to create all these cohort-of-normal like references, like: pon for mutect2, one for msisensorpro and cnvkit
Coming back to the point about adding an INFO-field to the vcf-files before concatenating them:
@amasplund come up with a nice, little bash+awk-script for adding some "constant" INFO-field to a vcf-file. How would Sarek run that script before doing the concatenation? As far as I can tell, there are two options:
- I make a "local" module with the bash-script, or
- I try to get the module (with the bash-script) enrolled in github.com/nf-core/modules
Neither of those solutions seem ideal. Anybody else have any suggestions on how to do this?
Closed by #792