sarek Combine all VCFs across different variants caller, such as mutect2, strelka2

Description of feature

We usually combine multiple software results in tumor variants calling, so I think it is necessary to use appropriate methods to combine them, such as union and consistency. I am looking forward to this feature

Sep 08 '22 07:09 Githubguanxudong

So do I <3

Sep 08 '22 07:09 maxulysse

What is the right tool for this? bcftools concat? We already have a module for that:

https://github.com/nf-core/modules/tree/master/modules/nf-core/bcftools/concat

Oct 10 '22 12:10 asp8200

@Githubguanxudong could you elaborate a bit more? We are trying to tackle this during the hackathon. Which tools did you have in mind? Do you have some papers/examples on which level should be merged?

Oct 10 '22 12:10 FriederikeHanssen

As far as I could tell, we need to run bcftools concat with the option --allow-overlaps?

https://samtools.github.io/bcftools/bcftools.html

If one doesn't use --allow-overlaps, the results from bcftools concat a.vcf b.vcf and bcftools concat b.vcf a.vcf may be different.

When running without --allow-overlaps, one may get a warning like

The chromosome block chr1 is not contiguous, consider running with -a.

but the tools still produces an output-file (which is seemingly missing some variants from the input-files!?).

It is not clear to me what the option --allow-overlaps actually does! 😬

Oct 10 '22 16:10 asp8200

@FriederikeHanssen and @maxulysse : The vcf-file resulting from the concatenation may contain the same variant several times:

chr22   2123    .       C       G       28      PASS    SNVHPOL=3;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL    0/1:30:12:3:2:1,2:0,1:1,1:-6.9:PASS:63,0,27
chr22   3140    .       A       G       42      LowGQX;LowDepth;NoPassedVariantGTs      SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL    1/1:5:3:2:0:0,2:0,1:0,1:-8.9:LowGQX;LowDepth:78,6,0
[...]
chr22   2123    .       C       G       33.97   PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=0.967;CNN_1D=3.270;DP=3;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=11.32;ReadPosRankSum=0.967;SOR=0.223  GT:AD:DP:GQ:PL  0/1:1,2:3:1:41,0,1

Such duplicates could perhaps be removed by the option --remove-duplicates for bcftools concat, but is that what we want?

Also, the vcf-file is not sorted. Should it perhaps be sorted?

Nov 10 '22 21:11 asp8200

Hi @Githubguanxudong ! How do you usually deal with the above? :)

Nov 10 '22 21:11 FriederikeHanssen

@FriederikeHanssen I was tagged by @asp8200 to look at this. We implemented a version of combining VCFs from structural variant callers before. We wanted to concatenate SV-calls from Manta, Cnvnator, Lumpy and Delly. So four different structural variant callers. The way our interpreters wanted it was that no variant calls were merged. One reason being that it can be hard to come to an agreement on which variants to merge. E.g. how large the overlap should be to merge. And for translocations how close the break-points should be. They also wanted to know from which variant caller the data came. We solved the concatenation using svdb --merge [https://github.com/J35P312/SVDB] and tweaked it in a way that it did not merge any calls :). Not even the duplicated calls were merged as they wanted to keep the calls separate, being able to filter in VarSeq on specific callers. The vcf-files were sorted bcftools sort -O z before and after merging and indexed using tabix -p vcf. Hope this is of any help. Just ping me if you have any questions.

Nov 14 '22 14:11 amasplund

We've decided that - for now - we just do concatenation of germline vcf-files.

Nov 28 '22 13:11 asp8200

@maxulysse @FriederikeHanssen @amasplund : I'll run bcftools sort z followed by tabix -p vcf on the concatenated vcf-file, ok?

Nov 28 '22 13:11 asp8200

So here I ran deepvariant, mpileup, strelka and freebayes, and I get the same variant called three times:

##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.deepvariant.vcf.gz testN.bcftools.vcf.gz testN.strelka.variants.vcf.gz testN.freebayes.vcf.gz; Date=Mon Nov 28 15:50:15
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  testN
chr22   1982    .       A       G       0.1     RefCall .       GT:GQ:DP:AD:VAF:PL      ./.:16:286:238,48:0.167832:0,15,26
chr22   1982    .       A       G       220.18  .       DP=251;VDB=9.63142e-13;SGB=-0.692831;RPBZ=0.633339;MQBZ=0;MQSBZ=0;BQBZ=0.628639;NMBZ=-1.07033;SCBZ=-0.471405;FS=0;MQ0F=0;AC=1;AN=2;DP4=61,47,11,13;MQ=60        GT:PL   0/1:255,0,255
chr22   1982    .       A       G       459.724 .       AB=0.167832;ABP=277.102;AC=1;AF=0.5;AN=2;AO=48;CIGAR=1X;DP=286;DPB=286;DPRA=0;EPP=3.0103;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=105.855;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=2017;QR=9863;RO=238;RPL=0;RPP=107.241;RPPR=519.821;RPR=48;RUN=1;SAF=24;SAP=3.0103;SAR=24;SRF=119;SRP=3.0103;SRR=119;TYPE=snp;technology.illumina=1   GT:DP:AD:RO:QR:AO:QA:GL 0/1:286:238,48:238:9863:48:2017:-95.4031,0,-799.898

I wonder if that is what the users would want? 🤔

Nov 28 '22 14:11 asp8200

So here I ran deepvariant, mpileup, strelka and freebayes, and I get the same variant called three times:

##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.deepvariant.vcf.gz testN.bcftools.vcf.gz testN.strelka.variants.vcf.gz testN.freebayes.vcf.gz; Date=Mon Nov 28 15:50:15
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  testN
chr22   1982    .       A       G       0.1     RefCall .       GT:GQ:DP:AD:VAF:PL      ./.:16:286:238,48:0.167832:0,15,26
chr22   1982    .       A       G       220.18  .       DP=251;VDB=9.63142e-13;SGB=-0.692831;RPBZ=0.633339;MQBZ=0;MQSBZ=0;BQBZ=0.628639;NMBZ=-1.07033;SCBZ=-0.471405;FS=0;MQ0F=0;AC=1;AN=2;DP4=61,47,11,13;MQ=60        GT:PL   0/1:255,0,255
chr22   1982    .       A       G       459.724 .       AB=0.167832;ABP=277.102;AC=1;AF=0.5;AN=2;AO=48;CIGAR=1X;DP=286;DPB=286;DPRA=0;EPP=3.0103;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=105.855;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=2017;QR=9863;RO=238;RPL=0;RPP=107.241;RPPR=519.821;RPR=48;RUN=1;SAF=24;SAP=3.0103;SAR=24;SRF=119;SRP=3.0103;SRR=119;TYPE=snp;technology.illumina=1   GT:DP:AD:RO:QR:AO:QA:GL 0/1:286:238,48:238:9863:48:2017:-95.4031,0,-799.898

I wonder if that is what the users would want? thinking

So this is just the germline indels? If I understand well, in that case they're all concatenated together, so that's what I would expect. But I'm thinking this is just the first step, and people might also want the variants to be merged, so in your case, having just a final one variant instead of 3

Nov 28 '22 15:11 maxulysse

yeah, is the meta info also the same across all?

Nov 28 '22 15:11 FriederikeHanssen

yeah, is the meta info also the same across all?

Not sure I understand what you mean 🤔 The different variant-callers produces vcf-files with different INFO- and FORMAT-fields.

Nov 28 '22 15:11 asp8200

yeah maybe this info is interesting to keep? Not sure if we want to just throw it out to only keep one line per variant or if there is a way to merge these fields

Nov 28 '22 15:11 FriederikeHanssen

Here is the concatenated (and sorted) vcf-file resulting from the following cmd:

nextflow run main.nf -profile test,singularity  --input mapped_joint_bam.fixed.csv -dump-channels -ansi-log false --step variant_calling --concatenate_vcfs --tools deepvariant,freebayes,strelka,mpileup

testN.germline.vcf.gz

It is basically the result of running bcftools concat. I suggest that we just go with that for now.

Nov 28 '22 15:11 asp8200

yes sounds good. can always add more later

Nov 28 '22 17:11 FriederikeHanssen

Just for the record: the cnvkit doesn't produce any vcf-files, so the concatenated vcf-file doesn't contain variants from the cnvkit.

Nov 28 '22 19:11 asp8200

Some comments:

My experience is that CNVkit works really bad without a reference dataset as it uses coverage distributions (log2ratios I think) to determine if something is out of the ordinary.
As the programs have different strengths (in regards to calling specific types of structural variants) and different ways of annotating the variants with meta data (e.g. quality), our users were interested in keeping all the info.
I can't find info about the origin of the variants (which variant caller) in the example Anders sent https://github.com/nf-core/sarek/files/10105274/testN.germline.vcf.gz. The experienced user can probably tell from the INFO field, but our users wanted this info. Is there a way of adding this info using bcftools concat.

Nov 30 '22 12:11 amasplund

Some comments:

My experience is that CNVkit works really bad without a reference dataset as it uses coverage distributions (log2ratios I think) to determine if something is out of the ordinary.

As the programs have different strengths (in regards to calling specific types of structural variants) and different ways of annotating the variants with meta data (e.g. quality), our users were interested in keeping all the info.

I can't find info about the origin of the variants (which variant caller) in the example Anders sent https://github.com/nf-core/sarek/files/10105274/testN.germline.vcf.gz. The experienced user can probably tell from the INFO field, but our users wanted this info. Is there a way of adding this info using bcftools concat.

@amasplund : Thanks for your feedback. It is much appreciated.

The vcf-file testN.germline.vcf.gz just contains the following line which shows which input-vcf-files were used:

##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.deepvariant.vcf.gz testN.strelka.variants.vcf.gz testN.bcftools.vcf.gz testN.freebayes.vcf.gz; Date=Mon Nov 28 16:05:37 2022

I could imagine that it would be useful to have an INFO-field for each variant stating which variant-caller (or perhaps more realistically which vcf-file) found the given variant. Something like, for instance, VC=strelka. However, as far as I can see, bcftools concat doesn't support something like that.

Nov 30 '22 12:11 asp8200

@asp8200 Can you ad "VC=strelka" manually before the concat? Or is that to hacky?? :) Using svdb e.g. "set=delly" was added to the INFO field.

Nov 30 '22 12:11 amasplund

@asp8200 Maybe the best is to send the example to some "real" interpreters and see what they think :). With the risk of getting divergent replies :).

Nov 30 '22 12:11 amasplund

@asp8200 you could also ping the sarek community on slack to get some input. But as @amasplund mentioned quite possible we get some divergent opinions there, but that is always the case 😆 I like the idea of keeping all this INFO. Better a little more meta info than to little and to see how all of them compare this would be quite useful.

Nov 30 '22 13:11 FriederikeHanssen

A suggestion for CNVkit (our users like it a lot), which is more work demanding though, is to have a CNVkit option where you add a reference .cnn file. https://cnvkit.readthedocs.io/en/stable/pipeline.html I haven't done this myself so I am not into the details. Users could then ad this for getting nice variant calls from CNVkit. I don't know if there is a NF-core pipeline for creating a CNVkit reference dataset?? Maybe for a future release :).

Nov 30 '22 14:11 amasplund

@amasplund a bit of topic here: but with the newest sarek release you can submit your own reference.cnn. In addition, Maxime has just proposed a new workflow to create all these cohort-of-normal like references, like: pon for mutect2, one for msisensorpro and cnvkit

Nov 30 '22 15:11 FriederikeHanssen

Coming back to the point about adding an INFO-field to the vcf-files before concatenating them:

@amasplund come up with a nice, little bash+awk-script for adding some "constant" INFO-field to a vcf-file. How would Sarek run that script before doing the concatenation? As far as I can tell, there are two options:

I make a "local" module with the bash-script, or
I try to get the module (with the bash-script) enrolled in github.com/nf-core/modules

Neither of those solutions seem ideal. Anybody else have any suggestions on how to do this?

Dec 01 '22 08:12 asp8200

Closed by #792

Dec 07 '22 08:12 asp8200