Output VCF: variant information (SNPs, indels, SVs, etc.)

Open alinehugo opened this issue 2 years ago • 7 comments

1. What were you trying to do? Understand the VCF output of vg call.

2. What did you want to happen? Analyse the VCF file, starting by retrieving the kind of variant present at each position from the INFO field, as in a "usual" VCF (for instance via SVTYPE, as in this example).

3. What actually happened? There is no such information in the output VCF. Example of the output:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  T004
Chr01       223     >15770416>8     AT      A       20.482  PASS    AT=>15770416>15770417>8,>15770416>8;DP=28       GT:DP:AD:GL:GQ:GP:XD:MAD        0/1:28:24,4:-7.56791,-6.00864,-53.1783:22:-1.10412:19.5098:4

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here: NONE

5. What data and command can the vg dev team use to make the problem happen?

I used the usual vg command pipeline: construct > giraffe > augment > snarls-pack > call

6. What does running vg version say?

vg version v1.40.0 "Suardi"
Compiled with g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 on Linux

alinehugo avatar Jun 01 '22 13:06 alinehugo

"Analyse VCF file, first retrieve which kind of variant is present in each position from INFO field as ''usual'' VCF as in SVTYPE in this example"

I have the same question. How can I get the SVTYPE from the VCF file output by vg?

JD12138 avatar Jun 12 '22 13:06 JD12138

From the VCF spec:

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">

Value should be one of DEL, INS, DUP, INV, CNV, BND. This key can be derived from the REF/ALT fields but is
useful for filtering

The two issues for vg are:

  • Number=1 means there can only be one SVTYPE per site, but a vg graph can often contain many different types of SVs at the same site.
  • Sometimes it's difficult to assign an allele to one of the simple categories (i.e. events can be more complex).

That said, I think you raise a fair point: we should at the very least provide scripts or suggested best practices for cleaning the VCFs and categorizing the SV calls, as we end up doing this ourselves when analyzing them.
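As an illustration of the kind of categorization script meant here, the following is a minimal sketch in Python. It is not an official vg tool. It assumes sequence-resolved alleles (explicit REF/ALT bases, as in the example record above) and classifies each ALT allele separately, which sidesteps the one-SVTYPE-per-site limitation. The function names and the labels `SNP`, `MNP`, `COMPLEX` and `OTHER` are choices made for this sketch, not values from the VCF spec.

```python
def shared_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def classify_allele(ref, alt):
    """Return (label, length_change) for one REF/ALT pair.

    length_change is len(alt) - len(ref), i.e. negative for deletions.
    """
    # Symbolic (<DEL>), breakend ([ or ]) and spanning-deletion (*) alleles
    # carry no explicit sequence, so they are not classified here.
    if alt.startswith("<") or "[" in alt or "]" in alt or alt == "*":
        return ("OTHER", 0)
    if ref == alt:
        return ("REF", 0)
    diff = len(alt) - len(ref)
    if diff == 0:
        return ("SNP" if len(ref) == 1 else "MNP", 0)
    # Trim bases shared at the start, then at the end, of both alleles.
    p = shared_prefix_len(ref, alt)
    r, a = ref[p:], alt[p:]
    s = shared_prefix_len(r[::-1], a[::-1])
    r, a = r[:len(r) - s], a[:len(a) - s]
    if r and a:
        # Bases are both removed and added: not a simple INS or DEL.
        return ("COMPLEX", diff)
    return ("INS" if a else "DEL", diff)


def classify_record(line):
    """Classify every ALT allele of one tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    ref, alts = fields[3], fields[4].split(",")
    return [classify_allele(ref, alt) for alt in alts]
```

For the record shown in the original report (REF `AT`, ALT `A`), `classify_record` yields one entry labelled `DEL` with a length change of -1. Whether an allele counts as an SV rather than a small indel is left to the caller; a length threshold such as 50 bp is a common convention, but that is an assumption and not something vg defines.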

glennhickey avatar Jun 13 '22 12:06 glennhickey

Has there been any progress regarding best practices when it comes to populating the SVTYPE and possibly the SVLEN field in a VCF generated from a pangenome graph? It would be useful to be able to make comparisons to VCFs produced by sniffles, etc. Thanks!

evcurran avatar Feb 13 '23 15:02 evcurran

Hi, @glennhickey. It is hard to understand the INFO field in the vg call output. Users mostly care about variant information such as SV position, SV type, and SV ID, as provided in the input VCF file for the autoindex-giraffe-pack-call workflow, so it would be helpful if vg call output the raw SV information. I sincerely hope the vg team can address this.

sen1019san avatar May 24 '23 14:05 sen1019san

For anyone coming across this issue with the same problem, I have found the 'truvari' tool useful for populating the INFO field of pangenome-derived VCFs. Running 'truvari anno' (https://github.com/acenglish/truvari/wiki/anno) allows you to add the SVLEN and SVTYPE tags. However, it can only accurately label straightforward insertions and deletions; everything else is tagged as 'UNK', so this isn't a perfect solution. It would be great to be able to compare the output with a VCF derived from a tool such as sniffles.

evcurran avatar May 30 '23 08:05 evcurran

Hi, @evcurran. I solved the problem in a similar way. The key remaining problem is comparing the VCF generated by vg call with the original input VCF: I find it hard to compare these two VCF files because the variant coordinates differ.

sen1019san avatar Jun 18 '23 16:06 sen1019san
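
One possible contributor to differing coordinates is that the same variant can be written with different amounts of flanking sequence in REF/ALT. This is an assumption; the thread does not establish the cause. Under that assumption, a minimal Python sketch that trims shared flanking bases before comparison could look like the following. The function name `trim_alleles` is hypothetical, and the sketch does not left-align variants inside repeats; tools such as `bcftools norm` do that using the reference sequence.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each.

    Returns the adjusted (pos, ref, alt). POS is 1-based, as in VCF.
    """
    # Drop shared trailing bases first; this does not change POS.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop shared leading bases, advancing POS by one for each.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

With this, a hypothetical record at position 222 with REF `CAT` and ALT `CA` trims to position 223, REF `AT`, ALT `A`, which is the same representation as the example record in the original report.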