Different versions of vg produce different GAM and VCF files when mapping with vg giraffe and calling variants

Open Ahahaha3 opened this issue 1 year ago • 1 comment

PLEASE DO NOT MAKE SUPPORT REQUESTS HERE

Please use the Biostars forum instead:

https://www.biostars.org/new/post/?tag_val=vg

Hi, I ran vg giraffe with vg 1.35 and vg 1.38 on the same FASTQ file, but I got different .gam and .vcf files. For example, vg 1.35 produced a 17G .gam file and a 9M .vcf.gz file, while vg 1.38 produced an 18G .gam file and a 12M .vcf.gz file. Why would different versions cause this?

Ahahaha3, Nov 08 '23 13:11

We don't actually guarantee identical GAM or VCF output between minor releases. In each version, we fix bugs or make algorithm or parameter changes that could result in different output, especially for tools like Giraffe which rely heavily on heuristics and don't produce a single optimal "correct" answer.

To work this out in detail, you would want to look at the changelogs for the releases after 1.35, up to 1.38:

https://github.com/vgteam/vg/releases/tag/v1.36.0
https://github.com/vgteam/vg/releases/tag/v1.37.0
https://github.com/vgteam/vg/releases/tag/v1.38.0

For example, in 1.36 we changed Giraffe seeding. We bill that change as increasing speed, but it could also result in more or different seeds being picked, which could lead to different alignments:

  • Giraffe no longer uses duplicate minimizers as often for seeds, potentially increasing mapping speed.

We also started adding more annotations to the Giraffe GAM output, which might make it larger:

  • Giraffe records read and pair mapping wall clock times
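One rough way to check whether extra annotations account for the size difference is to dump some alignments from each GAM as JSON (for example with vg view -a) and count which annotation keys show up. The sketch below is only an illustration: the function name is made up, and it assumes each line is one JSON alignment with its annotations under a top-level "annotation" object.

```python
import json
from collections import Counter


def annotation_key_counts(json_lines):
    """Count how often each annotation key appears across alignments.

    json_lines: iterable of strings, one JSON-encoded alignment per line
    (assumed to be the output of `vg view -a reads.gam`).
    """
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        alignment = json.loads(line)
        # Assumption: annotations live under a top-level "annotation" object;
        # alignments without one simply contribute nothing.
        counts.update(alignment.get("annotation", {}).keys())
    return counts
```

Running this on the same subset of reads from the 1.35 and 1.38 GAMs and comparing the two counters should show which annotation keys, if any, are new in the later version.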

If you're concerned that the new GAM files are not just different but might be worse, you can use vg stats -a whatever.gam to get some statistics about the alignments, which you can compare.
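To make that comparison less manual, something like the following could run vg stats -a on both GAMs and report only the lines that differ. It is a minimal sketch that assumes the command prints "Label: value" lines; the helper names and the file names in the usage note are hypothetical.

```python
import subprocess


def run_vg_stats(gam_path):
    """Return the text printed by `vg stats -a <gam_path>` (needs vg on PATH)."""
    result = subprocess.run(
        ["vg", "stats", "-a", gam_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_stats(text):
    """Parse 'Label: value' lines into a dict; other lines are ignored."""
    stats = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and value.strip():
            stats[label.strip()] = value.strip()
    return stats


def diff_stats(old_text, new_text):
    """Return {label: (old_value, new_value)} for labels whose values differ."""
    old, new = parse_stats(old_text), parse_stats(new_text)
    return {
        label: (old.get(label), new.get(label))
        for label in sorted(set(old) | set(new))
        if old.get(label) != new.get(label)
    }
```

For example, diff_stats(run_vg_stats("v1.35.gam"), run_vg_stats("v1.38.gam")) would return only the statistics that changed between the two runs. Values are kept as strings because some lines may carry more than one number.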

adamnovak, Nov 15 '23 15:11