strange handling of SVs by bcftools norm --fasta-ref
Hi bcftools,
I'm part of a project building a pangenome and we noticed some strange output by bcftools norm in terms of how it handles structural variants. I thought you may have some good recommendations, or would want to be aware of the strange behavior.
Here is the command I ran:
bcftools norm --fasta-ref $REF_FASTA input.vcf -o output.vcf
We noticed two main problems. In some cases, the POS of larger structural variants is shifted many more base pairs than we'd anticipate. For example, an 8887bp insertion at CHR1:41577 is shifted 140bp to CHR1:41437.
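For readers following along, here is a minimal Python sketch of the generic left-alignment idea for a single indel, to show how an insertion can legitimately move a long way when the surrounding reference is repetitive. This is not bcftools' implementation. The function name `left_align` and the toy reference are made up for illustration, and whether a repeat explains the 140bp shift above is only a guess without the data.

```python
def left_align(ref_seq, pos, ref, alt):
    """Left-align one biallelic indel (simplified sketch).

    ref_seq: reference sequence as a string (hypothetical toy input)
    pos:     1-based VCF POS of the first base of REF
    Returns the shifted (pos, ref, alt).
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if (not ref or not alt) and pos > 1:
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference: a 10 bp CA repeat flanked by unique sequence.
toy_ref = "GGG" + "CA" * 5 + "TTT"
# A "CA" insertion reported at the right end of the repeat (POS 13).
print(left_align(toy_ref, 13, "A", "ACA"))  # (3, 'G', 'GCA')
```

In this toy case the insertion slides across the whole 10bp repeat tract, so the size of the shift reflects the reference context rather than the size of the inserted sequence.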
We also noticed a particularly difficult case. In the original VCF (prior to normalization/left alignment), there are 5 structural variants distributed across two sites. At CHR1:671683, the individuals Moly and Tany have a >200bp insertion. The third individual, Pach, has the reference genotype. At site CHR1:671691, all three individuals have a >150bp insertion. After normalization and left alignment, all of the variants at the second site (CHR1:671691) are reassigned to CHR1:671683. This makes it appear as if there are conflicting alleles at the same site in our VCF.
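To find every place where this happens, one option is to compare record positions before and after normalization and report sites that absorbed records from more than one original position. The sketch below assumes each record carries a unique ID that survives normalization, which may not hold for every VCF. The function name `find_merged_sites` and the IDs `sv1` and `sv2` are hypothetical; only the two positions come from the case described above.

```python
from collections import defaultdict


def find_merged_sites(before, after):
    """Report normalized sites that received records from several original sites.

    before, after: dicts mapping a record ID to its (chrom, pos),
    taken from the VCF before and after normalization (assumed layout).
    """
    origins = defaultdict(set)
    for rec_id, new_site in after.items():
        origins[new_site].add(before[rec_id])
    # Keep only sites where distinct original positions collapsed together.
    return {site: sorted(olds) for site, olds in origins.items() if len(olds) > 1}


# Hypothetical IDs; positions taken from the example in this thread.
before = {"sv1": ("CHR1", 671683), "sv2": ("CHR1", 671691)}
after = {"sv1": ("CHR1", 671683), "sv2": ("CHR1", 671683)}
print(find_merged_sites(before, after))
```

Sites flagged this way could then be inspected by hand, or kept apart from any duplicate removal, so that no inserted sequence is silently dropped.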
I'm aware that there are options to remove duplicates (for example, --rm-dup) or to collapse variants. However, that would remove information we know is there based on the other outputs from the pangenome. For example, I would like to avoid representing the inserted sequence in Moly as one 200bp insertion when we know that at least 350bp are inserted in this region. Any feedback is appreciated. Thank you!
Can you provide a small test case? It is not possible to comment on these specific cases without seeing the data.
Sure. Here are two VCFs with the variant I mentioned. One of the VCFs is from before bcftools normalization, and the other is from after.
And I'll just reiterate that this VCF represents variants in a pangenome graph. So perhaps relying on a single reference, instead of the GFA of the graph, to normalize is causing issues. Let me know if you need more test examples. I'm a little busy, but I'm happy to help. You may be interested in this correspondence about other related issues: https://github.com/ComparativeGenomicsToolkit/cactus/issues/1557#issuecomment-2528989634
Maggs
(Sent by email in reply to Petr Danecek's message of Tuesday, December 10, 2024, 12:28 AM, on samtools/bcftools Issue #2330.)
Unfortunately, this is not sufficient. We need the FASTA reference and the input VCF file.