aldy icon indicating copy to clipboard operation
aldy copied to clipboard

Discrepancies between Aldy results from BAM vs VCF files (hg19 reference genome)

Open paraveeown opened this issue 3 months ago • 2 comments

Dear Aldy Team,

I analyzed targeted sequencing data of 100 PGx genes aligned to the hg19 reference genome using both BAM and VCF files in Aldy v4.4 (Python 3.9.7 on Linux 3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17). Discrepancies in star allele calls between BAM and VCF of the same sample were often understandable, attributed to the fact that alleles with CNVs could not be detected from VCF.

However, some differences could not be explained by this limitation.

For instance, in CYP2C19, sample that was called *1/*1 from the BAM file got *38/*38 from the VCF file.

BAM analysis: *1/*1 image VCF analysis: *38/*38 image

Based on PharmVar: *38 = wild-type (no variant) *1 = rs3758581 (A>G)

In the genome browser, the BAM file of this sample showed homozygous G at this position. Since G is the reference allele in hg19, it wasn't shown as a variant in the VCF file. image

The BAM file showed homozygous G, so it's logical that Aldy called *1/*1 for this sample. However, I'm confused about why the VCF file, which also represented homozygous G, didn't produce the same result?

By the way, I noticed an interesting difference between reference genomes. In GRCh38, the G allele at this position is considered the alternate allele, making it a variant in VCF file, unlike in hg19. image

Could it be that the program expected the G allele of CYP2C19*1 to be recognized as a variant in the VCF file (like in GRCh38)? However, because it is a reference allele in hg19 and doesn't appear in the VCF, the program skipped it?

The VCF log file also indicates that the *1 allele was removed because rs3758581 wasn't present, resulting in it being called as *38/*38 instead. image


Another gene with a similar discrepancy is CYP3A7. Sample that was called *2/*2 when using the BAM file got *1/*1 from the VCF file.

BAM analysis: *2/*2 image VCF analysis: *1/*1 image

However, sample that was called *1/*2 from the BAM file also got *1/*2 from the VCF file. BAM analysis: *1/*2 image VCF analysis: *1/*2 image

This suggested that *2 might not be detected in VCF files only when it is homozygous?

Based on PharmVar: *1 = wild-type (no variant) *2 = rs2257401 (G>C on positive strand)

In the genome browser, sample 392 with *2/*2 showed homozygous C at this position. Since C is the reference allele in hg19, it did not show up in the VCF. Sample 394 with *1/*2 had both G and C at this position. Since G is the alternate allele in hg19, it was included in the VCF. image

Similar to the previously mentioned CYP2C19 SNP, the C allele at this position is the reference allele in hg19, but is an alternate allele in GRCh38. image

So, again, it seems that the program might anticipate the C allele of CYP3A7*2 to be in the VCF file (like in GRCh38), but because it is a reference allele in hg19 and was not present in the VCF, the program did not considered it and called *1 instead.

Can anyone please check if I've understood this correctly? Or are there other possible reasons for these differences in Aldy results from the BAM and VCF files?

Feel free to ask for more information, and thank you in advance.

Paravee

paraveeown avatar Apr 06 '24 14:04 paraveeown