
Some genomic regions are not genotyped properly when the genome is too large

Open yinmou21 opened this issue 2 years ago • 1 comments

Hi Hannes,

I encountered a bug when using GraphTyper2 to genotype SVs in the wheat genome, which is about 15 Gb. When the genome is larger than about 4.3 Gb, some chromosomes are not genotyped properly.

I tested whether the chromosome ID format, the number of chromosomes, or the software version (all versions from v2.6 to v2.7) caused the failure. Changing any of these did not restore normal genotyping. After repeated testing, I found that the cause is the size of the genome. I suspect that once the third column of the reference genome index file (.fai), which holds the byte offset of each sequence, reaches about 4G, the affected chromosomes fail to genotype properly. The details of this error are as follows:

I genotyped with the following command:

graphtyper genotype_sv TaCs42IwgscRefV1_splitChr.fa mergedWholePop.vcf.gz --sams=bam.list --threads=10 --region=chr3A_part1:1-454103970

Here TaCs42IwgscRefV1_splitChr.fa is the reference genome and mergedWholePop.vcf.gz is the SV set merged by svimmer.

The contents of the index file TaCs42IwgscRefV1_splitChr.fa.fai:

chr1A_part1     471304005       13      471304005       471304006
chr1A_part2     122798051       471304032       122798051       122798052
chr1B_part1     438720154       594102097       438720154       438720155
chr1B_part2     251131716       1032822265      251131716       251131717
chr1D_part1     452179604       1283953995      452179604       452179605
chr1D_part2     43273582        1736133613      43273582        43273583
chr2A_part1     462376173       1779407209      462376173       462376174
chr2A_part2     318422384       2241783396      318422384       318422385
chr2B_part1     453218924       2560205794      453218924       453218925
chr2B_part2     348037791       3013424732      348037791       348037792
chr2D_part1     462216879       3361462537      462216879       462216880
chr2D_part2     189635730       3823679430      189635730       189635731
chr3A_part1     454103970       4013315174      454103970       454103971
chr3A_part2     296739669       4467419158      296739669       296739670
chr3B_part1     448155269       4764158841      448155269       448155270
chr3B_part2     382674495       5212314124      382674495       382674496
chr3D_part1     476235359       5594988633      476235359       476235360
chr3D_part2     139317064       6071224006      139317064       139317065
chr4A_part1     452555092       6210541084      452555092       452555093
chr4A_part2     292033065       6663096190      292033065       292033066
chr4B_part1     451014251       6955129269      451014251       451014252
chr4B_part2     222603248       7406143534      222603248       222603249
chr4D_part1     451004620       7628746796      451004620       451004621
chr4D_part2     58852447        8079751430      58852447        58852448
chr5A_part1     453230519       8138603891      453230519       453230520
chr5A_part2     256543224       8591834424      256543224       256543225
chr5B_part1     451372872       8848377662      451372872       451372873
chr5B_part2     261776885       9299750548      261776885       261776886
chr5D_part1     451901030       9561527447      451901030       451901031
chr5D_part2     114179647       10013428491     114179647       114179648
chr6A_part1     452440856       10127608152     452440856       452440857
chr6A_part2     165638404       10580049022     165638404       165638405
chr6B_part1     452077197       10745687440     452077197       452077198
chr6B_part2     268911281       11197764651     268911281       268911282
chr6D_part1     450509124       11466675946     450509124       450509125
chr6D_part2     23083594        11917185084     23083594        23083595
chr7A_part1     450046986       11940268692     450046986       450046987
chr7A_part2     286659250       12390315692     286659250       286659251
chr7B_part1     453822637       12676974956     453822637       453822638
chr7B_part2     296797748       13130797607     296797748       296797749
chr7D_part1     453812268       13427595369     453812268       453812269
chr7D_part2     184873787       13881407651     184873787       184873788
chrUn   480980714       14066281446     480980714       480980715
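
The suspected 4G mark can be checked against this index. The following is a minimal sketch, assuming (this thread does not confirm it) that the relevant quantity is the cumulative sequence length and that the limit is 2^32, the range of a 32-bit unsigned integer. The helper names `parse_fai` and `first_contig_past_limit` are made up for illustration and are not part of GraphTyper.

```python
UINT32_LIMIT = 2**32  # 4,294,967,296; assumed limit, not confirmed for GraphTyper

# First 13 entries of the index above; only name and length are needed here.
FAI_HEAD = """\
chr1A_part1 471304005
chr1A_part2 122798051
chr1B_part1 438720154
chr1B_part2 251131716
chr1D_part1 452179604
chr1D_part2 43273582
chr2A_part1 462376173
chr2A_part2 318422384
chr2B_part1 453218924
chr2B_part2 348037791
chr2D_part1 462216879
chr2D_part2 189635730
chr3A_part1 454103970
"""


def parse_fai(text):
    """Return (name, length) pairs from .fai text (column 1 = name, column 2 = length)."""
    contigs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            contigs.append((fields[0], int(fields[1])))
    return contigs


def first_contig_past_limit(contigs, limit=UINT32_LIMIT):
    """Return (name, bases_that_fit_before_limit) for the first contig whose
    cumulative end exceeds the limit, or None if the genome stays below it."""
    start = 0  # cumulative length of all preceding contigs
    for name, length in contigs:
        if start + length > limit:
            return name, limit - start
        start += length
    return None


print(first_contig_past_limit(parse_fai(FAI_HEAD)))
```

For this index the first contig to cross 2^32 is chr3A_part1, with 281,652,303 bases fitting below the limit. That is the same contig used in the `--region` argument of the command above.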

Chromosomal regions requiring genotyping: image

The output results: image

image

The ALT column is consistent and the POS column looks regular, but the chromosome ID and the END value in the INFO column are incorrect in the output. Once the value in the third column of the index file reaches approximately 4G (that is, once the cumulative genome size exceeds about 4 Gb), the subsequent chromosomes no longer produce correct results: the chromosome ID and the POS information are reset.
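
One way such a reset could arise is sketched below. It is only an illustration under a loud assumption: that positions are held as a single genome-wide coordinate in a 32-bit unsigned integer, which wraps around at 2^32. Nothing in this thread confirms that GraphTyper works this way, and the function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

# (name, length) for the first 13 contigs of the .fai shown earlier
CONTIGS = [
    ("chr1A_part1", 471304005), ("chr1A_part2", 122798051),
    ("chr1B_part1", 438720154), ("chr1B_part2", 251131716),
    ("chr1D_part1", 452179604), ("chr1D_part2", 43273582),
    ("chr2A_part1", 462376173), ("chr2A_part2", 318422384),
    ("chr2B_part1", 453218924), ("chr2B_part2", 348037791),
    ("chr2D_part1", 462216879), ("chr2D_part2", 189635730),
    ("chr3A_part1", 454103970),
]
NAMES = [name for name, _ in CONTIGS]
# STARTS[i] = number of bases before contig i; the last entry is the total length
STARTS = [0] + list(accumulate(length for _, length in CONTIGS))


def to_absolute(name, pos):
    """Convert a 1-based contig position to a 0-based genome-wide coordinate."""
    return STARTS[NAMES.index(name)] + pos - 1


def from_absolute(abs_pos):
    """Convert a 0-based genome-wide coordinate back to (contig, 1-based position)."""
    if not 0 <= abs_pos < STARTS[-1]:
        raise ValueError("coordinate outside the listed contigs")
    idx = bisect_right(STARTS, abs_pos) - 1
    return NAMES[idx], abs_pos - STARTS[idx] + 1


def wrap_uint32(value):
    """Mimic storing a value in a 32-bit unsigned integer (assumed behaviour)."""
    return value % 2**32


# A position early in chr3A_part1 survives the round trip ...
print(from_absolute(wrap_uint32(to_absolute("chr3A_part1", 100))))
# ... but one past the 2^32 boundary lands on the first contig instead.
print(from_absolute(wrap_uint32(to_absolute("chr3A_part1", 300_000_000))))
```

Under this assumption, chr3A_part1:300000000 comes back as a position on chr1A_part1, which resembles the reset of chromosome ID and position described above.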

At present, there are two workarounds for this problem: 1) convert the chromosome ID, the POS column, and the END value in the INFO column back to the correct values after genotyping; or 2) reorder the chromosomes and build multiple reference genome index files, genotyping only the first few chromosomes of each constructed reference.
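
The second workaround amounts to splitting the contigs into groups whose combined length stays under the suspected limit. A rough sketch follows, assuming the limit is 2^32 bases. `group_contigs` is a hypothetical helper, and the greedy in-order grouping is just one simple choice.

```python
def group_contigs(contigs, limit=2**32):
    """Greedily split (name, length) pairs, in order, into groups whose
    total length does not exceed `limit` (assumed to be 2**32 here)."""
    groups = []
    current, total = [], 0
    for name, length in contigs:
        if length > limit:
            raise ValueError(f"{name} alone exceeds the limit")
        if current and total + length > limit:
            # Adding this contig would pass the limit: start a new group.
            groups.append(current)
            current, total = [], 0
        current.append(name)
        total += length
    if current:
        groups.append(current)
    return groups


# Toy example with a small limit so the grouping is easy to follow.
toy = [("a", 3), ("b", 3), ("c", 2), ("d", 5)]
print(group_contigs(toy, limit=6))
```

Each group could then be written to its own FASTA and indexed, for example with `samtools faidx`, so that every contig to be genotyped sits below the limit in its own reference. Leaving some margin below 2^32 may be prudent, since the exact threshold is not established in this thread.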

A better solution, of course, is to modify the source code so that larger genomes are supported as well. I hope you can look into this problem and reply when you have time.

Thank you!

yinmou21, Apr 26 '22 05:04

Hi, sorry for the late response; I have been on a long leave.

It is very possible that some overflows occur when working with genomes larger than 4 Gb. We have no such genomes at deCODE, so supporting them has not been a priority for us.

But is my understanding correct that the problem is only with translocations? And simple SVs like deletions and duplications are ok?

hannespetur, Dec 01 '22 13:12