graphtyper
Some genomic regions are not genotyped properly when the genome is too large
Hi Hannes,
I encountered a bug when using GraphTyper2 to genotype SVs in the wheat genome, which is about 15 Gb. When the genome is too large (more than about 4.3 Gb), some chromosomes are not genotyped properly.
I tested whether the format of the chromosome IDs, the number of chromosomes, or the software version (all versions from v2.6 to v2.7) was responsible. Changing any of these did not restore normal genotyping. After repeated testing, I found that the cause is the size of the genome. I suspect the failure starts once the third column of the reference genome index file (.fai), the byte offset of each sequence, reaches about 4G. The details are as follows:
I genotyped with the following command, where TaCs42IwgscRefV1_splitChr.fa is the reference genome and mergedWholePop.vcf.gz is the SV set merged by svimmer:
graphtyper genotype_sv TaCs42IwgscRefV1_splitChr.fa mergedWholePop.vcf.gz --sams=bam.list --threads=10 --region=chr3A_part1:1-454103970
Contents of the TaCs42IwgscRefV1_splitChr.fa.fai index file:
chr1A_part1 471304005 13 471304005 471304006
chr1A_part2 122798051 471304032 122798051 122798052
chr1B_part1 438720154 594102097 438720154 438720155
chr1B_part2 251131716 1032822265 251131716 251131717
chr1D_part1 452179604 1283953995 452179604 452179605
chr1D_part2 43273582 1736133613 43273582 43273583
chr2A_part1 462376173 1779407209 462376173 462376174
chr2A_part2 318422384 2241783396 318422384 318422385
chr2B_part1 453218924 2560205794 453218924 453218925
chr2B_part2 348037791 3013424732 348037791 348037792
chr2D_part1 462216879 3361462537 462216879 462216880
chr2D_part2 189635730 3823679430 189635730 189635731
chr3A_part1 454103970 4013315174 454103970 454103971
chr3A_part2 296739669 4467419158 296739669 296739670
chr3B_part1 448155269 4764158841 448155269 448155270
chr3B_part2 382674495 5212314124 382674495 382674496
chr3D_part1 476235359 5594988633 476235359 476235360
chr3D_part2 139317064 6071224006 139317064 139317065
chr4A_part1 452555092 6210541084 452555092 452555093
chr4A_part2 292033065 6663096190 292033065 292033066
chr4B_part1 451014251 6955129269 451014251 451014252
chr4B_part2 222603248 7406143534 222603248 222603249
chr4D_part1 451004620 7628746796 451004620 451004621
chr4D_part2 58852447 8079751430 58852447 58852448
chr5A_part1 453230519 8138603891 453230519 453230520
chr5A_part2 256543224 8591834424 256543224 256543225
chr5B_part1 451372872 8848377662 451372872 451372873
chr5B_part2 261776885 9299750548 261776885 261776886
chr5D_part1 451901030 9561527447 451901030 451901031
chr5D_part2 114179647 10013428491 114179647 114179648
chr6A_part1 452440856 10127608152 452440856 452440857
chr6A_part2 165638404 10580049022 165638404 165638405
chr6B_part1 452077197 10745687440 452077197 452077198
chr6B_part2 268911281 11197764651 268911281 268911282
chr6D_part1 450509124 11466675946 450509124 450509125
chr6D_part2 23083594 11917185084 23083594 23083595
chr7A_part1 450046986 11940268692 450046986 450046987
chr7A_part2 286659250 12390315692 286659250 286659251
chr7B_part1 453822637 12676974956 453822637 453822638
chr7B_part2 296797748 13130797607 296797748 296797749
chr7D_part1 453812268 13427595369 453812268 453812269
chr7D_part2 184873787 13881407651 184873787 184873788
chrUn 480980714 14066281446 480980714 480980715
Chromosomal regions requiring genotyping:
The output results:
In the output, the ALT column is consistent and the POS column looks regular, but the chromosome ID and the END value in the INFO column are incorrect. Once the third column of the index file reaches approximately 4G (that is, once the cumulative genome size exceeds about 4 Gb), the subsequent chromosomes no longer produce correct results: the chromosome ID and POS values are reset.
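To make the 4G threshold concrete, here is a minimal Python sketch that reads .fai text and lists the sequences whose cumulative end position exceeds 2^32 = 4,294,967,296. It assumes, without confirmation from the GraphTyper source, that the tool tracks a genome-wide position in a 32-bit unsigned integer. The function name contigs_past_limit is hypothetical.

```python
UINT32_LIMIT = 2**32  # 4,294,967,296: first value a 32-bit unsigned integer cannot hold


def contigs_past_limit(fai_text, limit=UINT32_LIMIT):
    """Return the names of sequences whose cumulative end position exceeds `limit`.

    `fai_text` is the content of a .fai file: one sequence per line, with the
    name in column 1 and the length in bases in column 2.
    Assumption: the tool under test numbers bases cumulatively in .fai order.
    """
    flagged = []
    cumulative = 0
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        name, length = fields[0], int(fields[1])
        cumulative += length
        if cumulative > limit:
            flagged.append(name)
    return flagged
```

For the index listed above, the cumulative length is 4,013,314,993 bases at the end of chr2D_part2 and 4,467,418,963 at the end of chr3A_part1, so chr3A_part1 is the first sequence to cross 2^32. That matches the region used in the failing command.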
At present, the workarounds are: 1) correct the chromosome ID, the POS column, and the END value in the INFO column of the output afterwards; or 2) reorder the chromosomes and build several reference genome files, each with its own index, so that only the first few chromosomes of each reference are genotyped in a given run.
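The second workaround could be planned with a sketch like the one below, which groups sequences in file order into batches whose total length stays under a limit. The function name split_into_batches and the default limit of 2^32 - 1 are assumptions for illustration. The real safe limit depends on how GraphTyper stores positions, which is not confirmed here.

```python
def split_into_batches(contigs, limit=2**32 - 1):
    """Group (name, length) pairs, in order, into batches of total length <= limit.

    Returns a list of batches, each a list of sequence names.
    Raises ValueError if a single sequence is longer than `limit`.
    """
    batches = []
    current = []
    total = 0
    for name, length in contigs:
        if length > limit:
            raise ValueError(f"{name} alone exceeds the limit of {limit} bases")
        # Start a new batch when adding this sequence would pass the limit.
        if current and total + length > limit:
            batches.append(current)
            current, total = [], 0
        current.append(name)
        total += length
    if current:
        batches.append(current)
    return batches
```

Each batch of names could then be extracted into its own FASTA file and indexed, for example with samtools faidx. Whether GraphTyper accepts BAM files aligned to the full reference together with such a reduced FASTA would still need to be tested.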
A better solution, of course, would be to modify the source code so that larger genomes are supported as well. I hope you can look into this problem and reply when you have time.
Thank you!
Hi, sorry for the late response; I have been on a long leave.
It is very possible that some overflows occur when working with >4 Gb genomes. We have no such genomes at deCODE, so it hasn't been important for us to support them.
But is my understanding correct that the problem is only with translocations? And simple SVs like deletions and duplications are ok?