svim
svim copied to clipboard
using svim alignment to genotype, all the tandem duplications show missing value
hello! using svim alignment to genotype, all the tandem duplications show missing value.
- generated simulated data by VISOR LASeR, here is my input: 1 122200 125000 tandem duplication 2 10 1 126200 128500 tandem duplication 2 10 and then i got 50× aligned dataset (bam)
- using SVIM to call SV svim alignment dup_svim DUP50.bam pathway/ref/human_hs37d5.chr1.fasta --min_sv_size 50 --minimum_depth 1 --minimum_score 1 --types DUP:TANDEM
- variants.vcf CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample 1 122204 svim.DUP_TANDEM.3 N DUP:TANDEM 4 PASS SVTYPE=DUP:TANDEM;END=125002;SVLEN=2799;SUPPORT=4;STD_SPAN=4.2;STD_POS=4.09 GT:CN:DP:AD ./.:2:.:.,. 1 126200 svim.DUP_TANDEM.4 N DUP:TANDEM 42 PASS SVTYPE=DUP:TANDEM;END=128501;SVLEN=2302;SUPPORT=34;STD_SPAN=0.72;STD_POS=0.36 GT:CN:DP:AD ./.:2:.:.,.
- why these two dup.tandem are missing value (./.)? then, with inputting the same bam , other softwares show 0/1 or 0/0 or 1/1. Thank you very much!
Hi,
thanks for your question. The reason why genotypes are missing is that SVIM is not capable of genotyping tandem duplications at the moment. While genotyping of deletions, inversions, insertions and interspersed duplications is pretty straightforward, tandem duplications are a bit more challenging.
In theory, duplicated regions can have different copy numbers on the two parental haplotypes (e.g. 3 copies on the maternal haplotype and 2 copies on the paternal haplotype). SVIM analyzes characteristic signatures in the read alignments to detect tandem duplications. In particular, split reads where different parts map to the same region of the reference genome (the duplicated region) indicate a tandem duplication. For each read, SVIM obtains a copy number (e.g. 2 when the read covers the reference region twice). Note that this copy number is a only lower limit, because we cannot be sure that the read covers all additional copies. If a read covers only the first 2 of 5 copies, we will not see that from the read alone.
Now the question is: how do we genotype the tandem duplications SVIM has found?
In a first step, we can count the reads that cover the entire region without any duplication. Let's call them R
because they match the reference genome (wildtype allele). Additionally, we have the (lower limit) copy numbers from all reads carrying a duplication in the region (variant alleles). Let's store them in the array C[m] = n
where n is the number of reads containing m copies of the region.
In the most simple scenario, all reads have the same copy number m
: Then, we can compare C[m]
with R
and call a homozygous variant when C[m] / (C[m] + R) > 0.8
. Alternatively, we call a heterozygous variant.
In a more difficult scenario, we might observe different copy numbers in different reads. In that case, it might make sense to first compare R
with sum(C[m] for all m)
to decide whether a substantial fraction of reads carries the wildtype allele. If not, we could compare the two m
with the largest C[m]
to distinguish homozygous and heterozygous variants.
Do you think this kind of approach would make sense? If yes, I could implement it over the next few days so that you can obtain genotypes for your data.
Cheers David