
Duplications vs Insertions

Open np-clark opened this issue 10 months ago • 1 comments

We were wondering if you could explain the following issue regarding duplications of the +1 base at some canonical donor sites. For the variants NM_000179.2:c.3646+1dup and NM_000249.3:c.1558+1dup, the delta scores are all 0. Yet if you type in the variants as insertions, i.e. NM_000179.2:c.3646+1_3646+2insG and NM_000249.3:c.1558+1_1558+2insG respectively, SpliceAI gives a donor loss score of 1.00 and a donor gain of 1.00. Could you explain why this is occurring please? Do we need to enter duplication variants as insertions in future? Please see links below:

https://spliceailookup.broadinstitute.org/#variant=NM_000179.2%3Ac.3646%2B1dup&hg=37&bc=basic&distance=500&mask=0&ra=1

https://spliceailookup.broadinstitute.org/#variant=NM_000179.2%3Ac.3646%2B1_3646%2B2insG&hg=37&bc=basic&distance=500&mask=0&ra=1

https://spliceailookup.broadinstitute.org/#variant=NM_000249.3%3Ac.1558%2B1dup&hg=37&bc=basic&distance=500&mask=0&ra=1

https://spliceailookup.broadinstitute.org/#variant=NM_000249.3%3Ac.1558%2B1_1558%2B2insG&hg=37&bc=basic&distance=500&mask=0&ra=1

np-clark, Jan 08 '25 22:01

Interesting question. From what I can tell, these examples show how the SpliceAI model can be very sensitive to the exact position of the variant (even when two variants at slightly different positions produce the same ALT haplotype sequence).

In more detail, NM_000249.3:c.1558+1dup and NM_000249.3:c.1558+1_1558+2insG are different ways of describing the same ALT haplotype, which contains four G's instead of three:

Starting from the chr3:37,070,419 position:  
REF: ...TGAGGGTAC...
ALT: ...TGAGGGGTAC...

However, the "dup" variant inserts the extra G at 3:37070423 while the "ins" variant inserts it at 3:37070424:

image image
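As a quick sanity check of the equivalence described above, here is a minimal Python sketch that applies both placements to the reference snippet. It assumes the snippet starts at chr3:37,070,419 as stated, and that both variants are written VCF-style as G>GG with the anchor base at the listed position. The helper name `apply_variant` is made up for illustration and is not part of SpliceAI.

```python
# Reference snippet quoted above; its first base is assumed to sit at 3:37070419.
REF_START = 37_070_419
REF_SEQ = "TGAGGGTAC"


def apply_variant(seq, seq_start, pos, ref, alt):
    """Apply a VCF-style variant (1-based pos, REF, ALT) to a short sequence."""
    i = pos - seq_start  # 0-based index of the anchor base within seq
    if seq[i:i + len(ref)] != ref:
        raise ValueError(f"REF allele {ref!r} does not match sequence at {pos}")
    return seq[:i] + alt + seq[i + len(ref):]


# "dup" placement: G>GG anchored at 3:37070423
dup_alt = apply_variant(REF_SEQ, REF_START, 37_070_423, "G", "GG")
# "ins" placement: G>GG anchored at 3:37070424
ins_alt = apply_variant(REF_SEQ, REF_START, 37_070_424, "G", "GG")

print(dup_alt)             # TGAGGGGTAC
print(ins_alt)             # TGAGGGGTAC
print(dup_alt == ins_alt)  # True
```

Both placements produce the same string, so any difference in scores cannot come from the ALT haplotype itself.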

This means the SpliceAI input window is shifted by 1 base to the right for the "ins" relative to the "dup". I've confirmed that, under the hood, SpliceAI converts these variants to the expected REF and ALT sequences before feeding them into the neural network. Also, I've confirmed that the same predictions are found in the original pre-computed scores generated by Illumina: if I download spliceai_scores.raw.indel.hg19.vcf.gz from https://basespace.illumina.com/analyses/194103939/files and run

tabix spliceai_scores.raw.indel.hg19.vcf.gz 3:37070422-37070425 | grep $'G\tGG'

I see high scores only where the G>GG insertion is placed at the 3:37070424 position:

image

In conclusion, this seems to be an error / inconsistency involving the core neural net model, rather than something about the surrounding code that reformats variants into neural net inputs or post-processes the scores.
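To make the one-base window shift concrete, here is a toy Python sketch. It assumes the model input is a fixed-width window centred on the variant position, and it uses a half-width of 3 bases purely for illustration (the real context is far larger). `model_inputs` is a hypothetical helper, not SpliceAI's actual code.

```python
# Reference snippet from above; its first base is assumed to sit at 3:37070419.
REF_START = 37_070_419
REF_SEQ = "TGAGGGTAC"
HALF = 3  # toy half-width; an assumption for illustration only


def model_inputs(pos, ref, alt, half=HALF):
    """Return (x_ref, x_alt): a window centred on pos, and the same window
    with the REF allele at its centre replaced by the ALT allele."""
    c = pos - REF_START                      # 0-based index of the centre base
    x_ref = REF_SEQ[c - half:c + half + 1]   # window centred on the variant
    x_alt = x_ref[:half] + alt + x_ref[half + len(ref):]
    return x_ref, x_alt


dup_ref, dup_alt = model_inputs(37_070_423, "G", "GG")
ins_ref, ins_alt = model_inputs(37_070_424, "G", "GG")

print(dup_ref, dup_alt)  # GAGGGTA GAGGGGTA
print(ins_ref, ins_alt)  # AGGGTAC AGGGGTAC
```

Under these assumptions the two placements encode the same haplotype, but the network sees inputs offset by one base, which fits the observation that the scores depend on where the insertion is anchored.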

Separately, I need to figure out why the position column is showing such a large number for the 3:37070424 G>GG result: image

bw2, Jan 13 '25 19:01