SpliceAI-lookup Issue with duplications of the intron/exon boundary

Hello, I have found that the variant NM_000179.3:c.261-2_263dup (HGVS nomenclature) may not be analyzed correctly by SpliceAI-lookup (https://spliceailookup.broadinstitute.org/#variant=NM_000179.3%3Ac.261-2_263dup&hg=38&distance=500&mask=0&ra=1). Looking at it manually against the consensus sequence at acceptor sites, it seems straightforward that the variant should be predicted to cause loss of the native acceptor site and gain of a new acceptor site, extending the exon by 5 bp.

However, for c.261-2_263dup SpliceAI-lookup shows essentially no splicing change (run unmasked, with a 500 bp window).

However, because this indel is “scootable”, it can be represented by several different c. notations, including c.261-3_261-2insAGTTG (https://spliceailookup.broadinstitute.org/#variant=NM_000179.3%3Ac.261-3_261-2insAGTTG&hg=38&distance=500&mask=0&ra=1). Here SpliceAI-lookup does find the change expected from manually comparing the sequence to the splice consensus sequence: loss of the native acceptor site (delta 0.98) and gain of an acceptor site (delta 0.98). This would extend the exon by 5 bp, although the exact length of the extension cannot be deduced from the viewer or from the distance in the table, because both are given with respect to the reference sequence and this is an insertion.

c.263_264insAGTTG (https://spliceailookup.broadinstitute.org/#variant=NM_000179.3%3Ac.263_264insAGTTG&hg=38&distance=500&mask=0&ra=1) is another equivalent c. notation that, like c.261-2_263dup, gives no predicted splicing change.

Would it be possible to look into this issue and identify what is causing it and what kind of variants are impacted by it? Could it be a problem affecting all insertions? Thanks so much for taking a look.

Mar 02 '24 00:03 SophieCandille

Thanks for reporting this issue.

SpliceAI-lookup uses Ensembl APIs to convert HGVS notation to chrom-pos-ref-alt before passing it to the SpliceAI model, so I've changed it to also show the intermediate chrom-pos-ref-alt representation in the results.

In this case, I see that NM_000179.3:c.263_264insAGTTG is converted to chr2:47790929 G>GAGTTG which leads to the following alt sequence (starting from chr2:47,790,921, and with | delineating the inserted bases): ...CAACAGTTG|AGTTG|TGACTTCTC...

while NM_000179.3:c.261-3_261-2insAGTTG is converted to chr2:47790924 C>CAGTTG but leads to the same alt sequence: ...CAAC|AGTTG|AGTTGTGACTTCTC...

As you mentioned, we would expect these variants to have the same splicing predictions, but due to a difference in how the ref and alt scores line up, the model currently produces very different delta scores. I can fix this inconsistency by left-aligning the variant before passing it to the model. Here, left-alignment would convert chr2:47790929 G>GAGTTG to chr2:47790924 C>CAGTTG and produce the expected splicing predictions.
Left alignment seems like an arbitrary way to choose among the different possible representations, so it's probably just luck that in this case it gives you the scores you expect, but at least it would produce consistent results. Also, I think this type of inconsistency is currently possible for all "scootable" insertions/deletions.
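
The left-alignment step described above can be sketched roughly as follows. This is only an illustration, not the SpliceAI-lookup implementation: `REF_SEQ`, `REF_START`, `base_at`, `apply_variant` and `left_align` are hypothetical names, the reference window is copied from the sequences quoted in this thread, and its start coordinate is inferred from the statement that `CAAC...` begins at chr2:47,790,921. A real pipeline would more likely rely on an existing normalization tool.

```python
# Toy reference window copied from the sequences quoted in this thread.
# ASSUMPTION: REF_START is inferred from "CAAC..." starting at chr2:47,790,921.
REF_SEQ = "CCTTTTGGCAACAGTTGTGACTTCTCACCAGGAGATTTGGTTT"
REF_START = 47790913  # 1-based coordinate of REF_SEQ[0]


def base_at(pos):
    """Return the reference base at a 1-based position inside the toy window."""
    idx = pos - REF_START
    if not 0 <= idx < len(REF_SEQ):
        raise IndexError(f"position {pos} is outside the toy reference window")
    return REF_SEQ[idx]


def apply_variant(pos, ref, alt):
    """Build the ALT haplotype for the toy window."""
    idx = pos - REF_START
    if REF_SEQ[idx:idx + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference window")
    return REF_SEQ[:idx] + alt + REF_SEQ[idx + len(ref):]


def left_align(pos, ref, alt):
    """Shift an indel as far left as possible (hypothetical helper)."""
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            pos -= 1
            left = base_at(pos)
            ref, alt = left + ref, left + alt
            changed = True
    # Trim shared leading bases, keeping one anchor base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


print(left_align(47790929, "G", "GAGTTG"))
print(apply_variant(47790929, "G", "GAGTTG") == apply_variant(47790924, "C", "CAGTTG"))
```

Both representations yield the same ALT haplotype, so normalizing to one of them before scoring would at least make the reported deltas consistent, whichever representation the user enters.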

Mar 03 '24 17:03 bw2

Thank you so much for looking into it so promptly. Is your sense that this is something that could be fixed in the near future? And do you think the problem is in SpliceAI itself, or in what sits on top of it, such as the coordinate lookup or the web interface? If SpliceAI itself has an issue, it may not necessarily be in the prediction itself but rather in the calculation of the delta and distance, because this kind of situation may not have been anticipated. If SpliceAI worked correctly, it should give the same result (i.e. the same delta and distance) no matter which equivalent genomic representation is given (because they all represent the exact same sequence).

Mar 03 '24 19:03 SophieCandille

Wow, this proved to be challenging and interesting to track down. My current understanding is that this isn't a bug, but is an inevitable consequence of assigning a single genomic reference coordinate to each delta score prediction. This isn't a problem for deletions (or for SNVs) since each base in their REF as well as their ALT sequence has an assigned position in the reference genome, but can become a problem for insertions because the ALT bases don't have assigned reference coordinates.

To make this more concrete, let's say we have a large variant - chr1:123456789 T>TAGATGA... - that inserts many kilobases of sequence, including an entire new gene. This inserted sequence naturally contains many novel acceptor and donor sites. If we were to pass this variant to the SpliceAI model, it would generate the REF and ALT haplotype sequences, and separately predict donors and acceptors in each haplotype (including all the new donors and acceptors in the ALT). So far everything is fine. Now, because it chose to provide these predictions to users in the form of delta scores located at genomic positions, it has to somehow represent all the donor and acceptor scores in the ALT sequence as a single score at some reference coordinate. The SpliceAI code does this by taking the max score from the inserted bases and placing it at the genomic position where the bases were inserted (i.e. chr1:123456789). This type of collapsing discards information, and is the source of the inconsistency you are seeing.

For the variant(s) you shared, if we look at SpliceAI predictions for the REF and ALT haplotypes before delta scores are calculated:

chr2:47790929-G-GAGTTG:

                0.98   = probability that this base is an acceptor based on REF haplotype sequence
                 |
...CCTTTTGGCAACAGTTG|-----|TGACTTCTCACCAGGAGATTTGGTTT...  (REF)
...CCTTTTGGCAACAGTTG|AGTTG|TGACTTCTCACCAGGAGATTTGGTTT...  (ALT)
                 |
                0.98   = probability that this base is an acceptor based on ALT haplotype sequence
             

chr2:47790924-C-CAGTTG:
             
                       0.98  = probability that this base is an acceptor based on REF haplotype sequence
                        |                
...CCTTTTGGCAAC|-----|AGTTGTGACTTCTCACCAGGAGATTTGGTTT...  (REF)
...CCTTTTGGCAAC|AGTTG|AGTTGTGACTTCTCACCAGGAGATTTGGTTT...  (ALT)
                  |                
                 0.98  = probability that this base is an acceptor based on ALT haplotype sequence

we see that the model is predicting the same thing for both chr2:47790929-G-GAGTTG and chr2:47790924-C-CAGTTG - i.e. the first GT in each sequence is predicted to be a splice acceptor. For chr2:47790924-C-CAGTTG, however, since the highest ALT score occurs within the inserted bases, SpliceAI collapses this to:

                 
chr2:47790924-C-CAGTTG:
             
                 0.98  = probability that this base is an acceptor based on REF haplotype sequence
                  |                
...CCTTTTGGCAAC|AGTTGTGACTTCTCACCAGGAGATTTGGTTT...
...CCTTTTGGCAAC|AGTTGTGACTTCTCACCAGGAGATTTGGTTT...
              |                
             0.98  = probability that this base is an acceptor within the *collapsed* ALT haplotype

Pangolin works the same way, and so has the same inconsistency.
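
The collapsing step described above can be reproduced on toy data. This is a simplified sketch, not the SpliceAI source: `collapse_insertion` and `delta_scores` are hypothetical names, the background scores of 0.0 are made up, and the sketch assumes the anchor base is pooled together with the inserted bases. Only the 0.98 acceptor probability and the coordinates come from the diagrams above.

```python
# Toy reference window from the diagrams above.
# ASSUMPTION: the start coordinate is inferred from the thread.
REF_SEQ = "CCTTTTGGCAACAGTTGTGACTTCTCACCAGGAGATTTGGTTT"
REF_START = 47790913  # 1-based coordinate of REF_SEQ[0]
N_INSERTED = 5        # length of the inserted AGTTG


def collapse_insertion(alt_scores, anchor_idx, n_inserted):
    """Map ALT-haplotype scores back onto reference coordinates.

    ASSUMPTION: the anchor base and the inserted bases are pooled into a
    single value (their max), placed at the anchor position.
    """
    end = anchor_idx + 1 + n_inserted
    pooled = max(alt_scores[anchor_idx:end])
    return alt_scores[:anchor_idx] + [pooled] + alt_scores[end:]


def delta_scores(ref_scores, alt_scores, pos, n_inserted):
    """Return {genomic position: ALT - REF} for all non-zero deltas."""
    collapsed = collapse_insertion(alt_scores, pos - REF_START, n_inserted)
    assert len(collapsed) == len(ref_scores)
    deltas = {}
    for i, (r, a) in enumerate(zip(ref_scores, collapsed)):
        d = round(a - r, 2)
        if d != 0:
            deltas[REF_START + i] = d
    return deltas


# Hypothetical per-base acceptor probabilities: 0.98 at the acceptor, 0.0 elsewhere.
acceptor_idx = 47790927 - REF_START
ref_scores = [0.0] * len(REF_SEQ)
ref_scores[acceptor_idx] = 0.98
# Both representations give the same ALT haplotype, with the acceptor
# predicted at the same offset from the start of the window.
alt_scores = [0.0] * (len(REF_SEQ) + N_INSERTED)
alt_scores[acceptor_idx] = 0.98

print(delta_scores(ref_scores, alt_scores, 47790929, N_INSERTED))
print(delta_scores(ref_scores, alt_scores, 47790924, N_INSERTED))
```

With identical per-haplotype predictions, the first representation yields no non-zero deltas, while the second yields a gain at chr2:47790924 and a loss at chr2:47790927. The difference comes entirely from where the inserted bases sit relative to the high-scoring position.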

I now think this second representation is more problematic. Rather than showing the variant as shifting the splice acceptor 3bp to the left, it's more correct to show it as having no effect on splicing but rather inserting 5 bases into the exon @ chr2:47790929. So although I still want to change SpliceAI-lookup to left-align indels by default, in this case it would yield the more problematic representation. What do you think?

Mar 04 '24 03:03 bw2

I now think that, although insertions in repetitive regions make this problem more obvious (since it's common to see variants there with multiple equivalent representations), this isn't confined to repetitive regions.

As a general summary, high SpliceAI and Pangolin scores should be treated with caution for any insertion variant where the inserted bases are at least partially the same as adjacent reference sequence - like in chr2:47790924-C-CAGTTG below - because technical artifacts in score reporting can lead to false positive results.

GCAAC|-----|AGTTGTG  (REF)
GCAAC|AGTTG|AGTTGTG  (ALT)
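
One rough way to flag such insertions is to count how many inserted bases repeat the reference sequence on either side of the insertion point. The sketch below is only a heuristic illustration, not something SpliceAI-lookup is known to do: `shared_prefix_len` and `insertion_overlap` are hypothetical helpers, and how much overlap should trigger a warning is left to the caller.

```python
def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def insertion_overlap(upstream, inserted, downstream):
    """Return (left, right) overlap between inserted bases and flanking reference.

    upstream:   reference bases ending at the insertion point (anchor included)
    inserted:   the inserted bases only (without the anchor base)
    downstream: reference bases starting right after the insertion point
    """
    right = shared_prefix_len(inserted, downstream)
    left = shared_prefix_len(inserted[::-1], upstream[::-1])
    return left, right


# chr2:47790924-C-CAGTTG: the inserted bases repeat the downstream reference.
print(insertion_overlap("GCAAC", "AGTTG", "AGTTGTG"))
# chr2:47790929-G-GAGTTG: the same insertion repeats the upstream reference.
print(insertion_overlap("ACAGTTG", "AGTTG", "TGACTTC"))
```

A non-zero value on either side means the insertion can be written in more than one equivalent way, which is the situation where the reported scores deserve a closer look.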

Mar 05 '24 13:03 bw2

I agree with all of the above. It would be a wonderful addition to show the results with respect to both the REF sequence and the ALT sequence, to avoid any ambiguity and misinterpretations/misrepresentations.

Mar 07 '24 17:03 SophieCandille

A table of scores for inserted bases has now been added - for example @ https://spliceailookup.broadinstitute.org/#variant=13-21155151-C-CAGTTTTCTTTGTTGCTGACATCTCGGATGTTCTGTCCATGTTTAAGGAACCTTTTACTGGGTA&hg=38&bc=comprehensive&distance=500&mask=0&ra=1

Nov 13 '24 05:11 bw2
