Splicing models beyond SpliceAI

Open ZhiyuanChen opened this issue 6 months ago • 7 comments

Hi,

As AI4Bio has attracted increasing attention, many transformer-based foundation models have been fine-tuned for splice site prediction and are achieving satisfactory results.

Can I ask whether you plan to incorporate additional models into the SpliceAI-lookup site?

I have been developing MultiMolecule, where we host many foundation models in our model hub. We have also implemented a secondary structure prediction pipeline that enables RNA secondary structure inference in just a few lines, like:

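# multimolecule is imported for its side effect of registering its models and pipelines
import multimolecule as mm
from transformers import pipeline

# Load the secondary structure pipeline with a model from the model hub
predictor = pipeline("rna-secondary-structure", model="multimolecule/ernierna-ss")
# Run inference on a short RNA sequence
output = predictor("AUCG")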

We have recently added the SpliceAI model to our model hub, and I think we could build a similar pipeline for splice site prediction.

ZhiyuanChen avatar May 20 '25 12:05 ZhiyuanChen

Have these new models been published and shown to improve splicing prediction accuracy and/or variant interpretation?

Secondary structure prediction is interesting, but AFAIK its importance would be difficult to interpret in most cases (e.g., when does it matter if a variant changes the secondary structure?). Are there papers that show otherwise?

Finally, does your model hub only host the weights or do you also have API endpoints for performing inference?

bw2 avatar May 20 '25 13:05 bw2

Some of them are published; others are still under review (I know of two that have been under review for over a year; it takes ages). We host 12 models in 23 variants, and we are constantly adding more (we are currently working on 3 other models). You can find a full list of these models on our model hub.

In our internal testing, some models achieve an AUPRC improvement of more than 10 percentage points over SpliceAI (0.5963366 vs. 0.47926119). We run these tests to provide a fair benchmark for all models, since not all of them were evaluated under the same settings.
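For reference, the two AUPRC figures quoted above can be compared either as an absolute difference or as a relative change. The short sketch below only does that arithmetic; the variable names are illustrative, and the numbers are the ones from this comment.

```python
# AUPRC values quoted in the comment above
candidate_auprc = 0.5963366
spliceai_auprc = 0.47926119

# Absolute difference, in AUPRC units (x100 gives percentage points)
absolute_gain = candidate_auprc - spliceai_auprc

# Relative change with respect to the SpliceAI baseline
relative_gain = absolute_gain / spliceai_auprc

print(f"absolute gain: {absolute_gain:.4f} ({absolute_gain * 100:.1f} percentage points)")
print(f"relative gain: {relative_gain:.1%}")
```

Read as an absolute difference this is about 11.7 percentage points; read as a relative change it is about 24%.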

No, most papers focus on models and results. And tbh, I don't really believe in model interpretation, as several studies, such as "Attention is not Explanation", have shown that hidden states are not explainable.

At this stage, the model hub only hosts the weights. MultiMolecule is my personal project for now, and inference endpoints are too expensive. We do plan to provide ONNX support so that these models can run in browsers with ONNX Runtime Web, but I'm not sure whether that is what users want.

ZhiyuanChen avatar May 20 '25 13:05 ZhiyuanChen

We are unlikely to add models until they're published (along with benchmarks) and gain some traction in the community.

Also, in the SpliceAI paper, the reported AUPRC is 0.98, so I'm curious what benchmark you are using to get 0.479. This is from Figure 1E: Image

Last thing - by interpretation I meant variant interpretation rather than generic AI model interpretability.

bw2 avatar May 21 '25 01:05 bw2

> until they're published

As I said, many of them are published.

> and gain some traction in the community

On average, each model is downloaded ~8k times per month, and the most popular one ~24k times per month. Can I ask whether there is a definition of "traction"? I'm asking only because I want to know when I should open another issue, if I ever need to.

> in the SpliceAI paper, the reported AUPRC is 0.98

Yes, their reported results look impressive. But that is largely due to the dataset.

Our benchmark dataset is closer to the one used by DeltaSplice; see Figure 4 of the DeltaSplice paper.

Image
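As a general illustration of why AUPRC values from different datasets are hard to compare directly, the sketch below scores the same toy model on two datasets that differ only in how many negatives they contain. The scores are made up, `average_precision` is a plain-Python version of one common AUPRC estimator, and nothing here is a claim about how the SpliceAI or DeltaSplice benchmarks were actually constructed.

```python
def average_precision(scores, labels):
    """Average precision: mean of the precision at the rank of each positive."""
    # Rank examples from highest to lowest score
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / true_positives


# Toy scores, invented for illustration: positives tend to score higher
positive_scores = [0.9, 0.8, 0.6, 0.4]
negative_scores = [0.7, 0.5, 0.3, 0.2]


def auprc_with_negative_copies(copies):
    """Evaluate the same toy scores with each negative repeated `copies` times."""
    negatives = negative_scores * copies
    scores = positive_scores + negatives
    labels = [1] * len(positive_scores) + [0] * len(negatives)
    return average_precision(scores, labels)


ap_balanced = auprc_with_negative_copies(1)     # 4 positives, 4 negatives
ap_imbalanced = auprc_with_negative_copies(10)  # 4 positives, 40 negatives

print(f"balanced:   {ap_balanced:.3f}")
print(f"imbalanced: {ap_imbalanced:.3f}")
```

The per-example scores are identical in both runs, yet the metric drops from about 0.854 to about 0.599 once negatives outnumber positives ten to one, because precision depends on class prevalence.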

ZhiyuanChen avatar May 21 '25 07:05 ZhiyuanChen

Interesting. I was not aware of DeltaSplice. What are the other published models?

bw2 avatar May 21 '25 12:05 bw2

> Interesting. I was not aware of DeltaSplice. What are the other published models?

I haven't tested them yet. I'm currently working on another paper, so it might take some time for me to test them. I'll get back to you once I have the results.

ZhiyuanChen avatar May 21 '25 18:05 ZhiyuanChen

I understand that there are concerns with the new transformer-based models. Many of them are not published, and we need their trained weights for model conversion, which many of them do not provide.

Maybe we could start with published works like CI-SpliceAI and DeltaSplice?

The goal of MultiMolecule's inference pipelines is to provide a unified API for each task. And I really think splicing is an important task that we should be working on.

ZhiyuanChen avatar May 22 '25 07:05 ZhiyuanChen