Feasibility of training using generated data

Open william-simon opened this issue 11 months ago • 1 comments

Hello team, nice work on this project, I appreciate all of the development over the last several years on it.

My group and I work primarily on generic HW DNN accelerators, so we have much less capacity to gather sequencing data for training basecallers. I was therefore wondering about the feasibility of training a basecaller such as Dorado on synthetic signal data from Squigulator. I note that this isn't the focus of your paper, which concentrates on the downstream analysis, and that you also observe that noise, particularly amplitude noise, affects basecalling accuracy, with the optimum being around the experimental noise level the network was trained on. This would of course imply the inverse: if one trained a network on synthetic data with either zero noise or too much noise, test accuracy on experimental data would be sub-optimal, which is fairly obvious.

Did you experiment at all with training new basecallers on synthetic data? If so, how did it go? If not, do you think it would be possible, perhaps even just using synthetic data to augment experimental training data, either to increase the training set size or to train on genomes one doesn't have?
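To make the augmentation idea concrete, here is a minimal sketch of what I have in mind. It keeps every experimental example and adds synthetic ones until they make up a chosen fraction of the training set. The function name `mix_training_sets` and its parameters are hypothetical, not part of Squigulator or Dorado, and the examples are placeholders for whatever (signal, sequence) pairs a real training pipeline would use.

```python
import random


def mix_training_sets(real, synthetic, synthetic_fraction=0.5, seed=0):
    """Keep all real examples and add synthetic ones (hypothetical helper).

    synthetic_fraction is the target share of synthetic examples in the
    mixed set. It is capped by how many synthetic examples are available.
    """
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_syn / (n_real + n_syn) = synthetic_fraction for n_syn.
    n_syn = round(len(real) * synthetic_fraction / (1.0 - synthetic_fraction))
    n_syn = min(n_syn, len(synthetic))
    # Tag each example with its origin so accuracy can be tracked per source.
    mixed = [(x, "real") for x in real]
    mixed += [(x, "synthetic") for x in rng.sample(synthetic, n_syn)]
    rng.shuffle(mixed)
    return mixed


# Placeholder examples: integers stand in for (signal, sequence) pairs.
real_examples = list(range(8))
synthetic_examples = list(range(100, 120))
mixed = mix_training_sets(real_examples, synthetic_examples, synthetic_fraction=0.2)
```

Tagging each example with its source would make it possible to hold out experimental reads only for testing, which is the accuracy that matters here.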

Jan 27 '25 08:01 william-simon

Hello

Excellent question.

I am not well versed in DNN training, so I have never tried using Squigulator data (or real data, for that matter) to train my own NN. I do not know the exact answer to your question, but this is what I think.

Squigulator, like any other simulator, cannot capture all the unknown features in real data. We simulate only what we know, based on a pore-model table of current levels and other known variables. Also, the quality of the pore model directly affects the quality of the simulated reads. While the pore model provided by ONT for the R9 chemistry was great, the newer R10 one is not as good, probably because ONT now focuses more on NN-based methods than on pore-model-based methods.
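As a rough illustration of what "simulating only what we know" means, here is a toy sketch of pore-model-based signal generation. It is not Squigulator's actual implementation: the table `TOY_MODEL`, the function `simulate_signal`, the fixed dwell time per k-mer, and the Gaussian noise are all simplifying assumptions, and the current values are made up.

```python
import random

# Hypothetical toy pore model: k-mer -> (mean current, standard deviation).
# A real pore-model table covers every k-mer; these values are made up.
TOY_MODEL = {
    "AC": (80.0, 1.5),
    "CG": (95.0, 2.0),
    "GT": (110.0, 1.8),
}


def simulate_signal(seq, model, k=2, samples_per_kmer=5, noise_scale=1.0, seed=0):
    """Emit a noisy current trace by looking up each k-mer in the table.

    Assumptions: every k-mer dwells for the same number of samples, and
    amplitude noise is Gaussian with the table's standard deviation
    multiplied by noise_scale (0.0 gives an ideal, noise-free trace).
    """
    rng = random.Random(seed)
    signal = []
    for i in range(len(seq) - k + 1):
        mean, stdv = model[seq[i:i + k]]
        for _ in range(samples_per_kmer):
            signal.append(rng.gauss(mean, stdv * noise_scale))
    return signal


ideal = simulate_signal("ACGT", TOY_MODEL, noise_scale=0.0)
noisy = simulate_signal("ACGT", TOY_MODEL, noise_scale=1.0)
```

Anything not encoded in the table or the noise term, such as effects we do not yet understand in real reads, simply never appears in the output. That is the gap a network trained only on simulated data would have to bridge.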

So if you train an NN on Squigulator data, it would very likely work well on data generated by Squigulator. But depending on how sensitive the neural network is, it may not perform as well on real data. Then again, it may work better than expected; without actually testing, I don't think I can draw a conclusion. It would be an interesting thing to explore.

I do not know how difficult it is to train a DNN, but if it is not too difficult, it is something we could try. I would suggest starting with R9 data, as its pore model is currently of better quality.

Jan 28 '25 19:01 hasindu2008