
RNA model for bonito

Open CDieterich opened this issue 4 years ago • 10 comments

Dear developers,

would you be able to provide an RNA model for bonito somewhere?

bonito basecaller rna_r9.4.1 /data/reads > basecalls.fasta

Thank you Christoph

CDieterich avatar Mar 20 '20 08:03 CDieterich

Hi @CDieterich

I have not trained an RNA model yet, I will update this issue if things change.

Regards

iiSeymour avatar Mar 20 '20 10:03 iiSeymour

Excellent.

CDieterich avatar Mar 20 '20 13:03 CDieterich

BTW, is there a manual for doing the training myself?

CDieterich avatar Apr 02 '20 06:04 CDieterich

I'd also be interested in any documentation on using bonito train. Is the process similar to taiyaki? From what I understood from Clive's talk at the Nanopore Community Meeting, the structure was simpler?

callumparr avatar May 03 '20 10:05 callumparr

I think it should be straightforward; if not, let me know.

First, make sure you have the training data downloaded.

$ bonito download --training

Then run bonito train and give it an output directory.

$ bonito train model-train-dir
[loading data]
[loading model]
[990000/990000]: 100%|#########################################| [1:23:46, loss=0.2546]
[epoch 1] directory=model-train-dir loss=0.2496 mean_acc=92.351% median_acc=93.035%
[990000/990000]: 100%|#########################################| [1:23:40, loss=0.2010]
[epoch 2] directory=model-train-dir loss=0.2201 mean_acc=93.310% median_acc=94.000%
[990000/990000]: 100%|#########################################| [1:23:41, loss=0.2255]
[epoch 3] directory=model-train-dir loss=0.2038 mean_acc=93.847% median_acc=94.527%
[990000/990000]: 100%|#########################################| [1:23:40, loss=0.2018]
[epoch 4] directory=model-train-dir loss=0.1964 mean_acc=94.090% median_acc=94.608%
[990000/990000]: 100%|#########################################| [1:23:32, loss=0.2001]
[epoch 5] directory=model-train-dir loss=0.1899 mean_acc=94.318% median_acc=95.025%
[990000/990000]: 100%|#########################################| [1:23:32, loss=0.1862]
[epoch 6] directory=model-train-dir loss=0.1871 mean_acc=94.383% median_acc=95.025%
[990000/990000]: 100%|#########################################| [1:23:31, loss=0.1678]
[epoch 7] directory=model-train-dir loss=0.1813 mean_acc=94.583% median_acc=95.098%
[990000/990000]: 100%|#########################################| [1:23:41, loss=0.1916]
[epoch 8] directory=model-train-dir loss=0.1793 mean_acc=94.634% median_acc=95.396%
[990000/990000]: 100%|#########################################| [1:23:34, loss=0.1865]
[epoch 9] directory=model-train-dir loss=0.1764 mean_acc=94.755% median_acc=95.500%
[990000/990000]: 100%|#########################################| [1:23:32, loss=0.1565]
[epoch 10] directory=model-train-dir loss=0.1763 mean_acc=94.737% median_acc=95.500%
[990000/990000]: 100%|#########################################| [1:23:32, loss=0.1580]
[epoch 11] directory=model-train-dir loss=0.1739 mean_acc=94.836% median_acc=95.522%
[125184/990000]:  13%|#######                                  | [10:35, loss=0.1572]
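
To follow how loss and accuracy change from epoch to epoch, the `[epoch N]` summary lines in a log like the one above can be pulled out with a short script. This is only a sketch: it assumes the line format stays exactly as printed here, and the names `EPOCH_RE` and `parse_epochs` are made up for illustration and are not part of bonito.

```python
import re

# Matches the per-epoch summary lines shown in the log above.
# Assumption: the format is exactly "[epoch N] ... loss=X mean_acc=Y% median_acc=Z%".
EPOCH_RE = re.compile(
    r"\[epoch (?P<epoch>\d+)\].*?loss=(?P<loss>[\d.]+)"
    r"\s+mean_acc=(?P<mean>[\d.]+)%\s+median_acc=(?P<median>[\d.]+)%"
)


def parse_epochs(log_text):
    """Return one dict per epoch summary line; progress-bar lines are skipped."""
    rows = []
    for line in log_text.splitlines():
        match = EPOCH_RE.search(line)
        if match:
            rows.append({
                "epoch": int(match["epoch"]),
                "loss": float(match["loss"]),
                "mean_acc": float(match["mean"]),
                "median_acc": float(match["median"]),
            })
    return rows


# Three lines copied from the log above, including one progress-bar line.
sample = """\
[990000/990000]: 100%|#####| [1:23:34, loss=0.1865]
[epoch 9] directory=model-train-dir loss=0.1764 mean_acc=94.755% median_acc=95.500%
[epoch 10] directory=model-train-dir loss=0.1763 mean_acc=94.737% median_acc=95.500%
[epoch 11] directory=model-train-dir loss=0.1739 mean_acc=94.836% median_acc=95.522%
"""

rows = parse_epochs(sample)
best = max(rows, key=lambda r: r["mean_acc"])
print(best["epoch"], best["mean_acc"])  # prints: 11 94.836
```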

By default, training uses 1 million chunks with a 1% validation split. The progress bar shows each epoch's progress over the 990,000 training examples, with the training loss updating after each batch. At the end of each epoch, the validation loss and accuracy are reported.
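
As a quick check of the arithmetic, the 990,000 figure in the progress bar follows from holding out 1% of 1 million chunks. The sketch below is plain Python and does not call bonito; `split_chunks` is a hypothetical helper used only to illustrate the split.

```python
from typing import Tuple


def split_chunks(total_chunks: int, validation_fraction: float) -> Tuple[int, int]:
    """Return (training, validation) chunk counts for a held-out fraction."""
    n_valid = round(total_chunks * validation_fraction)
    return total_chunks - n_valid, n_valid


# Defaults described above: 1 million chunks, 1% held out for validation.
n_train, n_valid = split_chunks(1_000_000, 0.01)
print(n_train, n_valid)  # prints: 990000 10000
```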

iiSeymour avatar May 03 '20 11:05 iiSeymour

I am just wondering: does the 'bonito download --training' command download all the training data needed to train a satisfactory model? Many thanks!

snower2010 avatar May 18 '20 07:05 snower2010

@snower2010 bonito download --training will give you the full training set for dna_r9.4.1 that is used to train the model shipped with bonito. I'm currently only focusing on a single condition.

HTH,

Chris.

iiSeymour avatar May 18 '20 10:05 iiSeymour

Got it! Many thanks! By the way, could you also tell me the configuration and the time cost for this specific training? Thanks!

snower2010 avatar May 19 '20 03:05 snower2010

OK, I got back to this now.

Any developments on this aspect (RNA modifications), @iiSeymour?

I would be happy to do it myself, provided there is some documentation for training from scratch or from a pretrained model.

Thank you

CDieterich avatar Sep 11 '20 09:09 CDieterich

Continuing this thread: would it be worth it if a few of us here put our heads together and attempted to train a [direct] RNA model? I know there are boatloads of direct RNA and cDNA data from the human NA12878 runs (https://github.com/nanopore-wgs-consortium/NA12878/tree/master/nanopore-human-transcriptome).

biobenkj avatar Dec 14 '20 19:12 biobenkj