bonito
Question about model training
Hello!
I would like to build a custom model using bonito because some repeat sequences have basecalling errors with the default DNA model. I have two questions.
- I noticed that the bonito training guidelines do not mention how to set a validation dataset for bonito train. Here are the commands I use to train a custom model.
bonito basecaller [email protected] --save-ctc --reference chm13_1.1/chm13.draft_v1.1_breakline.fasta fast5_data > training/ctc_data/basecalls.sam
bonito train --epochs 50 --lr 1e-5 --batch 30 --pretrained [email protected] --directory training/ctc_data training/model_ver1
and I got an output message like this.
Is there any parameter in bonito train that lets us set our own validation dataset?
- After generating the custom model, I used another dataset, generated in the same way, to verify it. I found that the basecalling errors in the repeat sequences were corrected, but another problem appeared.
Here are the results of using porechop to remove adapters from the same sample.
Porechop command : porechop -i sample.fastq -o sample_trim.fastq --adapter_threshold 55 --middle_threshold 85 --end_threshold 55 --end_size 50 --extra_end_trim 0 -t 32 -v 3 > sample_porechop.txt
Guppy with the default model
Bonito with the default model
Bonito with the custom model
I found that, compared with Guppy and bonito's default model, the number of adapters that porechop detects decreases dramatically with the custom model.
My guess is that because bonito train is based on the reference genome, and adapter sequences do not exist in the reference genome, the custom model is not trained correctly on adapters.
Is there any method that can help us avoid this problem? Should I remove the adapter sequence signal from the fast5 files?
Thanks, Ken Tung
Hey @89213385
- You can use a specific set of reads as a validation ctc set for training by saving them inside the training directory in a validation folder (https://github.com/nanoporetech/bonito/blob/master/bonito/cli/train.py#L41).
bonito basecaller [email protected] --save-ctc --reference ref.fasta f5_train > train/ctc_data/calls.sam
bonito basecaller [email protected] --save-ctc --reference ref.fasta f5_valid > train/ctc_data/validation/calls.sam
- Yes, care must be taken around the boundaries of ctc chunks so as not to introduce such edge effects. Whilst the --save-ctc workflow makes it easy to train and fine-tune models, it's not the most robust in this regard. You can try turning up the coverage requirement for a chunk with --ctc-min-coverage; the default is 90%, maybe 98/99% would help (https://github.com/nanoporetech/bonito/blob/master/bonito/cli/basecaller.py#L99).
bonito basecaller [email protected] --ctc-min-coverage 0.99 --save-ctc --reference ref.fasta f5_train > train/ctc_data/calls.sam
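For the validation folder described above, a quick sanity check of the directory layout before starting bonito train might look like the sketch below. It is only an illustration: the function name check_ctc_directory is hypothetical, and the file names chunks.npy, references.npy and reference_lengths.npy are an assumption about what --save-ctc writes, so confirm them against the files your bonito version actually produces.

```python
from pathlib import Path

# Assumed names of the files written by --save-ctc; verify for your bonito version.
EXPECTED_FILES = ("chunks.npy", "references.npy", "reference_lengths.npy")


def missing_files(folder):
    """List the expected CTC files that are not present in folder."""
    return [name for name in EXPECTED_FILES if not (Path(folder) / name).is_file()]


def check_ctc_directory(directory):
    """Report missing training files and whether a validation folder exists."""
    root = Path(directory)
    validation = root / "validation"
    has_validation = validation.is_dir()
    return {
        "train_missing": missing_files(root),
        "has_validation": has_validation,
        # If there is no validation folder, every validation file is missing.
        "validation_missing": missing_files(validation) if has_validation else list(EXPECTED_FILES),
    }


if __name__ == "__main__":
    import sys

    # Default path mirrors the example commands in this thread.
    target = sys.argv[1] if len(sys.argv) > 1 else "train/ctc_data"
    print(check_ctc_directory(target))
```

If has_validation is False, the validation basecalling command above has probably written its output somewhere other than the validation folder inside the training directory.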
HTH
Chris.
Hi @iiSeymour
Thank you for your reply. I will try adding my validation ctc set to the training directory for bonito train.
As for question 2, I have no previous experience in model training, so I'm not sure how --ctc-min-coverage affects model training.
Where can I find information about --ctc-min-coverage?
Thanks, Ken Tung.
The --ctc-min-coverage option checks the called sequence against the target sequence. Low coverage would mean that some samples in the signal would not be assigned to any bases.
https://github.com/nanoporetech/bonito/blob/503fa9a4ff445c40ecff2db7a0308ecc2838a77e/bonito/io.py#L414
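To make the coverage check more concrete, here is a minimal sketch of how such a filter could work. It assumes that coverage is the fraction of the called sequence spanned by its alignment to the reference; the function names and arguments are hypothetical and are not bonito's API, so treat the linked source as the authority.

```python
def chunk_coverage(query_start, query_end, called_length):
    """Fraction of the called sequence that is covered by the alignment.

    query_start and query_end are the alignment coordinates on the called
    sequence (0-based, end exclusive); called_length is its total length.
    """
    if called_length <= 0:
        raise ValueError("called_length must be positive")
    return (query_end - query_start) / called_length


def keep_chunk(query_start, query_end, called_length, min_coverage=0.90):
    """Keep a chunk for training only if its coverage meets the threshold."""
    return chunk_coverage(query_start, query_end, called_length) >= min_coverage


if __name__ == "__main__":
    # A 1000-base call whose first 30 bases (for example an adapter) do not align.
    print(chunk_coverage(30, 1000, 1000))
    print(keep_chunk(30, 1000, 1000, min_coverage=0.90))
    print(keep_chunk(30, 1000, 1000, min_coverage=0.99))
```

In this example the chunk has 97% coverage, so it is kept with the 90% default but dropped at 99%. Under the stated assumption, a stricter threshold removes chunks in which an unaligned stretch such as an adapter would otherwise leave signal with no target bases.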
Hi @iiSeymour
Thanks for your reply. I tried using --ctc-min-coverage and got a better custom model.
In addition, I have another question. I am trying to estimate the length of certain repeat sequences. When I use the default model in both guppy and bonito for basecalling, I find that those repeat sequences contain some errors, e.g. CCCTAA -> TGGCC.
I tried using CHM13 as the reference genome to train a custom model, but the repeat sequences in my sample are much longer than the corresponding repeat sequences in CHM13, as shown below.
So I think that, for bonito, some repeat sequences cannot be fully mapped to the reference genome.
Will this situation cause errors in the calculation of the CTC data?
Thanks, Ken Tung