Question about model training

Open kuanchiun opened this issue 2 years ago • 4 comments

Hello!

I would like to build the custom model using bonito because some repeat sequences have basecalling errors in the default DNA model. And I have two questions.

  1. I noticed that the bonito training guidelines do not mention how to set a validation dataset for bonito train. Here are the commands I use to train a custom model:

bonito basecaller [email protected] --save-ctc --reference chm13_1.1/chm13.draft_v1.1_breakline.fasta fast5_data > training/ctc_data/basecalls.sam
bonito train --epochs 50 --lr 1e-5 --batch 30 --pretrained [email protected] --directory training/ctc_data training/model_ver1

and I got an output message like this: [image: validation missing]

Is there a parameter in bonito train that lets us set our own validation dataset?

  2. After generating the custom model, I used another dataset, generated in the same way, to verify it. I found that the basecalling errors in the repeat sequences were corrected, but another problem appeared.

Here are the results of using porechop to remove adapters from the same sample. Porechop command:
porechop -i sample.fastq -o sample_trim.fastq --adapter_threshold 55 --middle_threshold 85 --end_threshold 55 --end_size 50 --extra_end_trim 0 -t 32 -v 3 > sample_porechop.txt

Guppy with the default model: [image: porechop in guppy default model]

Bonito with the default model: [image: porechop in default model in bonito]

Bonito with the custom model: [image: porechop in training model in bonito]

I found that, compared with Guppy and bonito's default model, the number of adapters that porechop detects decreases dramatically with the custom model.

My guess is that bonito train is based on the reference genome, and because adapter sequences do not exist in the reference genome, the custom model is not trained to call them correctly.

Is there any method that can help us avoid this problem? Should I remove the adapter signal from the fast5 files?

Thanks, Ken Tung

kuanchiun avatar Aug 20 '21 04:08 kuanchiun

Hey @89213385

  1. You can use a specific set of reads as a validation ctc set for training by saving them inside the training directory in a validation folder (https://github.com/nanoporetech/bonito/blob/master/bonito/cli/train.py#L41).
bonito basecaller [email protected] --save-ctc --reference ref.fasta f5_train > train/ctc_data/calls.sam
bonito basecaller [email protected] --save-ctc --reference ref.fasta f5_valid > train/ctc_data/validation/calls.sam
  2. Yes, care must be taken around the boundaries of ctc chunks so as not to introduce such edge effects. Whilst the --save-ctc workflow makes it easy to train and fine-tune models, it's not the most robust in this regard. You can try turning up the default coverage requirement for a chunk with --ctc-min-coverage; the default is 90%, maybe 98/99% would help https://github.com/nanoporetech/bonito/blob/master/bonito/cli/basecaller.py#L99.
bonito basecaller [email protected] --ctc-min-coverage 0.99 --save-ctc --reference ref.fasta f5_train > train/ctc_data/calls.sam
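
To make the layout from point 1 concrete, here is a minimal Python sketch that checks whether a training directory contains a validation subfolder before bonito train is run. The helper names are hypothetical and not part of bonito, and the three file names are an assumption about what --save-ctc writes next to the SAM output; the train.py linked above is the place to confirm what is actually looked up.

```python
import os

# Assumption: array files that --save-ctc writes next to the SAM output.
ASSUMED_CTC_FILES = ("chunks.npy", "references.npy", "reference_lengths.npy")


def missing_ctc_files(directory, expected=ASSUMED_CTC_FILES):
    """List the expected files that are absent from `directory`."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(directory, name))
    ]


def describe_training_dir(directory):
    """Summarise the training set and its optional validation subfolder."""
    validation = os.path.join(directory, "validation")
    has_validation = os.path.isdir(validation)
    return {
        "train_missing": missing_ctc_files(directory),
        "has_validation_dir": has_validation,
        # With no validation folder, every expected file counts as missing.
        "validation_missing": (
            missing_ctc_files(validation)
            if has_validation else list(ASSUMED_CTC_FILES)
        ),
    }
```

Calling describe_training_dir("train/ctc_data") after both basecaller runs above should report a validation folder with nothing missing, if the file-name assumption holds.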

HTH

Chris.

iiSeymour avatar Aug 24 '21 22:08 iiSeymour

Hi @iiSeymour

Thank you for your reply. I will try adding my validation ctc set to the training directory.

As for question 2, I have no prior experience in model training, so I'm not sure how --ctc-min-coverage affects model training. Where can I find information about --ctc-min-coverage?

Thanks, Ken Tung.

kuanchiun avatar Aug 29 '21 10:08 kuanchiun

The --ctc-min-coverage option checks the called sequence against the target sequence. Low coverage would mean that some samples in the signal would not be assigned to any bases.

https://github.com/nanoporetech/bonito/blob/503fa9a4ff445c40ecff2db7a0308ecc2838a77e/bonito/io.py#L414
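
As a rough illustration of the idea, the sketch below treats coverage as the fraction of the called chunk that is spanned by its alignment to the reference, and drops chunks below a threshold. This is a simplified model under that assumption, with hypothetical function and parameter names; the io.py linked above is the authoritative implementation.

```python
def chunk_coverage(query_start, query_end, called_length):
    """Fraction of the called sequence spanned by its alignment."""
    if called_length <= 0:
        return 0.0
    return (query_end - query_start) / called_length


def keep_chunk(query_start, query_end, called_length, min_coverage=0.90):
    """Keep a chunk for training only if enough of it aligned."""
    coverage = chunk_coverage(query_start, query_end, called_length)
    return coverage >= min_coverage


# A 400-base call whose first 20 bases (for example, adapter) do not align
# has coverage 380 / 400 = 0.95.
kept_at_default = keep_chunk(20, 400, 400)                    # threshold 0.90
kept_at_strict = keep_chunk(20, 400, 400, min_coverage=0.99)  # threshold 0.99
```

Under this model, such a chunk passes the default 90% threshold but is excluded at 99%, so fewer chunks with unaligned ends would reach training.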

iiSeymour avatar Sep 14 '21 10:09 iiSeymour

Hi @iiSeymour

Thanks for your reply, I tried to use --ctc-min-coverage and got a better custom model.

In addition, I have another question. I am trying to estimate the length of certain repeat sequences. When I use the default model in either guppy or bonito for basecalling, I find that those repeat sequences contain errors, e.g. CCCTAA -> TGGCC.
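
For illustration, one simple way to estimate repeat length is to take the longest uninterrupted run of the motif in a read. This is a minimal sketch with a hypothetical helper name, not part of bonito or guppy, and it ignores the reverse-complement strand and any tolerance for mismatches.

```python
import re


def longest_repeat_run(sequence, motif="CCCTAA"):
    """Return (copies, bases) for the longest exact tandem run of `motif`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    run_lengths = [m.end() - m.start() for m in pattern.finditer(sequence)]
    longest = max(run_lengths, default=0)
    return longest // len(motif), longest
```

A single miscalled unit such as TGGCC splits one long run into two shorter ones, which is why basecalling errors inside the repeat make the estimate too short.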

I used CHM13 as the reference genome to train the custom model, but the repeat sequences in my sample are much longer than the corresponding repeat sequences in CHM13, as shown below. [image: bonito_ask]

So I think that, for bonito, some repeat sequences cannot be fully mapped to the reference genome.

Will this situation cause errors in the calculation of CTC data?

Thanks, Ken Tung

kuanchiun avatar Sep 16 '21 02:09 kuanchiun