Fewer total bases called when using bonito models compared to default guppy models

Open danrdanny opened this issue 3 years ago • 17 comments

Hi all, I've been working on understanding why my average coverage has been lower after switching to basecalling using bonito v031 models. I am curious if other users are seeing the same issue. This is consistent for mouse and human samples, and across ten or so samples I've checked. If this is a guppy issue, I am happy to post in the community.

Briefly, the total number of bases called for any experiment is relatively similar when I compare output from guppy 3.6.1, 4.0.11, and 4.4.0 using the default config file, but drops by ~15% when using the bonito 031 config files. The total number of reads is always the same regardless of config used. The data is from an adaptive sampling experiment in a human, thus the average read length is low.
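For anyone wanting to reproduce this comparison, here is a minimal sketch of how the read count and total bases of two FASTQ files could be compared. It assumes plain 4-lines-per-record FASTQ; the function names `summarize_fastq` and `percent_drop` and the toy records are made up for illustration and are not from guppy or bonito.

```python
import io

def summarize_fastq(handle):
    """Return read count, total bases and mean read length for a FASTQ stream.

    Assumes the common 4-lines-per-record layout (no wrapped sequences).
    """
    reads = 0
    bases = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of each record holds the sequence
            reads += 1
            bases += len(line.strip())
    mean_len = bases / reads if reads else 0.0
    return {"reads": reads, "bases": bases, "mean_len": mean_len}

def percent_drop(baseline_bases, other_bases):
    """Percentage of bases lost relative to the baseline basecall."""
    return 100.0 * (baseline_bases - other_bases) / baseline_bases

# Toy records standing in for two basecalls of the same read (made-up data).
default_call = io.StringIO("@read1\nACGTACGTAC\n+\nIIIIIIIIII\n")
bonito_call = io.StringIO("@read1\nACGTACGT\n+\nIIIIIIII\n")

a = summarize_fastq(default_call)
b = summarize_fastq(bonito_call)
print(a["reads"] == b["reads"], percent_drop(a["bases"], b["bases"]))
```

For real data the handles would come from something like `gzip.open(path, "rt")` instead of `io.StringIO`.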

I can give more examples, but the percentage drop in total bases called will be the same. Thoughts, suggestions? Thanks in advance.

Feb 14 '21 02:02 danrdanny

Hey @danrdanny - have you also compared aligned lengths?

Feb 14 '21 11:02 iiSeymour

Really good question @iiSeymour. I hadn't, but here's a summary for two samples. Both basecalled with guppy 4.4.0, one with the default model and the other with bonito. Aligned using minimap2 to hg38. I used samtools view -F 4 to only look at mapped reads. These are both adaptive sampling experiments, so the huge difference between all bases and >1kb is expected. Interesting that the percent bases lost isn't linear.
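A rough sketch of one way to total aligned bases from SAM text, mirroring the `samtools view -F 4` filter described above. It assumes "aligned bases" means query bases covered by M, I, = or X CIGAR operations (soft and hard clips excluded), and it does not filter secondary or supplementary records, so those would be counted as well. The helper names are illustrative only.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that consume query bases inside the alignment (clips excluded).
QUERY_ALIGNED_OPS = {"M", "I", "=", "X"}

def aligned_query_length(cigar):
    """Number of read bases that fall inside the alignment for one CIGAR string."""
    if cigar == "*":
        return 0
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in QUERY_ALIGNED_OPS)

def total_aligned_bases(sam_lines):
    """Sum aligned query bases over mapped records (same filter as -F 4)."""
    total = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 4:  # read unmapped
            continue
        total += aligned_query_length(fields[5])
    return total
```

Running this on the SAM output for each basecall (for example, the text produced by `samtools view -h`) gives totals that can be compared the same way as the raw base counts.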

Should I be looking at this in a different way? Anyone able to replicate this?

Feb 14 '21 23:02 danrdanny

If it's helpful, here's a plot of read lengths from two fastq files, one called with guppy 4.4.0 using the default model and the other with guppy 4.4.0 using the bonito 0.3.1 model. The sample is the same, and this is adaptive sampling data. I cut the x-axis off at 1kb. There's a left shift but also a peak of read lengths <100bp.
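The plot itself is not reproduced here, but the two features it describes (a left shift and a peak below 100bp) can be quantified with a sketch like the following. The 50bp bin width, the 1kb cut-off and the function names are assumptions chosen for illustration.

```python
from collections import Counter

def length_histogram(lengths, bin_width=50, max_len=1000):
    """Count reads per fixed-width length bin, ignoring reads of max_len or longer."""
    counts = Counter()
    for n in lengths:
        if n < max_len:
            counts[(n // bin_width) * bin_width] += 1  # key is the bin's lower edge
    return dict(sorted(counts.items()))

def fraction_shorter_than(lengths, threshold=100):
    """Fraction of reads whose length is below the threshold."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n < threshold) / len(lengths)
```

Comparing `fraction_shorter_than` between the two basecalls puts a number on the <100bp peak, and the histogram dictionaries can be passed to any plotting library.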

Feb 16 '21 18:02 danrdanny

We tried the CRF models on Guppy and we're observing the same phenomenon with viral amplicons. Even though the number of reads is identical, the read length of the resulting basecalls is around 40% shorter. Table below based on seqkit stats results:

Feb 17 '21 20:02 gallardo-seq

@danrdanny @gallardo-seq I've failed to find datasets to reproduce this so far but I do have some new models which may help. Are either of you able to test these in bonito or do you need a guppy/rerio model?

Feb 23 '21 17:02 iiSeymour

@iiSeymour thanks. I would prefer a guppy model as bonito frequently crashes my system. I can also share fast5 data with you if you want to see if you can replicate what I'm seeing.

Feb 23 '21 17:02 danrdanny

A fast5 file would really help thanks - https://nanoporetech.ent.box.com/f/7c4375e2b71b48258ebed29f198b89ab

Feb 23 '21 17:02 iiSeymour

@iiSeymour I saw that a new version of bonito is up with new models, I will go ahead and try running my test dataset again to see if I can replicate the issue.

Feb 24 '21 18:02 gallardo-seq

hi @iiSeymour ...do you by any chance have a guppy/rerio (compatible) model for the newest bonito model ([email protected])?

Feb 24 '21 21:02 Sumsarium

> hi @iiSeymour ...do you by any chance have a guppy/rerio (compatible) model for the newest bonito model ([email protected])?

I would also like this. I am about to embark on recalling lots of old runs, and if the updated models are to be transferred to the rerio repository soon I will wait until then.

Feb 27 '21 14:02 callumparr

@Sumsarium @callumparr models pushed to rerio this morning.

Mar 01 '21 11:03 iiSeymour

@iiSeymour I tried the new guppy CRF model (v032) and there's a noticeable improvement in the base yield (sum_len) and N50 values.
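For reference, the two metrics mentioned here can be computed from a list of read lengths as sketched below. This uses the usual definition of N50 (the length at which reads of that length or longer contain at least half of all bases); it is an illustrative helper, not seqkit's code.

```python
def sum_len(lengths):
    """Total base yield: the sum of all read lengths."""
    return sum(lengths)

def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

Because N50 is weighted by bases, a spike of very short reads barely moves it, so it is worth looking at it alongside the read length distribution.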

For transparency purposes, the length of the resulting reads is important for us since we're doing a concatemer-based approach for error correction of single molecule reads (https://www.biorxiv.org/content/10.1101/2021.01.27.428469v1.full). I'll put these new crf_v032 basecalls through our pipeline, hopefully we can see improvements on the other end.

Also, this might be a question for @cjw85, but are there any medaka models available for crf_v032 guppy basecalls?

Mar 23 '21 01:03 gallardo-seq

The new models in v0.3.7 do a much better job here - thanks @danrdanny for the dataset, it was really helpful.

Apr 21 '21 10:04 iiSeymour

Somewhat related: I ran a dataset through guppy 4.4.1 (I think) with either the flipflop HAC model or the bonito crf32 model from the rerio directory. There seems to be some disagreement for reads below a certain length, and a large spike in reads that are one or a few nucleotides long. From the Bioanalyzer trace of the prepared cDNA library I was expecting quite a bump at the 300 nt mark, as seen in the read length distribution with flipflop, but this is dramatically reduced with bonito.

Does bonito not do well with small reads like older guppy versions?

Full NanoComp stats of the FASTQs are here: NanoStats.txt

Apr 22 '21 14:04 callumparr

@callumparr, I think you should try with the 3.3 models in bonito 0.3.7. I'm basecalling now and will also compare.

Apr 22 '21 15:04 danrdanny

Hi everyone, I wanted to let you know that I re-ran my samples with the CRF v3.3 model. As you can see, the lengths are now comparable with Guppy (great news!). The only thing I wanted to ask about is the quality scores, which have dropped significantly. This was using Guppy 4.5.4 with the bonito 0.3.7 model exported as a .jsn file (thanks for the instructions). My guess is that the quality scores are not calibrated; is this the case? Thanks!
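A small sketch of how a per-read mean quality could be summarised when comparing the two basecalls. It assumes Phred+33 encoded FASTQ quality strings and averages the error probabilities rather than the raw Q values, which is one common convention; the function names are illustrative. It only summarises the reported scores and says nothing about whether they are calibrated, which would need a comparison against alignment accuracy.

```python
import math

def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]

def mean_read_quality(quality_string):
    """Mean quality of one read, computed by averaging error probabilities."""
    scores = phred_scores(quality_string)
    if not scores:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in scores) / len(scores)
    return -10 * math.log10(mean_error)
```

Averaging error probabilities gives a lower value than averaging Q scores directly whenever a read mixes high and low quality bases.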

Apr 29 '21 01:04 gallardo-seq

@gallardo-seq that looks much better, thanks for reporting. Yes, that is correct, it's not calibrated correctly.

Apr 29 '21 08:04 iiSeymour

The short read performance in bonito v0.6 is now much improved.

Sep 05 '22 15:09 iiSeymour
