Short reads and few repeats called with Bonito in non-model PromethION data

Open andreaswallberg opened this issue 3 years ago • 1 comments

Dear developers,

Following a brief discussion on Twitter, I will here present some stats from comparing Bonito 0.3.2 against Guppy 4.2.2 (HAC and Fast) basecalls using an Nvidia RTX 3090 on Ubuntu 20.04 and a small batch of Nanopore PromethION data produced from a non-model organism with a large and repetitive genome.

A bit of background: this data was generated after performing a nuclease flush of a flow cell and loading a second round of DNA from the same original DNA extraction. This particular re-run of a flow cell was not highly successful: it generated only ~500 Mbp with an average reported read Q-value of 8.7, but it was part of our learning process as we tried to find a strategy to optimize sequence yield from our organism. I picked this particular archive mostly because it was small and convenient for testing.

We already knew that this genome has low GC content (~31% GC) and we expect it to have many microsatellites and other tandem repeats. It is an unusual genome compared to those typically at the center of sequence methods development.

I will report on the installation process and performance of the tools elsewhere and here just focus on some curious differences detected in the base composition between Bonito and Guppy.

The stats were computed with some basic Perl scripts or with the venerable but brilliant SciRoKo 3.4 microsatellite search tool (https://kofler.or.at/bioinformatics/SciRoKo/). The GUI version of SciRoKo produces useful motif statistics, can be run using Mono on Linux, and scans the FASTA sequences in less than a minute.
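For anyone wanting to reproduce the read-level stats without the Perl scripts, here is a minimal Python sketch of the same kind of summary (read count, total bp, mean, N50, max, %GC). It is not the script used for the numbers in this post; the function names are made up for illustration, and it assumes plain 4-line FASTQ records without wrapped sequence lines.

```python
from typing import Iterable, Iterator, TextIO


def read_fastq_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line of each record (assumes strict 4-line FASTQ)."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def n50(lengths: Iterable[int]) -> int:
    """Length L such that reads of length >= L contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(seqs: Iterable[str]) -> dict:
    """Compute read count, yield, length stats and %GC over all sequences."""
    lengths = []
    gc = 0
    for seq in seqs:
        seq = seq.upper()
        lengths.append(len(seq))
        gc += seq.count("G") + seq.count("C")
    total = sum(lengths)
    return {
        "n_reads": len(lengths),
        "total_bp": total,
        "mean_len": total / len(lengths) if lengths else 0.0,
        "n50": n50(lengths),
        "max_len": max(lengths, default=0),
        "gc_pct": 100.0 * gc / total if total else 0.0,
    }
```

Typical use would be `summarize(read_fastq_sequences(open("OP005_019_190411_NIW.fastq")))` for the Bonito output; the Guppy output directories would need their FASTQ files concatenated first.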

None of the data here has been analyzed by mapping against a reference. AFAIK, adapters/barcodes have not been processed.

For basecalling, I used the following commands:

1. Bonito: bonito basecaller --recursive --fastq dna_r9.4.1 OP005_019_190411_NIW > OP005_019_190411_NIW.fastq

2. Guppy (HAC): guppy_basecaller -i OP005_019_190411_NIW -s guppy_4.2.2 -c ../data/dna_r9.4.1_450bps_hac_prom.cfg --device 'auto' --recursive

3. Guppy (Fast): guppy_basecaller -i OP005_019_190411_NIW -s guppy_4.2.2_fast -c ../data/dna_r9.4.1_450bps_fast_prom.cfg --device 'auto' --recursive

I do not pretend to understand all of the inner workings of Nanopore basecalling but hope that the developers and community can shed some light on the results and perhaps find it useful in the development of the basecallers!

Results:

1. Bonito

- Number of reads (n): 74,536
- Total bp produced (bp): 471,708,033
- Mean read length (bp): 6,329
- N50 read length (bp): 13,024
- Max read length (bp): 163,544
- GC (%): 31.3
- Simple Sequence Repeats (SSR) (%) (SciRoKo): 3.86
- Average SSR length (bp) (SciRoKo): 62
- Average mismatches (SciRoKo) (*): 1.31

'*' Not sure whether the unit is % or the number of mismatches over the average repeat length (1.31 / 62)

2. Guppy (HAC):

- Number of reads (n): 74,536
- Total bp produced (bp): 538,306,643
- Mean read length (bp): 7,222
- N50 read length (bp): 13,621
- Max read length (bp): 164,272
- GC (%): 31.1
- Simple Sequence Repeats (SSR) (%) (SciRoKo): 6.2
- Average SSR length (bp) (SciRoKo): 91
- Average mismatches (SciRoKo) (*): 2.70

3. Guppy (Fast):

- Number of reads (n): 74,536
- Total bp produced (bp): 540,124,270
- Mean read length (bp): 7,246
- N50 read length (bp): 13,340
- Max read length (bp): 239,338
- GC (%): 32.0
- Simple Sequence Repeats (SSR) (%) (SciRoKo): 5.2
- Average SSR length (bp) (SciRoKo): 70
- Average mismatches (SciRoKo) (*): 2.77
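If the SciRoKo "average mismatches" value is a count per repeat rather than a percentage (the footnote above leaves this open, so treat it as an assumption), it can be normalized by the average SSR length to compare the three callers on equal terms. A small sketch of that arithmetic, using only the numbers reported above:

```python
# (average SSR length in bp, average mismatches) as reported by SciRoKo above.
# ASSUMPTION: "average mismatches" is a count per repeat, not a percentage.
ssr_stats = {
    "Bonito": (62, 1.31),
    "Guppy (HAC)": (91, 2.70),
    "Guppy (Fast)": (70, 2.77),
}


def mismatches_per_100bp(avg_len_bp: float, avg_mismatches: float) -> float:
    """Mismatches per 100 bp of repeat under the count interpretation."""
    return 100.0 * avg_mismatches / avg_len_bp


density = {
    name: round(mismatches_per_100bp(length, mism), 2)
    for name, (length, mism) in ssr_stats.items()
}

for name, value in density.items():
    print(f"{name}: {value} mismatches per 100 bp of SSR")
```

Under that reading the densities come out at roughly 2.1 (Bonito), 3.0 (Guppy HAC) and 4.0 (Guppy Fast) mismatches per 100 bp of repeat. If the unit is already a percentage, this normalization does not apply.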

A few observations worth noting from the stats:

  • Bonito produces as many reads from the FAST5 data as Guppy but about 12% fewer base pairs, and its reads are shorter.
  • Bonito appears to produce fewer simple sequence repeats overall, and shorter ones. It also appears to have fewer errors / mismatches within the repeats than the data basecalled by Guppy.
  • Guppy HAC produces the most and longest repeats out of the three methods. Guppy Fast has markedly higher %GC and a very long read that was not called as such by the other methods.
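The "about 12%" figure can be checked directly from the reported totals. The sketch below only restates the numbers from the Results section; the helper name is illustrative.

```python
# Total bp produced, copied from the Results section above.
total_bp = {
    "Bonito": 471_708_033,
    "Guppy (HAC)": 538_306_643,
    "Guppy (Fast)": 540_124_270,
}


def percent_fewer(value: float, reference: float) -> float:
    """How much smaller `value` is than `reference`, in percent of `reference`."""
    return 100.0 * (reference - value) / reference


deficit_vs_hac = round(percent_fewer(total_bp["Bonito"], total_bp["Guppy (HAC)"]), 1)
deficit_vs_fast = round(percent_fewer(total_bp["Bonito"], total_bp["Guppy (Fast)"]), 1)

print(f"Bonito vs Guppy HAC:  {deficit_vs_hac}% fewer bp")
print(f"Bonito vs Guppy Fast: {deficit_vs_fast}% fewer bp")
```

Since the read count is identical (74,536) for all three callers, the same relative difference shows up in the mean read length.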

Some figures and observations:

  • Bonito produces an excess of very short reads (0-2kbp) from our data compared to Guppy.
  • Bonito produces fewer and shorter dinucleotide repeats (which make up the vast majority of detected repeats), in particular of the common "AT" microsatellite motif.

image
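For a quick check of these two observations without SciRoKo, something like the following could be used. It is a rough sketch only: the regular expression finds perfect dinucleotide repeats and does not reproduce SciRoKo's mismatch-tolerant search, and the 2 kbp cutoff and the minimum of 6 repeat units are illustrative choices, not the settings behind the figures.

```python
import re
from typing import List, Sequence, Tuple


def fraction_short(lengths: Sequence[int], cutoff_bp: int = 2000) -> float:
    """Fraction of reads shorter than cutoff_bp (0.0 for an empty input)."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n < cutoff_bp) / len(lengths)


def dinucleotide_repeats(seq: str, min_units: int = 6) -> List[Tuple[str, int, int]]:
    """Find perfect dinucleotide repeats as (motif, start, length_bp) tuples.

    Group 1 is the 2-bp unit; the lookahead forces its two bases to differ,
    so homopolymer runs such as AAAA are not reported as "AA" repeats.
    """
    pattern = re.compile(r"(([ACGT])(?!\2)[ACGT])\1{%d,}" % (min_units - 1))
    return [
        (m.group(1), m.start(), len(m.group(0)))
        for m in pattern.finditer(seq.upper())
    ]
```

Running `dinucleotide_repeats` over the same reads as called by Bonito and by Guppy, and tallying motifs and lengths, would give a SciRoKo-independent view of the "AT" repeat difference, although perfect-match counts will be lower than SciRoKo's.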

andreaswallberg avatar Dec 07 '20 23:12 andreaswallberg

Hey @andreaswallberg

This detailed analysis was really helpful - the issues you highlighted should be resolved with the [email protected] model that was just released in v0.3.7.

iiSeymour avatar Apr 21 '21 09:04 iiSeymour