0.3.x generates reads 10% longer than 0.4.x and 0.5.x???
Issue Report
Please describe the issue:
I am running benchmarks to compare the performance of 0.3.x, 0.4.x and 0.5.x with both the 4.2 and 4.3 sup models. Since 0.3.4 doesn't support the 4.3 model, I didn't run that combination. The test data I used is s3://ont-open-data/giab_2023.05/flowcells/hg002/20230429_1600_2E_PAO83395_124388f5/ - it was chosen because its throughput is quite close to what I routinely get at my workplace. The command I ran is like:

./dorado basecaller -r --modified-bases-models dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mCG_5hmCG@v2 dna_r10.4.1_e8.2_400bps_sup@v4.2.0 ./ont-open-data/giab_2023.05/hg002/20230429_1600_2E_PAO83395_124388f5/ > HG002.ubam

To measure base accuracy, I aligned all the reads to the HG002v1.0.1 diploid genome, which is claimed to be Q75.
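The %mismatch figures reported below were derived from the resulting alignments roughly along these lines (an illustrative sketch, not necessarily the exact pipeline: the NM-minus-indels arithmetic and the "hg002.bam" path are assumptions):

```python
import pysam

# Illustrative sketch only: overall mismatch rate from a BAM aligned to the
# HG002v1.0.1 assembly. Mismatches are approximated as NM (edit distance)
# minus inserted/deleted bases; "hg002.bam" is a placeholder path.
def overall_mismatch_rate(bam_path):
    mismatches = aligned = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for aln in bam:
            if aln.is_unmapped or aln.is_secondary or aln.cigartuples is None:
                continue
            nm = aln.get_tag("NM")
            indels = sum(l for op, l in aln.cigartuples if op in (1, 2))  # I, D
            mismatches += nm - indels
            aligned += sum(l for op, l in aln.cigartuples if op == 0)     # M
    return mismatches / aligned

print(f"%mismatch: {100 * overall_mismatch_rate('hg002.bam'):.4f}%")
```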
Please provide a clear and concise description of the issue you are seeing and the result you expect.

Observations and Questions:
- Surprisingly, 0.3.4 generates reads that are 10% longer than 0.4.3 and 0.5.3, with about 3x more bases in reads >=50K. I think this is more important than higher base accuracy, especially for assembly and SV calling. Why do the reads get shorter with newer dorado? Based on my experience with ONT data, a small percentage of reads are concatenations of the two strands of the same molecule. Do 0.4.3 and 0.5.3 have ways to combine or cut these such that the reads are shorter, and the longer 0.3.4 reads are actually useless?
- The 4.3 model performs better than the 4.2 model in both 0.4.3 and 0.5.3. Most notably, for the qs<10 reads the error rate drops from 17% to 10%. Perhaps there is room to lower the qs cutoff? (A sketch of how reads are split by qs follows this list.)
- Interestingly, dorado 0.4.3 performs slightly better than 0.5.3 in accuracy. What happened?
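The qs>=10 / qs<10 split used throughout the tables below is based on the per-read mean basecall quality. A minimal sketch of that partitioning is shown here, assuming the mean quality is available via the qs tag in Dorado's uBAM output (illustrative only, not the exact script):

```python
import pysam

# Illustrative sketch: partition reads by mean basecall quality, assuming the
# per-read mean qscore is stored in the "qs" tag of Dorado's uBAM output.
def split_by_qs(ubam_path, cutoff=10.0):
    high, low = [], []
    # check_sq=False because the uBAM has no reference sequences in its header
    with pysam.AlignmentFile(ubam_path, "rb", check_sq=False) as bam:
        for read in bam:
            (high if read.get_tag("qs") >= cutoff else low).append(read.query_name)
    return high, low

high, low = split_by_qs("HG002.ubam")
print(f"qs>=10: {len(high)} reads, qs<10: {len(low)} reads")
```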
Here are the numbers:
Sample | 0.3.4sup4.2 | 0.4.3sup4.2 | 0.4.3sup4.3 | 0.5.3sup4.2 | 0.5.3sup4.3 |
---|---|---|---|---|---|
throughput | 136,171,813,898 | 136,163,677,918 | 136,911,563,222 | 136,056,361,146 | 136,829,736,921 |
reads# | 6,098,647 | 6,838,511 | 6,756,725 | 6,797,810 | 6,792,511 |
Avg Read Length | 22,328.20 | 19,911.30 | 20,263.01 | 20,014.73 | 20,144.21 |
Longest Read | 1,235,205 | 1,235,107 | 1,169,268 | 1,235,562 | 1,169,627 |
N50 | 31,497 | 28,494 | 28,795 | 28,587 | 28,691 |
L50 | 1,480,764 | 1,894,854 | 1,842,836 | 1,870,156 | 1,862,148 |
>=50K bases | 31,356,856,556 | 9,152,175,262 | 12,326,674,738 | 10,350,709,902 | 11,288,671,398 |
>=100K bases | 567,342,104 | 218,324,013 | 259,793,618 | 239,468,193 | 253,731,275 |
>=900K reads (%) | 0.000033% | 0.000015% | 0.000030% | 0.000015% | 0.000029% |
>=500K reads (%) | 0.000426% | 0.000351% | 0.000237% | 0.000382% | 0.000236% |
>=400K reads (%) | 0.000771% | 0.000600% | 0.000459% | 0.000662% | 0.000456% |
>=300K reads (%) | 0.001427% | 0.001126% | 0.000814% | 0.001206% | 0.000810% |
>=200K reads (%) | 0.003329% | 0.002691% | 0.002398% | 0.002795% | 0.002459% |
>=100K reads (%) | 0.073574% | 0.021715% | 0.028224% | 0.024302% | 0.027192% |
Mean Reported Base Quality | 35.55073319 | 35.55122382 | 35.70586218 | 35.56471732 | 35.70625785 |
Aligned (qs>=10) | 5,652,805 | 6,378,039 | 6,136,224 | 6,338,151 | 6,170,386 |
Unmapped (qs>=10) | 9,951 | 11,207 | 10,923 | 11,266 | 10,961 |
Aligned (qs<10) | 374,496 | 384,302 | 544,008 | 383,422 | 543,983 |
Unmapped (qs<10) | 61,397 | 64,963 | 65,570 | 64,971 | 67,181 |
Bases Aligned (qs>=10) | 128,077,764,974 | 128,021,541,239 | 127,811,026,955 | 127,910,628,000 | 127,739,503,107 |
%mismatch (qs>=10) | 1.1271% | 1.2032% | 1.0342% | 1.2016% | 1.1257% |
%mismatch (qs<10) | 17.3097% | 17.5856% | 10.3937% | 17.5576% | 10.4031% |
Steps to reproduce the issue:
Please list any steps to reproduce the issue.
Run environment:
- Dorado version: 0.3.4, 0.4.3, 0.5.3
- Dorado command:
- Operating system: Ubuntu 20.04
- Hardware (CPUs, Memory, GPUs): 4xA100 80GB
- Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance):
- Source data location (on device or networked drive - NFS, etc.):
- Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
- Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):
Logs
- Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)
Hi @ymcki,
What is happening here is most likely that you are seeing the effect of read splitting, which was enabled in Dorado v0.4.0. In Dorado v0.3.4 and earlier, reads were not split - this means that occasionally two simplex reads would be called as one read (biologically, this happens because two reads follow each other very quickly in a pore).
So, v0.4.0 and above are likely giving you the correct N50, L50 etc. This is why the number of bases aligned is very similar but the L50 and N50 are different.
Thanks for your detailed explanation. I think that explains most of it.
Do you know whether there is special handling for reads that are a concatenation of the two strands of the same molecule? I still see them in the 0.5.x data. Is this being handled but missing a small number of them, or not handled at all for now?
A reply in this thread says the 4.3 model is not optimized for 0.4.x: https://github.com/nanoporetech/dorado/issues/691. Yet I am getting slightly better numbers for 0.4.x than 0.5.x. What's going on? Is it possible to tune 0.5.x to make it perform closer to 0.4.x?
Hi @ymcki
There was one further functional change between Dorado v0.4.3 and v0.5.3, which was enabling adapter trimming (the full change list is available here: https://github.com/nanoporetech/dorado/blob/release-v0.5.3/CHANGELOG.md).
In your results there are slightly more alignments in Dorado v0.5.3, which may be due to some additional reads crossing the alignment_score threshold now that the (unalignable) adapter sequence is excluded during alignment. The reads which sit very close to the alignment classification threshold are likely to be lower accuracy, which I believe is what is affecting your mismatch rate.
I would expect that taking results only from the reads which align with both v0.4.3/v0.5.3 will show very similar results. Alternatively, Dorado provides a --no-trim parameter which disables this behaviour, although we do not expect this to be necessary for the majority of workflows.
NB: although the version names for the models (v4.3.0) and Dorado versions (v0.4.3/v0.5.3) are similar, this is by coincidence and there are no specific optimisations to match similarly named model/basecaller versions.
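To illustrate the suggestion above about comparing only the reads that align under both versions, the aligned read IDs could be intersected along these lines (hypothetical sketch, not from the thread; the BAM paths are placeholders):

```python
import pysam

# Hypothetical sketch: keep only read IDs with a primary alignment in both
# BAMs, so the accuracy comparison covers the same set of reads.
def primary_aligned_ids(bam_path):
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return {
            aln.query_name
            for aln in bam
            if not (aln.is_unmapped or aln.is_secondary or aln.is_supplementary)
        }

common = primary_aligned_ids("hg002_v0.4.3.bam") & primary_aligned_ids("hg002_v0.5.3.bam")
print(f"{len(common)} reads align in both versions")
```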
As a continuation of my benchmarking work, I looked into the alignments of the longest reads. Typically, I found that the so-called "longest" reads (i.e. >200 Kbp) are either a concatenation of many reads (they cannot be structural variants, as I am aligning HG002 reads to its own T2T genome) or a concatenation of the forward and reverse strands of the same read. To measure the "true" read lengths, I only consider the primary and supplementary alignments with flags 0, 16, 2048 or 2064, and use pysam's CIGAR parser to calculate the "true" read length as the sum of the 'M' and 'I' operation lengths.
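The per-alignment length calculation described above looks roughly like this (simplified sketch; the file name is a placeholder):

```python
import pysam

# Simplified sketch of the "true" length calculation: keep primary and
# supplementary alignments (flags 0, 16, 2048, 2064) and sum the lengths of
# CIGAR 'M' (op 0) and 'I' (op 1) operations for each record.
KEEP_FLAGS = {0, 16, 2048, 2064}

def true_lengths(bam_path):
    lengths = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for aln in bam:
            if aln.flag in KEEP_FLAGS and aln.cigartuples:
                lengths.append(sum(l for op, l in aln.cigartuples if op in (0, 1)))
    return lengths
```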
Then I get this table for the "true" read lengths and their NA50.
Sample | 0.3.4sup4.2 | 0.4.3sup4.2 | 0.4.3sup4.3 | 0.5.3sup4.2 | 0.5.3sup4.3 |
---|---|---|---|---|---|
Aligned (qs>=10) | 5,652,805 | 6,378,039 | 6,136,224 | 6,338,151 | 6,170,386 |
Unmapped (qs>=10) | 9,951 | 11,207 | 10,923 | 11,266 | 10,961 |
Pri/Supp Aln bases (qs>=10) | 127,634,045,864 | 127,537,722,176 | 127,366,784,651 | 127,544,077,062 | 127,405,621,809 |
Pri/Supp reads (qs>=10) | 6,668,152 | 6,654,021 | 6,505,974 | 6,654,067 | 6,505,957 |
Avg Length (qs>=10) | 19,140.84 | 19,167.02 | 19,576.90 | 19,167.84 | 19,582.92 |
Longest Aligned Read (qs>=10) | 122,807 | 112,117 | 122,886 | 112,110 | 112,153 |
NA50 (qs>=10) | 27,839 | 27,838 | 27,906 | 27,840 | 27,914 |
LA50 (qs>=10) | 1,927,543 | 1,926,461 | 1,920,442 | 1,926,442 | 1,920,626 |
Aligned (qs<10) | 374,496 | 384,302 | 544,008 | 383,422 | 543,983 |
Unmapped (qs<10) | 61,397 | 64,963 | 65,570 | 64,971 | 67,181 |
Pri/Supp Aln bases (qs<10) | 7,437,748,683 | 7,498,638,603 | 7,958,129,983 | 7,503,655,339 | 7,961,015,204 |
Pri/Supp reads (qs<10) | 460,428 | 470,660 | 632,768 | 470,556 | 633,305 |
Avg Length (qs<10) | 16,153.99 | 15,932.18 | 12,576.69 | 15,946.36 | 12,570.59 |
Longest Aligned Read (qs<10) | 93,575 | 93,523 | 94,235 | 93,587 | 94,171 |
NA50 (qs<10) | 26,202 | 26,169 | 25,019 | 26,176 | 25,020 |
LA50 (qs<10) | 118,454 | 119,521 | 130,134 | 119,587 | 130,162 |
Since I count each supplementary alignment as a separate read, the total number of reads is higher than before. We can see that the number of reads jumped significantly for 0.3.4, most likely due to the absence of read splitting. On the other hand, the 4.3.0 model shows a bigger jump than the 4.2.0 model, so perhaps read splitting is not as well optimized for the 4.3.0 model as for the 4.2.0 model?
Interestingly, the average "true" read length is ~2.2% longer for the 4.3.0 model than for the 4.2.0 model for reads with qs>=10. However, NA50 seems to be about the same for all five cases. On the other hand, for qs<10, the 4.3.0 models are shorter than the 4.2.0 models in both average "true" read length and NA50.
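For reference, the NA50/LA50 values here can be computed from the per-alignment "true" lengths in the usual way (minimal sketch, not the exact script used):

```python
# Minimal sketch: NA50 is the alignment length at which half of the total
# aligned bases are contained in alignments at least that long; LA50 is the
# number of alignments needed to reach that half. "lengths" would be the
# per-alignment "true" lengths from the sketch above.
def na50_la50(lengths):
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count
```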
So the conclusion of all these exercises is that the 4.3.0 model is better than the 4.2.0 model. However, dorado 0.4.3 seems to have a slight edge over 0.5.3.