High fail-read counts at negative-strand CpGs with Dorado 1.0.2 (5mCG) and Dorado 1.1.1(5mCG and 5mC context) but not dorado 1.0.2 (5mC model)
I have a PCR fragment with defined Methylation percentages. For this issue, lets focus on 0% methylated. Goal is to be able to accurately measure the input Methylation Percentage (here 0%). For this, we have 7 defined CpG-Islands, listed here:
chrom;start;end
chrX;71111367;71111369 chrX;71111452;71111454 chrX;71111464;71111466 chrX;71111541;71111543 chrX;71111621;71111623 chrX;71111642;71111644 chrX;71111706;71111708
First of all, the samples were sequenced on a PromethION.
For basecalling, dorado versions 1.1.1 and dorado 1.0.2 were both used for C as well as CG context. The Sup basecaller was used for both. For reconstructing heres the used command:
For C: dorado basecaller sup --modified-bases 5mC_5hmC --reference /home/drk/Age_20250326_TitrMethylPCR_ONT_IP_onlybams/HomoSapienshg38CLCGenomeChr.fa /home/drk/Einzel_Fast5_gesamt/gesamt.pod5 > /home/drk/Einzel_Fast5_gesamt/gesamtsup.bam
For CG: dorado basecaller sup --modified-bases 5mCG_5hmCG --reference /home/drk/Age_20250326_TitrMethylPCR_ONT_IP_onlybams/HomoSapienshg38CLCGenomeChr.fa /home/drk/Einzel_Fast5_gesamt/gesamt.pod5 > /home/drk/Einzel_Fast5_gesamt/gesamtsup.bam
After basecalling, the resulting bam was sorted and indexed:
samtools faidx /home/drk/Age_20250326_TitrMethylPCR_ONT_IP_onlybams/HomoSapienshg38CLCGenomeChr.fa samtools sort --write-index -o /home/drk/Einzel_Fast5_gesamt/gesamtssup.bam /home/drk/Einzel_Fast5_gesamt/gesamtsup.bam
Then, the next step was the modkit pileup with modkit version 0.5.0
modkit pileup /home/drk/Einzel_Fast5_gesamt/gesamtssup.bam --ref /home/drk/Age_20250326_TitrMethylPCR_ONT_IP_onlybams/HomoSapienshg38CLCGenomeChr.fa --max-depth 2047483647 /home/drk/Einzel_Fast5_gesamt/gesamtsup.bed
Sorting, indexing and modkit pileup were similar for C as well as CG-context.
The issue somehow happens only on the 1. CpG (negative strand) and the 5. CpG (also negative strand)
If i use the older dorado version 1.0.2 in C-context, on these 2 problematic positions theres "everything okay". But if i use dorado 1.0.2 in CG-context, the counts of failed reads will rise drastically.
Dorado 1.0.2 CG-context:
chrX 71111367 71111368 h 20195 + 71111367 71111368 255,0,0 20195 0.03 7 20186 2 369 272 127 475 chrX 71111367 71111368 m 20195 + 71111367 71111368 255,0,0 20195 0.01 2 20186 7 369 272 127 475 chrX 71111368 71111369 h 6 + 71111368 71111369 255,0,0 6 0.00 0 6 0 115 3 21304 10 chrX 71111368 71111369 m 6 + 71111368 71111369 255,0,0 6 0.00 0 6 0 115 3 21304 10 chrX 71111368 71111369 h 5459 - 71111368 71111369 255,0,0 5459 0.90 49 5404 6 552 14333 445 412 chrX 71111368 71111369 m 5459 - 71111368 71111369 255,0,0 5459 0.11 6 5404 49 552 14333 445 412
chrX 71111621 71111622 h 21032 + 71111621 71111622 255,0,0 21032 0.00 1 21031 0 9 94 18 152 chrX 71111621 71111622 m 21032 + 71111621 71111622 255,0,0 21032 0.00 0 21031 1 9 94 18 152 chrX 71111622 71111623 h 12059 - 71111622 71111623 255,0,0 12059 0.03 4 12054 1 32 9043 59 79 chrX 71111622 71111623 m 12059 - 71111622 71111623 255,0,0 12059 0.01 1 12054 4 32 9043 59 79
A acceptable CpG position for comparison: chrX 71111706 71111707 h 18176 + 71111706 71111707 255,0,0 18176 0.07 12 18161 3 41 2898 30 138 chrX 71111706 71111707 m 18176 + 71111706 71111707 255,0,0 18176 0.02 3 18161 12 41 2898 30 138 chrX 71111706 71111707 h 1 - 71111706 71111707 255,0,0 1 0.00 0 1 0 24 0 21219 17 chrX 71111706 71111707 m 1 - 71111706 71111707 255,0,0 1 0.00 0 1 0 24 0 21219 17 chrX 71111707 71111708 h 17 + 71111707 71111708 255,0,0 17 0.00 0 17 0 45 6 21209 5 chrX 71111707 71111708 m 17 + 71111707 71111708 255,0,0 17 0.00 0 17 0 45 6 21209 5 chrX 71111707 71111708 h 20731 - 71111707 71111708 255,0,0 20731 0.02 4 20726 1 55 260 66 149 chrX 71111707 71111708 m 20731 - 71111707 71111708 255,0,0 20731 0.00 1 20726 4 55 260 66 149
For dorado 1.1.1 in C-context:
chrX 71111367 71111368 h 20733 + 71111367 71111368 255,0,0 20733 0.00 0 20733 0 361 221 123 0 chrX 71111367 71111368 m 20733 + 71111367 71111368 255,0,0 20733 0.00 0 20733 0 361 221 123 0 chrX 71111367 71111368 h 31 - 71111367 71111368 255,0,0 31 0.00 0 31 0 139 66 20963 0 chrX 71111367 71111368 m 31 - 71111367 71111368 255,0,0 31 0.00 0 31 0 139 66 20963 0 chrX 71111368 71111369 h 17 + 71111368 71111369 255,0,0 17 0.00 0 17 0 130 3 21288 0 chrX 71111368 71111369 m 17 + 71111368 71111369 255,0,0 17 0.00 0 17 0 130 3 21288 0 chrX 71111368 71111369 h 9696 - 71111368 71111369 255,0,0 9696 0.00 0 9696 0 555 10515 433 0 chrX 71111368 71111369 m 9696 - 71111368 71111369 255,0,0 9696 0.00 0 9696 0 555 10515 433 0
chrX 71111621 71111622 h 21192 + 71111621 71111622 255,0,0 21192 0.00 0 21192 0 11 83 18 0 chrX 71111621 71111622 m 21192 + 71111621 71111622 255,0,0 21192 0.00 0 21192 0 11 83 18 0 chrX 71111621 71111622 h 2 - 71111621 71111622 255,0,0 2 0.00 0 2 0 14 5 21250 0 chrX 71111621 71111622 m 2 - 71111621 71111622 255,0,0 2 0.00 0 2 0 14 5 21250 0 chrX 71111622 71111623 h 2 + 71111622 71111623 255,0,0 2 0.00 0 2 0 7 6 21286 0 chrX 71111622 71111623 m 2 + 71111622 71111623 255,0,0 2 0.00 0 2 0 7 6 21286 0 chrX 71111622 71111623 h 10512 - 71111622 71111623 255,0,0 10512 0.00 0 10512 0 28 10675 56 0 chrX 71111622 71111623 m 10512 - 71111622 71111623 255,0,0 10512 0.00 0 10512 0 28 10675 56 0
chrX 71111706 71111707 h 20119 + 71111706 71111707 255,0,0 20119 0.00 0 20119 0 42 1092 30 0 chrX 71111706 71111707 m 20119 + 71111706 71111707 255,0,0 20119 0.00 0 20119 0 42 1092 30 0 chrX 71111706 71111707 h 18 - 71111706 71111707 255,0,0 18 0.00 0 18 0 27 6 21209 0 chrX 71111706 71111707 m 18 - 71111706 71111707 255,0,0 18 0.00 0 18 0 27 6 21209 0 chrX 71111707 71111708 h 21 + 71111707 71111708 255,0,0 21 0.00 0 21 0 47 12 21201 0 chrX 71111707 71111708 m 21 + 71111707 71111708 255,0,0 21 0.00 0 21 0 47 12 21201 0 chrX 71111707 71111708 h 20947 - 71111707 71111708 255,0,0 20947 0.00 0 20947 0 64 175 73 0 chrX 71111707 71111708 m 20947 - 71111707 71111708 255,0,0 20947 0.00 0 20947 0 64 175 73 0
For dorado 1.1.1 in CG-context:
chrX 71111367 71111368 h 20210 + 71111367 71111368 255,0,0 20210 0.02 5 20202 3 361 258 123 486 chrX 71111367 71111368 m 20210 + 71111367 71111368 255,0,0 20210 0.01 3 20202 5 361 258 123 486 chrX 71111368 71111369 h 8 + 71111368 71111369 255,0,0 8 0.00 0 8 0 130 4 21288 8 chrX 71111368 71111369 m 8 + 71111368 71111369 255,0,0 8 0.00 0 8 0 130 4 21288 8 chrX 71111368 71111369 h 5725 - 71111368 71111369 255,0,0 5725 0.87 50 5672 3 555 14035 433 451 chrX 71111368 71111369 m 5725 - 71111368 71111369 255,0,0 5725 0.05 3 5672 50 555 14035 433 451
chrX 71111621 71111622 h 20998 + 71111621 71111622 255,0,0 20998 0.00 1 20997 0 11 94 18 183 chrX 71111621 71111622 m 20998 + 71111621 71111622 255,0,0 20998 0.00 0 20997 1 11 94 18 183 chrX 71111621 71111622 h 1 - 71111621 71111622 255,0,0 1 0.00 0 1 0 14 0 21250 6 chrX 71111621 71111622 m 1 - 71111621 71111622 255,0,0 1 0.00 0 1 0 14 0 21250 6 chrX 71111622 71111623 h 11562 - 71111622 71111623 255,0,0 11562 0.05 6 11555 1 28 9541 56 84 chrX 71111622 71111623 m 11562 - 71111622 71111623 255,0,0 11562 0.01 1 11555 6 28 9541 56 84
chrX 71111706 71111707 h 18402 + 71111706 71111707 255,0,0 18402 0.08 14 18384 4 42 2661 30 148 chrX 71111706 71111707 m 18402 + 71111706 71111707 255,0,0 18402 0.02 4 18384 14 42 2661 30 148 chrX 71111706 71111707 h 1 - 71111706 71111707 255,0,0 1 0.00 0 1 0 27 0 21209 23 chrX 71111706 71111707 m 1 - 71111706 71111707 255,0,0 1 0.00 0 1 0 27 0 21209 23 chrX 71111707 71111708 h 19 + 71111707 71111708 255,0,0 19 5.26 1 18 0 47 7 21201 7 chrX 71111707 71111708 m 19 + 71111707 71111708 255,0,0 19 0.00 0 18 1 47 7 21201 7 chrX 71111707 71111708 h 20718 - 71111707 71111708 255,0,0 20718 0.02 4 20713 1 64 243 73 161 chrX 71111707 71111708 m 20718 - 71111707 71111708 255,0,0 20718 0.00 1 20713 4 64 243 73 161
And now, the dorado 1.0.2 in C-context
chrX 71111367 71111368 h 1910 + 71111367 71111368 255,0,0 1910 0.00 0 1910 0 41 2 13 0 chrX 71111367 71111368 m 1910 + 71111367 71111368 255,0,0 1910 0.00 0 1910 0 41 2 13 0 chrX 71111367 71111368 h 8 - 71111367 71111368 255,0,0 8 0.00 0 8 0 16 1 1887 0 chrX 71111367 71111368 m 8 - 71111367 71111368 255,0,0 8 0.00 0 8 0 16 1 1887 0 chrX 71111368 71111369 h 2 + 71111368 71111369 255,0,0 2 0.00 0 2 0 15 0 1949 0 chrX 71111368 71111369 m 2 + 71111368 71111369 255,0,0 2 0.00 0 2 0 15 0 1949 0 chrX 71111368 71111369 h 1641 - 71111368 71111369 255,0,0 1641 0.00 0 1641 0 59 172 40 0 chrX 71111368 71111369 m 1641 - 71111368 71111369 255,0,0 1641 0.00 0 1641 0 59 172 40 0
chrX 71111621 71111622 h 1940 + 71111621 71111622 255,0,0 1940 0.00 0 1940 0 2 4 2 0 chrX 71111621 71111622 m 1940 + 71111621 71111622 255,0,0 1940 0.00 0 1940 0 2 4 2 0 chrX 71111622 71111623 h 1744 - 71111622 71111623 255,0,0 1744 0.00 0 1744 0 1 166 6 0 chrX 71111622 71111623 m 1744 - 71111622 71111623 255,0,0 1744 0.00 0 1744 0 1 166 6 0
chrX 71111706 71111707 h 1917 + 71111706 71111707 255,0,0 1917 0.00 0 1917 0 6 20 4 0 chrX 71111706 71111707 m 1917 + 71111706 71111707 255,0,0 1917 0.00 0 1917 0 6 20 4 0 chrX 71111707 71111708 h 5 + 71111707 71111708 255,0,0 5 0.00 0 5 0 2 0 1940 0 chrX 71111707 71111708 m 5 + 71111707 71111708 255,0,0 5 0.00 0 5 0 2 0 1940 0 chrX 71111707 71111708 h 1899 - 71111707 71111708 255,0,0 1899 0.00 0 1899 0 6 8 5 0 chrX 71111707 71111708 m 1899 - 71111707 71111708 255,0,0 1899 0.00 0 1899 0 6 8 5 0
Even though, the last data set is visibly smaller than the other, it is still from the same data set and the ratios between fail and validcov stay the same even if not only the first, but all pod5 are used. For clarification, the first 3 data sets were with all pod5s while the last is only the first pod5. The analysis for the last data set with all pod5s is currently rerunning and will be added as soon as possible. Here it is:
Dorado 1.0.2 C-context:
chrX 71111367 71111368 h 20875 + 71111367 71111368 255,0,0 20875 0.00 0 20875 0 369 67 127 0 chrX 71111367 71111368 m 20875 + 71111367 71111368 255,0,0 20875 0.00 0 20875 0 369 67 127 0 chrX 71111367 71111368 h 78 - 71111367 71111368 255,0,0 78 0.00 0 78 0 147 7 20969 0 chrX 71111367 71111368 m 78 - 71111367 71111368 255,0,0 78 0.00 0 78 0 147 7 20969 0 chrX 71111368 71111369 h 17 + 71111368 71111369 255,0,0 17 0.00 0 17 0 115 2 21304 0 chrX 71111368 71111369 m 17 + 71111368 71111369 255,0,0 17 0.00 0 17 0 115 2 21304 0 chrX 71111368 71111369 h 18315 - 71111368 71111369 255,0,0 18315 0.01 1 18314 0 552 1889 445 0 chrX 71111368 71111369 m 18315 - 71111368 71111369 255,0,0 18315 0.00 0 18314 1 552 1889 445 0
chrX 71111621 71111622 h 21252 + 71111621 71111622 255,0,0 21252 0.00 0 21252 0 9 26 18 0 chrX 71111621 71111622 m 21252 + 71111621 71111622 255,0,0 21252 0.00 0 21252 0 9 26 18 0 chrX 71111621 71111622 h 4 - 71111621 71111622 255,0,0 4 0.00 0 4 0 13 3 21252 0 chrX 71111621 71111622 m 4 - 71111621 71111622 255,0,0 4 0.00 0 4 0 13 3 21252 0 chrX 71111622 71111623 h 7 + 71111622 71111623 255,0,0 7 0.00 0 7 0 8 2 21285 0 chrX 71111622 71111623 m 7 + 71111622 71111623 255,0,0 7 0.00 0 7 0 8 2 21285 0 chrX 71111622 71111623 h 19467 - 71111622 71111623 255,0,0 19467 0.00 0 19467 0 32 1714 59 0 chrX 71111622 71111623 m 19467 - 71111622 71111623 255,0,0 19467 0.00 0 19467 0 32 1714 59 0
chrX 71111706 71111707 h 20890 + 71111706 71111707 255,0,0 20890 0.00 0 20890 0 41 322 30 0 chrX 71111706 71111707 m 20890 + 71111706 71111707 255,0,0 20890 0.00 0 20890 0 41 322 30 0 chrX 71111706 71111707 h 17 - 71111706 71111707 255,0,0 17 0.00 0 17 0 24 1 21219 0 chrX 71111706 71111707 m 17 - 71111706 71111707 255,0,0 17 0.00 0 17 0 24 1 21219 0 chrX 71111707 71111708 h 25 + 71111707 71111708 255,0,0 25 0.00 0 25 0 45 3 21209 0 chrX 71111707 71111708 m 25 + 71111707 71111708 255,0,0 25 0.00 0 25 0 45 3 21209 0 chrX 71111707 71111708 h 21065 - 71111707 71111708 255,0,0 21065 0.00 0 21065 0 55 75 66 0 chrX 71111707 71111708 m 21065 - 71111707 71111708 255,0,0 21065 0.00 0 21065 0 55 75 66 0
If anyone has useful tips or solutions for this problem, fell free to comment and Thank you very much in advance.
Hello @sxlaas,
This is a bit more of a Dorado question, but I think we can address it here.
There was a change in the 5mC_5hmC model from Dorado version 1.0.2 to 1.1.1 in that the latter picked up the v2 modified base models. The development of these models was motivated by the 5mC_5hmC model overcalling unmodified cytosine. You can read about the details in this issue. The CG version of the model didn't exhibit this problem, so the fact that the v1.0.2 CG and v1.1.1 models are similar is not surprising and in general I would say the v1.1.1 CG/C models are the ones to use, but the v1.0.2 CG model should be OK as well.
Here what you're seeing (correct me if I'm wrong) is that the "good" models (v1.1.1 C/CG and v1.0.2 CG) have high N_fail counts on a few of these (-)-strand CpGs, whereas v1.0.2 C-all-context does not. As you probably know, detection of base modification by Nanopore sequencing relies of deflections in the ionic current signal due to the chemical moieties on the bases. My guess is that this (-)-strand cytosine is difficult to call due to the deflection being small or noisy. This is represented as higher N_fail counts by the 1.1.1 models and the v1.0.2 CG model. The 1.0.2 C-all-context model on the other hand has a strong prior for canonical cytosine, so you get more confident calls in this case, but I would be reluctant to say the calls will be better by this model.
tl;dr The calls from the Dorado v1.1.1 models are probably higher quality since they use a newer model which was produced to fix a error case in the 1.0.2 C-all-context model. You could use modkit extract calls ${bedmethyl} $extract --region chrX:71111367-71111368 and plot the distribution of call_prob values, if my hypothesis is correct, you'll see that the call probabilities for this position are generally lower than the (+)-strand probabilities.