The prediction results showed a low recall
Hi:
As I said in a previous issue, I'm trying to use taiyaki to train my modified RNA model, but I found that the prediction results show low recall. I'll go through the process of getting these results.
1. Follow the instructions to train a modified base model (about 140k reads covering 2000 transcriptome modification sites, approximately 1.5 modified bases per read).
2. Use the model to basecall the training set itself:
basecall.py --device 0 --modified_base_output basecalls.hdf5 ${training_set_reads} training2/model_final.checkpoint > basecalls.fa
3. Map basecalls.fa to the transcriptome with minimap2 to get an alignment (minimap2 -t 8 -ax splice -uf -k14 ${transcriptome} ${workspace}basecalls_reversed.fa > ${workspace}basecalls.sam). A threshold was then applied (processing the basecalls.hdf5 file) to obtain the modified-base coordinates on each read.
4. Convert the coordinates of modified bases from read-origin to transcriptome-origin (processing the CIGAR in the BAM file using pysam's r.get_reference_positions).
5. Calculate how many of the 2000 transcriptome modification sites are covered by called modified bases. I found that only about 6% of positions were recalled. I don't know whether the problem is my test method or the training itself.
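Steps 4 and 5 above can be sketched in plain Python. This is only a minimal sketch: `ref_positions` stands in for the list that pysam's `r.get_reference_positions(full_length=True)` returns (entry *i* is the reference position aligned to read base *i*, or `None` for clipped/inserted bases), and all variable names and toy values are hypothetical.

```python
def lift_mod_coords(mod_read_positions, ref_positions):
    """Step 4: map read-origin modified-base coordinates to reference
    coordinates, dropping bases with no reference alignment (None)."""
    return [ref_positions[p] for p in mod_read_positions
            if p < len(ref_positions) and ref_positions[p] is not None]

def site_recall(known_sites, called_sites):
    """Step 5: fraction of known modification sites recovered by the calls.
    Sites are (transcript_id, reference_position) tuples."""
    known = set(known_sites)
    return len(known & set(called_sites)) / len(known)

# Toy example: read with a 2-base insertion at read positions 3-4
ref_positions = [100, 101, 102, None, None, 103, 104, 105]
print(lift_mod_coords([1, 3, 6], ref_positions))  # -> [101, 104]

known = {("tx1", 101), ("tx1", 104), ("tx2", 7)}
calls = {("tx1", 101), ("tx1", 104)}
print(round(site_recall(known, calls), 2))  # -> 0.67
```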
Unfortunately, this downstream analysis software (https://github.com/nanoporetech/megalodon) does not support RNA data. So does downstream analysis for taiyaki currently require users to do it themselves?
Supplementary question: does the decline of the loss curve indicate that my training is effective and correct?
Hi: I found that there is a big difference between the original sequence (generated during GridION sequencing) and the sequence generated by the modified base model. Here is an example.
'read_id','original_seq','predict_seq','original_identity','identity','original_align_trans','original_align_start','original_align_end','predict_align_trans','predict_align_start','predict_align_end','predict_identity'
5ec71e76-bd51-457d-b1f1-ad551ddae980 ACUACGAGGGUACCCGGGUAAUGCCGCAGGAAUUCUUUGGCCGCAGAACUUCCUUUCAAAGGCUAUUGCUACGGCGGAAAUUAGCUAAUGGUUACAGAAAGAAGCCCAUCCCCUCGGCCCGAAUAGUUUAUCGAAAUCGAGAAGGGCAGUGGAAAUUUCGACGGUCCUGACCGCAUCUCCGACUCCAUCCUUCCCCUGCACUCCUAACAGCAGCAUGUCCAAGAGCCCAUACCUGUCACUUUAGCCACUGGAGCUUCCACCGUCCCUGGCCUCUAAACCAUUGUAUCAUGGCUUCGCGACAAUUAAUGUGCCAAUACAUUCUUACGAGAGUUGGCCUUCAUUCUACCGAGUUUGAUAUGUACUUAUGCUGUUAGUGGUAGUUUAAAGCAUACCACCAAGGUACUGGCUUUGAUUUGUAUUAACCCUAGCAAGCCAUCUUAGCCAGGCUAUUGUGGAAAAGUCAACACCAUGAAGUUUGGAUGAUUGUUGCUGGAUCUUUUUCGGCUAAAUACAAGCUCCACGAAAUAGUCCGACGUCAGCCAAAUUCUUUGGAGCUAUCAAGGCUGCAAUGGCCUACUUCGCCUUAAUAAAACACCGACCAAUAUUUGAAAUCCCCCGUUGAAGCUGAAGUCAGGAUCAUUCCACUAUGCAAAACGUUGUUCCAUGAUCGACGCAGCACAAUUGGUAAUAGACGAGUCAAGGAUAUGUAAGUGGUAAAACUACGGCUGUGAUAUUCGCAUCGUGGACUUUGAGAAUCCCAAAGAAAAUGAGAAGAACCUUCUGCUCAUGGCGGUAAAGCCUAUGCAUGGCUAAGGCUGAUGACGAAAACUGAUGGUCAGUUGUUAUUUUGCCUCAGAAUGGUUGGUCACUAAUGCAGAAUGGAGCAACCUCGGCGGCUGUUAAAUCGUUUGCGCAUUGAUUCAUAUCACCACACUAUGGGUGACAUGUGGUGGCGAACUGUUAUGGUUUUUGUCCCCAGAGUUCUAAACAAGUCUAAUCGAAGAUAUGUUACGAAUCGUCUUUGGCAAAAAGAUUUUCGAGUGUCGAUUAAGCAGAUUCUCGUGGUCCAUUACUCCUUACUUUUUCCCAAGGUAUUGCGUCGCACUCGCGAUAACAUAAACCCUGAGAACAUUUUGAUCUCUUGGGUAUGGUGUUUUUAGAAUUCAUGGGGAUUUUUGGCGCUAUGGAUGUUUGUGCAAUCCCAGCGAUGAUAACUGUUGUGGAAUCAAAGAGUUAUGGGUAUUUUGGUGACCCCCCUCGGCUCCGUGAAAUUCGACUUCCAGGCUUUACAAUCCUUUUAUUGAUAUGACACACCUUAAUUAUCGAAUUCUACAUUGUCAUAUUUUUCGCCACGUUUCCCUUCCGCUAAAGCUUUUUCACGAGAACACCAGUUGGUACGCAAAAUUGUAAAGCUUGGAGAGAGUUUACAAUCUCAUUUGUGCACGUCCAAAAUAUUCGUAUUAAGAUCGCCAUUUUAAACCUGAUCGCAAUGACCUUUUUCCGAAUGCAGCAGGUUGUCUUGUCAUGGCAAAUCCCCAUUGCCAAACAAAGAAUUACUGCCCCCAAGAGAAGCUUUAAUCUCCCUUUGUCCAAGAAAUUGGCGCUGCUUUGUCGAUGAUUGGUUUAUGAUAUCCAGAGUGCCUUGAAACGAGCCAAAUCCUCCCGCUUCAGAGGACUAUCUACGCCACAAUUCAUGUUGAGAUUUUCGUAAUGGCUUGCUUCUUUGUUUUGAAGACGAUUAGUGUUUCAGAUUUUUUUUUAAUUAGCCAUUUUCUUGGUUGUUUGAAUUCAUCAACUCCGCUGCGCAUACUCCUCUCUUUUACCUUAUACUAUUAAUUUUACGACCGAAUUUCGAAAUGAUGUGUGUGUUUUGGCUGCCCCUUCGAAUGGUUACACUUCGACCCGAAGCUAUUUUAUUUGGAACAUGCAAGGCCCGAAAAAAUAAAAACGUCCAAGUUGGACU
GACUUGCAAAUUAUAUCUACAAAAAGAAGAUGCCGCAUUGUUUUUUGUAUUCGAAUUUAGUUGAUGUAAUUUUUUAAUUUUUUAAAAUUCCAUCUUCCAUAAUUGCUACUUAUGCUUAAAUUUGGGACAAGCUACGUAUUGUAACGGCAUAGUGAGGAAAAUGGCCUUGUUUAAUAAUUUUUUCAGAGCAUUUUUUAUCAUAUUUGUAAUAAAGUCAGUAACGGCUUCGACUAAUCUACUGUUUGCGCUUACCUUAAAUAU ACCAAGTACCCGCGGTAATGCCAAGGATTTTTGGCCGCAGAACTCACCTTCAAGGCTGCTACTCGGCGGAAATTAATAATGGTTACAGAAAGAAGCCCACTCCCTTGCCCGAATAGTTTATTTGGATCGAGAAGGCAGTAATGGAAGCTATTCGACGGTCCTGACCGCATCCGACAACCATCCTCCCTACTGGCACTCCCTCATTGCTGCATGTCCAAGAGCCCATTTCGTCATTTAGCCACTGAGCTTCCTACCGTCCTGGCCTCTAAACTCATTAGTATCATCGTGTTCGCGACAATTATGCCAATACATTCTTACGAGAGTTGGCCTCATTCACTACCGAGTTTGATATGTTTACCTATGCTGTTAGTGGTAGTTTAAGCTCTCCTCCAAGGTACTGGCTTTGATTGTATTAACCCTGCAAGCCATTTAGCCAGCTATTGGAAAGTCATCCATGAAGTTTAGATGATAATGTTGGATTTTTCGGCTAATACAAGCTCCAAATAGTCCGACGTCAGCCAAATATTCTTGAGGCATCCAAGGCTGCAATGGCCCCTACTACTTCGCTCCTATAAAACACCGACCAATATTCTGAAATCCCCGCAAGCTGAAGTCAGGATTTATTCCACCATATGCAAAACGCGTTGTTCTCGATTGAGCGCTGAATACAAATTGTTGACGAGTCAAGGATATGACTGGTGGTGCTACGCTGTGATTCGCTCGTGGCTCTTCGTGAGAATCCCAAAGAAAATGAGAAGAACCTTCGCTTTTTGCAGTAAAGCCTATCGTGTTAAGGCTGACGACGAAAACTGATGGTCAGTGCTGCTATTATTTGCCTCAGAATGGTTGGTTCAATGCAGAATGGAGCATCCTAATGTTGTTAAATCGTATGATTTGTGCATTGATTCTCATATTCCTCTCTATAGTGACACGTGGTGCGCTGTTATGGTTCGTCCCAGAGGATCTGCTAAGTCTATCGAAGATATGACACGATCGTCTGGCAAAAGATTTTAGTGTCGATTAAGCAGATTCTTCGTGGTCCTCAACTCCATTCCCAAGGTATTGTGCGCATTCGCGATATTATGACCTGAGAACATTTTGATCTTGAGTATGGTGTTTTTAAGAATTATGGATTTTTGGCGCATGTGATGTTGTGCAATCCCGGTGACGATATTATGTCTGTGGAATCAAAGAGTATGGGTATTTGGTAGTGACCCCTCGAATACCTGAGCTTTGACACCAGGCTTTACAATGCCTTTTTGCTGATATGTGGTCTACCGCTTTATGGTTACATTGTCTATATTTCGCACATACCTTTCCGCAAAGCTTTCAGAACGACCAGCTTGTACGCTAATTGTAAAGCTTGGAGAGTCAATCTCATTTGCGATGTCCAAAATATTTCGTATATCAAAGATTTGCCATATTTTAAACCAGTCAATGACCTTTCCGATGCTATGGGTTGTTTTTCTGTTCTGGCAAATCCCCATACTGCCGAACAAAGAATTACTGCCCAAGAAGCTTTAATCTTCCCTTTGTCCAAGAAATTGAATGCTGCTTCAATTGATGATTGGTCATGCACTATGATGCTCCAGAGGTGCCTTGAAAACGAAGCAAATCCTCCCTTCGCTTCAGAAGTTATCTACGCCACAATACATGTTGAGATTTTCGTAATGGCTTGCTCTTGTTTGAAGACGATTAGTGTCACGAGATTTTTAATAGCCATTTCTTGGTTGTTTGA
TTATCATCAATGCACGTTGTGTGCATACTCCTTCTTTACTCTCCTTATACTAACTATTTCATGTACGATCTTCGAAATGATGTGTGTTTTGGCTGCCCTCTGCTGGGTTCTCACTTTTCGACCTAAGCTATTTATTTGGAACCGCAAGGCCAAAAATAAAACGTCCAAAGTTGACTGACTTCGCAAATTATACAAAAAGAAGATGCCGCATTGTTTTGTATTTGCTTAGTTGATGTAATTTTTAATTCTTGAATTCCATTTCCATGCTGCATTTATGCTTAGCTACTGATGCTACGTAATATATGGGCCAACAGTGAGGAAAATGGCCTTGTTTAATAATTTCAGAGCATTTCTTTATCATATTTTTGTAATAAAAGTCAGTAACGGCTTCGACTGCTTACTGTTAAGTATACCTTGATCCTCCCTGCCCCATCATTCGCTTCATTATCTATCCATCCCTCATCATCACCAAATTCCA 84.92208982584785 59.14414414414414 NM_001023527.2 2894 5076 ('NM_001023527.2',) 2864 5062 81.74807197943444
In my opinion, because the sequence composition changes, the alignment information changes too, so it is hard to successfully recover the known positive sites in the input.
Loss in mapped squiggle for RNA reads is also compounded by the polyA being absent from the guppy-called fastq.
> Supplementary question: does the decline of the loss curve indicate that my training is effective and correct?
I guess the loss curve is from the second round of training? It looks very similar to mine. I see greater data loss during the first round, which produces a curve very similar to the one produced by the walkthrough_modbase exercise.
> Loss in mapped squiggle for RNA reads is also compounded by the polyA being absent from the guppy-called fastq.
So you have encountered a similar situation, my friend? :)
> I guess the loss curve is from the second round of training? This looks very similar to mine. I see greater data loss during the first round and produces a curve very similar to the one produced with the walkthrough_modbase exercise.
Yep, this is from the second round.
Sorry for the slow response! I hope I can help you to diagnose the problem.
> Yep, this is from the second round.
Ok, good. If this was the first round of training I would say something looks wrong, but for refining a model this looks fine. The plots showing the mapped reads also look ok. Let's assume the training is ok for now and check the basecalling.
> I found that there was a big difference between the original sequence...
The example you gave has an accuracy of ~77%, which is... not great. One possibility is that the default chunk size used by basecall.py is too short for RNA.
Could you try basecall.py --chunk_size 12000 ... and see if that makes a difference? Perhaps you could post the new basecall for that same read for comparison?
Thanks a lot! I will try altering the parameter and evaluating the results. Keep in touch.
Update:
After updating the parameters for basecall.py, here is the new result for the same UUID:
5ec71e76-bd51-457d-b1f1-ad551ddae980 ACUACGAGGGUACCCGGGUAAUGCCGCAGGAAUUCUUUGGCCGCAGAACUUCCUUUCAAAGGCUAUUGCUACGGCGGAAAUUAGCUAAUGGUUACAGAAAGAAGCCCAUCCCCUCGGCCCGAAUAGUUUAUCGAAAUCGAGAAGGGCAGUGGAAAUUUCGACGGUCCUGACCGCAUCUCCGACUCCAUCCUUCCCCUGCACUCCUAACAGCAGCAUGUCCAAGAGCCCAUACCUGUCACUUUAGCCACUGGAGCUUCCACCGUCCCUGGCCUCUAAACCAUUGUAUCAUGGCUUCGCGACAAUUAAUGUGCCAAUACAUUCUUACGAGAGUUGGCCUUCAUUCUACCGAGUUUGAUAUGUACUUAUGCUGUUAGUGGUAGUUUAAAGCAUACCACCAAGGUACUGGCUUUGAUUUGUAUUAACCCUAGCAAGCCAUCUUAGCCAGGCUAUUGUGGAAAAGUCAACACCAUGAAGUUUGGAUGAUUGUUGCUGGAUCUUUUUCGGCUAAAUACAAGCUCCACGAAAUAGUCCGACGUCAGCCAAAUUCUUUGGAGCUAUCAAGGCUGCAAUGGCCUACUUCGCCUUAAUAAAACACCGACCAAUAUUUGAAAUCCCCCGUUGAAGCUGAAGUCAGGAUCAUUCCACUAUGCAAAACGUUGUUCCAUGAUCGACGCAGCACAAUUGGUAAUAGACGAGUCAAGGAUAUGUAAGUGGUAAAACUACGGCUGUGAUAUUCGCAUCGUGGACUUUGAGAAUCCCAAAGAAAAUGAGAAGAACCUUCUGCUCAUGGCGGUAAAGCCUAUGCAUGGCUAAGGCUGAUGACGAAAACUGAUGGUCAGUUGUUAUUUUGCCUCAGAAUGGUUGGUCACUAAUGCAGAAUGGAGCAACCUCGGCGGCUGUUAAAUCGUUUGCGCAUUGAUUCAUAUCACCACACUAUGGGUGACAUGUGGUGGCGAACUGUUAUGGUUUUUGUCCCCAGAGUUCUAAACAAGUCUAAUCGAAGAUAUGUUACGAAUCGUCUUUGGCAAAAAGAUUUUCGAGUGUCGAUUAAGCAGAUUCUCGUGGUCCAUUACUCCUUACUUUUUCCCAAGGUAUUGCGUCGCACUCGCGAUAACAUAAACCCUGAGAACAUUUUGAUCUCUUGGGUAUGGUGUUUUUAGAAUUCAUGGGGAUUUUUGGCGCUAUGGAUGUUUGUGCAAUCCCAGCGAUGAUAACUGUUGUGGAAUCAAAGAGUUAUGGGUAUUUUGGUGACCCCCCUCGGCUCCGUGAAAUUCGACUUCCAGGCUUUACAAUCCUUUUAUUGAUAUGACACACCUUAAUUAUCGAAUUCUACAUUGUCAUAUUUUUCGCCACGUUUCCCUUCCGCUAAAGCUUUUUCACGAGAACACCAGUUGGUACGCAAAAUUGUAAAGCUUGGAGAGAGUUUACAAUCUCAUUUGUGCACGUCCAAAAUAUUCGUAUUAAGAUCGCCAUUUUAAACCUGAUCGCAAUGACCUUUUUCCGAAUGCAGCAGGUUGUCUUGUCAUGGCAAAUCCCCAUUGCCAAACAAAGAAUUACUGCCCCCAAGAGAAGCUUUAAUCUCCCUUUGUCCAAGAAAUUGGCGCUGCUUUGUCGAUGAUUGGUUUAUGAUAUCCAGAGUGCCUUGAAACGAGCCAAAUCCUCCCGCUUCAGAGGACUAUCUACGCCACAAUUCAUGUUGAGAUUUUCGUAAUGGCUUGCUUCUUUGUUUUGAAGACGAUUAGUGUUUCAGAUUUUUUUUUAAUUAGCCAUUUUCUUGGUUGUUUGAAUUCAUCAACUCCGCUGCGCAUACUCCUCUCUUUUACCUUAUACUAUUAAUUUUACGACCGAAUUUCGAAAUGAUGUGUGUGUUUUGGCUGCCCCUUCGAAUGGUUACACUUCGACCCGAAGCUAUUUUAUUUGGAACAUGCAAGGCCCGAAAAAAUAAAAACGUCCAAGUUGGACU
GACUUGCAAAUUAUAUCUACAAAAAGAAGAUGCCGCAUUGUUUUUUGUAUUCGAAUUUAGUUGAUGUAAUUUUUUAAUUUUUUAAAAUUCCAUCUUCCAUAAUUGCUACUUAUGCUUAAAUUUGGGACAAGCUACGUAUUGUAACGGCAUAGUGAGGAAAAUGGCCUUGUUUAAUAAUUUUUUCAGAGCAUUUUUUAUCAUAUUUGUAAUAAAGUCAGUAACGGCUUCGACUAAUCUACUGUUUGCGCUUACCUUAAAUAU ACCAAGTCTCCATGGTAATGCCAAGAATTTTCTTGATCGCAGAACTTCCTTCAAGGCTAATTACATGGCGGAAATTAATCAATGGTTACAGAAAGAAGCCCTCTCCCTTGCCCGAATAGTTTATTGCTGAGAAGGCGGTAATGGAAGCTATTCGACGGTCCATTGACCGCATCTCCGACTCCTCCCTGCACTCCTACTGCATGTCCAAGAGCCCATTCTGTCATTTAGCCACTACTGGAGCTTCCTACCGTCCTGGCCTTCTAAACATTAGTATCGTGTTCGCGACAATTATGCCAATCATCACACGAGAGTTGGCCATTCACTACCGAGTTTGATATGTTTACCATGCTGTTAGTGGTAGTTTAAGCTCTCCTCCAAGGTACTGGCTTTTGATTGTATTAACCCTGCAAATCCATTTAGCCAGGCTATTGGAAAGTCATCCATGAAGTTTGGATGATAATGTTGGATTTCGGCTAATACAAGCTCCAAATAGTCCGACGTCAGCCAAATATTTTGAGGCATCCAAGGCTGCAATGGCCCTACTACTTATCCTATAAAACACCGACCAATATTTCTGAATCCCCGCAGCTGAAGTCAGGATTTATTCCATATGCAAAACGCGTTGTTCCTCGATTGAGCGTAGAATCAATTGGTTGATGAGACCAAGGATATGTAGTGGTGCTACGGCTGTGATACGTTTGGACTCTTGAGAATCCCAAAGAATGAGAAGAACCTTCGCTTTGCGGTAAAGCCTATCGTCGTTAAGGCTGATGACGAAACTGATGGTCAGTATTGCTAAACTTGCCTCCAGAATGGTTGGTTCAATGCAGAATGGAGCATCCTAATGTTGTTAAATCGTTGATTTGTGCATTGATTCTCATTATCTTCCTCTCTATAGTGACACGTGGTGCGCTGTTATGGTTCGTCCCAGAGGGATCTGCTAAGTCTAATCGAAGATATGACACGATCGTCTTGGCAAAAAGATTTCGAGTGTCGATTAAGCAGATTTCGTGGTCTCATTACTCCACATTCCCAAGGTATTGTGCGCATTCGCGATATTAAACCTGAGAACATTTTGATTTAAGTATGGTGTTTTAGAATTACTGATTCTTGGCGCTGTGATGTTTTGTGCAATCCCGGTGACGATATTACTGTTGTGGAATCAAAGAGTATGGGTATTTGTAGTGACCCCTTCGGCTCCTGAATTTGACAAGGCTCTTACAATGCCTTTTTGCTGATATGTGATCTACCGCTATCGATTTTACATTGTCTATATTTCGTCATACCTTTCCGCAAAGCTTTCAGAACGACCAGTTGTACGCAATTGTAAAGCTTGGAGAGTACAATCTCATTTGCGATGTCCAAAATATTCGTATATCAAAGACTTGCCATATTTTAAACCAGTTTATAATGACCTTTCCGATGCTATGCAGCAGTTGTTTCTGTTTGGCAAATCCCATTGCCGAACAAAGAATTACTGCCCAAGAAGCTTTAATCTTCCCTTTGTCCAAGAAATTGAATGCTGTTTTGTCGTCGATTGTACACTATGATGCTCCAGAGGTGCCTTGAATGGGCAAATCCTCCTCGCTTCAGAAGTTATCTACGCCACAATCACTCGTTGAGATTTCGTAATGGCTTGCTTTGTTTGAAGACGATTAGTGTACGAGATTTTTAATAGCCATTTCTTGGTTGTTTGATTTTTATCATCAATGCCACGTTGTGTGCATACTCCTCTCTT
CTCTCCTTATACTAACTAATTTCATGTACGATTTTCGAAATGATGTGTGTTTTGTTGCCCTTTGCTGGTTACACTTTCGACCTAAGCTATTTATTTGGAACCAAGGCCAAAAAAATAAAAACGTCCAAAGTTGACTGACTTAATAAATTATATACAAAAAGAAGATGCCGCATTGTTTTGTATTTGCTTAGTTGATGTAATTTTTTAATTCTTTAAATTCCATTTCCATTGCTGCATTTATGCTTAGCTACTAGTCAAGCTACGCAATATATGAAAATGCATAGTGAGGAATGGCCTTGTTTAATAATTTCAGAGCATTTTTATCATATTGTAATAAAAGTTCAGTAACGGCTTCGACTGCTTACTGTTTAGCGA 84.92208982584785 59.30366612866036 NM_001023527.2 2894 5076 ('NM_001023527.2',) 2849 5063 78.09798270893371
Uh, it seems to be getting worse...
On the other hand, I noticed a conceptual error in my custom script: cond_logprobs is ordered 3' to 5', but the list returned by get_reference_positions is 5' to 3'. I fixed it, and the recall rate is now 47.7% with the threshold set at 0.2.
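For anyone hitting the same orientation mismatch, the fix is a single reversal before pairing the probabilities with the reference positions. The variable names and values below are hypothetical stand-ins for the per-base arrays extracted from basecalls.hdf5:

```python
# Per-base modified-base log probabilities as extracted from basecalls.hdf5,
# running 3' -> 5' for RNA (hypothetical toy values).
cond_logprobs_3to5 = [-0.1, -2.3, -0.05, -4.0]

# get_reference_positions() returns coordinates 5' -> 3', so reverse the
# probability array once before zipping the two together.
cond_logprobs_5to3 = cond_logprobs_3to5[::-1]
print(cond_logprobs_5to3)  # -> [-4.0, -0.05, -2.3, -0.1]
```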
Would it be possible to produce a ROC or precision-recall curve for this threshold? While in theory these values correspond to log probabilities, in practice they often require calibration. See some discussion of this topic in the megalodon docs.
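Such a curve comes from sweeping the threshold and computing one precision/recall point per value. A minimal pure-Python sketch (uncalibrated, hypothetical toy data):

```python
def precision_recall_at(threshold, probs, labels):
    """Precision/recall when sites with mod probability >= threshold
    are called modified. labels: 1 = known modified site, 0 = unmodified."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

probs = [0.9, 0.8, 0.6, 0.3, 0.1]   # per-site modified probabilities (toy)
labels = [1, 1, 0, 1, 0]            # ground truth from the known sites (toy)
for t in (0.2, 0.5, 0.7):
    print(t, precision_recall_at(t, probs, labels))
```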
@marcus1487 Thank you for your advice! I'll read the megalodon docs later.
Update:
After strictly controlling the threshold (min_coverage at 90%), I get a curve that looks like this:
Obviously, my model still has a lot of room for improvement. @marcus1487 @myrtlecat, could you give me some suggestions for improvement? Thanks!
A number of improvements to the training scripts have been implemented in the latest release. With it, I have found that a --mod-factor 1 from the beginning of training performs better than the previous recommendations. Additionally, the starting learning rate may need to be lowered a bit for modified base training. I haven't done enough testing on how this interacts with RNA training, so that is more of an open research question at this point. Hopefully this helps in your research here.