Remora re-squiggle (refine_signal_map.SigMapRefiner) on RNA data

Open mem3nto0 opened this issue 3 months ago • 4 comments

Dear ONT team,

I am trying to re-squiggle my RNA data using Remora after basecalling the data with Dorado. Comparing the analyses from Tombo and Remora on the same dataset, I obtain different results. The notebook page of Remora says that the Remora setup is tested and adjusted for DNA Kit 14 9-mer, but gives no specification for RNA data. Should it work the same?

Additionally, the analysis on synthetic RNA data (where I know the modification position) shows consistent results using Tombo, while with Remora the results don't converge at the modification position.

My questions are: Is there any setup for Remora that is more suitable for RNA data? If not, is there any pipeline for basecalling with Dorado and then using Tombo?

Thank you for your time and attention. Kind regards

mem3nto0 avatar Mar 27 '24 11:03 mem3nto0

The k-mer models for RNA need to be used. These can be found in the kmer_models repository as noted in the README. The reverse_signal flag should also be set where applicable for RNA reads, as the signal proceeds from the 3' to the 5' end of the RNA. Finally, the latest Dorado release (v0.5.3) should be used, as there were some bugs related to RNA trimming/splitting in previous Dorado versions affecting the move table and thus Remora analyses. Some additional bug fixes are coming in the next Remora release to handle some edge cases in move table parsing, but the vast majority of reads should be handled correctly with the latest Dorado and Remora releases.
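
As a conceptual illustration of why reverse_signal matters (plain Python, not Remora's implementation): in direct RNA sequencing the raw samples arrive 3' end first, the opposite order to the 5'-to-3' basecalled sequence. The helper name `orient_signal` and the toy current levels below are hypothetical.

```python
def orient_signal(samples, reverse_signal):
    """Return samples ordered to match the 5'->3' basecalled sequence.

    Hypothetical helper for illustration only; in practice the
    reverse_signal flag mentioned above is the mechanism to use.
    """
    samples = list(samples)
    return samples[::-1] if reverse_signal else samples


# Toy read: sequence 5'-ACGU-3', one made-up current level per base.
levels_by_base = {"A": 1.0, "C": 2.0, "G": 3.0, "U": 4.0}
sequence = "ACGU"

# RNA signal proceeds 3'->5', so the pore sees U first and A last.
raw_rna_signal = [levels_by_base[base] for base in reversed(sequence)]

# After reversal, sample i lines up with base i of the 5'->3' sequence.
aligned = orient_signal(raw_rna_signal, reverse_signal=True)
expected = [levels_by_base[base] for base in sequence]
print(aligned == expected)
```

If the signal is left unreversed, every level is compared against the wrong end of the sequence, which is one way a re-squiggle can fail to converge on a known position.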

marcus1487 avatar Mar 28 '24 15:03 marcus1487

Hi @marcus1487, we talked together with you and Logan a few days ago.

Sorry for hijacking the issue (please let me know if I should open a new one), but I just stumbled on a bug related to the move table when resquiggling RNA004 data and wanted to make sure it is the same thing that you're preparing a fix for.

We re-basecalled our data with Dorado v0.5.3+d9af343 and I just tried resquiggling it again with Remora, but I'm getting a:

remora.RemoraError: Move table discordant with signal                         

So I have a few questions:

  1. Is this likely to be resolved with the upcoming fixes in the next release?
  2. When do you expect the release to be published?
  3. Can you suggest a workaround that we can do in the meantime? I tried changing the code directly so that I pass missing_ok=True to the get_io_reads function call here and it seems to resolve the issue (I guess we're just losing some problematic reads). Do you think that this approach is okay?

Thanks, Mihail

mzdravkov avatar Mar 28 '24 16:03 mzdravkov

@marcus1487 Thank you for the reply,

When I analyzed the data, I had already set reverse_signal=True and chosen the suggested k-mer model for RNA (rna_r9.4_180mv_70bps). I saw that my Dorado is not up to date, and I will check again with the new version.

But I would still like to ask about a few elements in Remora. The software allows changing the "sd_params", which are parameters designed for the re-squiggle. The Remora README says that the preset values are tested for DNA. Can they also be used for RNA?

Additionally, changing the do_rough_rescale, scale_iters, and do_fix_guage settings in "refine_signal_map.SigMapRefiner" changes the re-squiggle significantly. Are there specific settings to use for sd_params, do_rough_rescale, scale_iters, and do_fix_guage when analyzing RNA data?

Thank you for your time and attention. Kind regards.

mem3nto0 avatar Apr 01 '24 10:04 mem3nto0

@mzdravkov

  1. Yes, this is likely to be resolved in the next release.
  2. I do not have a concrete timeline for this release, but I hope to have it out in the next couple of weeks after some stress testing of other features. I could possibly look at a pre-release pushed to GitHub, but not tagged/published as a release.
  3. There is no good workaround. There were some incorrect assumptions concerning some of the tags around signal trimming and splitting, which have been resolved. This problem seems to be larger for RNA runs, and quite rare in DNA runs.
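
For readers hitting the "Move table discordant with signal" error discussed in this thread, here is a rough sketch of the kind of consistency check that can flag affected reads before resquiggling. It is not Remora's actual code. It assumes the move table is laid out as in Dorado's `mv` BAM tag (first value is the stride, the remaining values are 0/1 moves, one per stride-sized block of signal) and that `num_samples` is the number of signal samples remaining after trimming. The function name and the toy values are hypothetical.

```python
def check_move_table(mv_tag, num_samples, seq_len):
    """Return a list of human-readable problems (empty if consistent).

    mv_tag      -- sequence whose first item is the stride, followed by
                   one 0/1 move per stride-sized block of signal
    num_samples -- signal samples covered by the move table (assumed to
                   be counted after any trimming)
    seq_len     -- number of basecalled bases in the read
    """
    stride, moves = mv_tag[0], list(mv_tag[1:])
    problems = []

    # Each move entry spans `stride` samples, so the table should tile
    # the (trimmed) signal exactly.
    covered = stride * len(moves)
    if covered != num_samples:
        problems.append(
            f"move table covers {covered} samples, signal has {num_samples}"
        )

    # Each 1 in the move table marks the start of a new base.
    if sum(moves) != seq_len:
        problems.append(
            f"move table implies {sum(moves)} bases, sequence has {seq_len}"
        )
    return problems


# Toy read: stride 5, four blocks (20 samples), three bases.
print(check_move_table([5, 1, 0, 1, 1], num_samples=20, seq_len=3))
# Same table against a signal that was trimmed or split differently.
print(check_move_table([5, 1, 0, 1, 1], num_samples=25, seq_len=3))
```

A check like this only identifies reads to inspect or set aside. It does not repair them, and the exact tag semantics should be confirmed against the Dorado version in use.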

@mem3nto0 The sd_params generally work quite well for RNA in our hands. These parameters control the short dwell penalty. Specifying a longer array will increase runtime (potentially significantly for much larger values), but may provide some marginal benefits for slower speed runs (especially in RNA). We have not extensively tested this and thus recommend the default values for this parameter, which do work quite well.

I have indeed found that increasing the scale_iters value to >0 can cause some interesting edge cases (e.g. scaling signal down to 0). I would strongly suggest leaving this value set to 0 for almost all signal metric extraction settings. I've considered making this parameter boolean (essentially between -1 and 0), but it seems that the functionality may be useful at some point, so I have left it for now. For signal plotting/signal metric extraction I would suggest using do_rough_rescale=True and do_fix_guage=True for almost all cases.
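
The recommendations above can be collected into a small configuration sketch. The do_rough_rescale, scale_iters, and do_fix_guage names come from this thread; `kmer_model_filename` is assumed from the Remora README and should be checked against the installed Remora version. The helper names and the model path are hypothetical, and the Remora import is deferred so the snippet can be read and tested without Remora installed.

```python
def recommended_refiner_kwargs(kmer_model_path):
    """Keyword arguments reflecting the settings suggested in this thread."""
    return {
        # Path to the RNA k-mer level table from the kmer_models repository
        # (argument name assumed from the Remora README).
        "kmer_model_filename": kmer_model_path,
        "do_rough_rescale": True,
        # Values > 0 were reported to cause edge cases such as the signal
        # being scaled down to 0, so keep this at 0.
        "scale_iters": 0,
        # Spelled "guage" on purpose: that is the parameter name used
        # throughout this thread.
        "do_fix_guage": True,
        # sd_params is deliberately omitted so the defaults are used.
    }


def build_refiner(kmer_model_path):
    """Construct a SigMapRefiner; requires Remora and a real model file."""
    from remora import refine_signal_map  # deferred import

    return refine_signal_map.SigMapRefiner(
        **recommended_refiner_kwargs(kmer_model_path)
    )


# Hypothetical path; substitute the actual RNA model file.
settings = recommended_refiner_kwargs("path/to/rna_kmer_levels.txt")
print(settings)
```

Keeping the settings in one place makes it easier to confirm that the same values are used for both plotting and metric extraction.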

marcus1487 avatar Apr 02 '24 20:04 marcus1487