how to interpret the results from "remora infer from_taiyaki_mapped_signal"

Open jon-xu opened this issue 2 years ago • 8 comments

Thanks for the great tool!

Just wondering how the "read_pos" values in the results from "remora infer from_taiyaki_mapped_signal" are chosen: which positions of a read are shown in the result, please? If only modified positions are shown, why do we see different class_pred values? Also the read_pos is the relative position within the read but not the position

And for "class_pred", 1 means modified and 0 means unmodified, is that right?

What's the meaning of "label", please?

Thanks! Jon

jon-xu avatar May 19 '22 04:05 jon-xu

Read position is the position within the taiyaki read sequence. The label is defined in the same way as the taiyaki mapped signal file. See the taiyaki alphabet for definition of the labels. For a standard modified base dataset with a single modified base, label 0 is canonical and 1 is modified.
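As an illustration of the label convention described above, here is a minimal Python sketch that tallies how often each label was predicted as each class. It is not taken from the remora codebase: the field names `read_pos`, `label` and `class_pred` come from this thread, while the row layout, the `LABEL_NAMES` mapping and the `summarize` helper are assumptions made for the example.

```python
from collections import Counter

# Assumed meaning of the integer codes for a dataset with a single
# modified base, following the convention described above.
LABEL_NAMES = {0: "canonical", 1: "modified"}


def summarize(rows):
    """Count (label, class_pred) pairs using readable names."""
    counts = Counter()
    for row in rows:
        truth = LABEL_NAMES.get(row["label"], "unknown")
        pred = LABEL_NAMES.get(row["class_pred"], "unknown")
        counts[(truth, pred)] += 1
    return counts


# Hypothetical rows; real values would be read from the inference output.
rows = [
    {"read_pos": 12, "label": 0, "class_pred": 0},
    {"read_pos": 40, "label": 1, "class_pred": 1},
    {"read_pos": 57, "label": 1, "class_pred": 0},
]

summary = summarize(rows)
for (truth, pred), n in sorted(summary.items()):
    print(f"label={truth:<9} class_pred={pred:<9} n={n}")
```

Rows where `label` and `class_pred` disagree are the model's errors on labeled data; any code outside the mapping is reported as "unknown" rather than guessed.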

marcus1487 avatar May 19 '22 05:05 marcus1487

Thanks Marcus, I'll double check the read positions and let you know.

jon-xu avatar May 19 '22 23:05 jon-xu

waiting for admin to install the latest gcc on HPC...

jon-xu avatar May 31 '22 01:05 jon-xu

Hi Marcus,

In my last successful run, I set --num-reads 10000. This time the model training step ran for a week on my full dataset and only finished 4 epochs. Do you recommend parallelizing the job with more CPUs? Or do you recommend downsampling the data instead, please?

Thanks, Jon

jon-xu avatar Jun 04 '22 04:06 jon-xu

I'm now running it with 32 cores, and it takes about 5 hours per epoch, which might give me 30 epochs before the job time runs out.
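For reference, the figures quoted so far can be put side by side with a few lines of Python. This is only a rough estimate: it assumes "a week" means 168 hours of wall-clock time, and the variable names are made up for the example.

```python
HOURS_PER_WEEK = 7 * 24

# Figures quoted in this thread.
epochs_in_first_week = 4       # earlier run: 4 epochs in about a week
hours_per_epoch_32_cores = 5   # current run on 32 cores
planned_epochs = 30

# Approximate per-epoch cost of the earlier run.
hours_per_epoch_before = HOURS_PER_WEEK / epochs_in_first_week

# Total wall-clock time for the planned run and the implied speedup.
total_hours_32_cores = hours_per_epoch_32_cores * planned_epochs
speedup = hours_per_epoch_before / hours_per_epoch_32_cores

print(f"earlier run: ~{hours_per_epoch_before:.0f} h/epoch")
print(f"32 cores: {total_hours_32_cores} h total ({total_hours_32_cores / 24:.2f} days)")
print(f"approximate speedup: {speedup:.1f}x")
```

Under these assumptions, 30 epochs need about 150 hours (a little over six days), against roughly 42 hours per epoch in the earlier run.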

jon-xu avatar Jun 05 '22 23:06 jon-xu

Hi Marcus, I've finished 30 epochs of training and have some inferred data for the unmodified sample. The class_pred values are 0 for all listed bases, which means they are unmodified. Then I took one read as an example and checked the listed read positions against the read in the basecalled FASTQ output, and the selected positions don't all seem to be U's, which is the base we are testing for modification... Also I see the labels are all -1, not sure what that stands for...

Could you please elaborate a bit more about these two fields? Many Thanks! Jon

jon-xu avatar Jun 14 '22 01:06 jon-xu

-1 labels mean that no label was found for that read. If inferring on a canonical taiyaki dataset this is expected.

Taiyaki read datasets are anchored to reference bases not basecalls. That is why you are seeing mismatches there.
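To make the two points above concrete, here is a small Python sketch with made-up sequences. It shows why looking up `read_pos` in a basecalled sequence can return the wrong base when the positions are anchored to the reference, and how a -1 label can be treated as "no label". The sequences, the 0-based indexing of `read_pos`, and the helper names are all assumptions for illustration, not remora or taiyaki API.

```python
NO_LABEL = -1  # as explained above: no label was found for the read


def bases_at(sequence, positions):
    """Return the base at each 0-based position, or None if out of range."""
    return [sequence[p] if 0 <= p < len(sequence) else None for p in positions]


def has_label(label):
    """True when a ground-truth label is available for the position."""
    return label != NO_LABEL


# Hypothetical sequences: the basecall is missing the first U, so every
# coordinate after that point is shifted by one relative to the reference.
reference = "ACGUACGUACGU"
basecall = "ACGACGUACGU"
read_pos = [3, 7, 11]  # assumed 0-based positions reported by inference

ref_bases = bases_at(reference, read_pos)
call_bases = bases_at(basecall, read_pos)

print("reference bases:", ref_bases)
print("basecall bases: ", call_bases)
print("labeled:", [has_label(x) for x in (NO_LABEL, 0, 1)])
```

In this toy case every position is a U in the reference, while the same indices in the basecall point at other bases or fall off the end of the sequence.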

Finally, judging from the U canonical base, I'm guessing this might be an RNA modified base. Remora is not compatible with RNA data (or any data with signal in the 3' to 5' sequencing direction), so results from this data may not behave as expected.

marcus1487 avatar Jun 14 '22 01:06 marcus1487

Thanks Marcus! Yes, it's an RNA modified base. Interestingly, we can still detect modified bases using Remora: the inference result on the signal mapping file of the modified sample consists mostly of modified versions. But I'm not sure what the label "1" stands for...

Are there any other tools/pipelines you would suggest for training a model to detect RNA modifications, please?

jon-xu avatar Jun 14 '22 01:06 jon-xu

This output has been replaced by the new remora infer and validate commands, which work directly from pod5 and bam. Unfortunately RNA support has been implicitly removed with the 2.0 release. We are working to add this support back. Hopefully the output of the new commands is more self-explanatory.

marcus1487 avatar Dec 21 '22 17:12 marcus1487

Follow updates on RNA support here: #48

marcus1487 avatar Dec 21 '22 17:12 marcus1487