nanopolish icon indicating copy to clipboard operation
nanopolish copied to clipboard

RNA event alignment shows T's rather than U's in model kmer

Open Blosberg opened this issue 5 years ago • 2 comments

Hello. Thank you for your great work. I’ve found nanopolish to be an excellent research tool.

My question relates to the “Model kmer” field in the event_align output; I am trying to perform event alignment with respect to RNA data, following the instructions here.

After successfully running nanopolish eventalign, I’ve inspected the text/csv file and in the column “reference_kmer” I see various strings made up of characters A,C,G, and T (as expected). The “model_kmer” strings are also comprised of A,C,G,T, however, --but the “events” that are being “modelled” are the passage of bases A,C,G, and U (not T) through the pore -have I misunderstood that?.
I’ve checked that the fastq input files (produced by Albacore) all contain sequences comprised of A,C,G, and U, whereas the .sam files (produced by minimap2) contain “ACGT” (for the reference.)

Perhaps I’ve failed to set the appropriate option for RNA alignment, however I’ve failed to see any relevant flag from ./nanopolish eventalign –help menu, and I’m worried that I’m assuming the wrong “model” for the bases. Is this correct ? (and perhaps left this way to expedite comparison to the reference?), or should I indeed be seeing U’s in one of the “kmer” columns of the output ?

Thanks for your time.

Blosberg avatar Oct 03 '18 19:10 Blosberg

Hi @Blosberg,

For simplicity of our code we convert all Us to Ts when loading the input fastq. Now that you point this out, I realize how this can be confusing for the user. If the model_kmer field is a 5-mer rather than a 6-mer you can have confidence it selected the right model (the k-mer size is different between DNA and RNA).

I will consider transliterating the model_kmer to contain Us rather than Ts on output when processing RNA. Thanks for reporting this.

Jared

jts avatar Oct 03 '18 19:10 jts

Hi Jared. Thanks for the clarification and quick reply. I can see why leaving it this way might speed up things like regex comparisons, etc.

If the model_kmer field is a 5-mer rather than a 6-mer you can have confidence it selected the right model (the k-mer size is different between DNA and RNA).

Great, so for example when model_kmer="TTTTT" (5-Ts there), I always see model_mean=80.78, model_stdv=1.97, which I will now use as the learned, "expected" mean and stddev of current values for when 5 consecutive Uracils are in the pore.

Thanks again for developing this package! -Bren

Blosberg avatar Oct 04 '18 09:10 Blosberg