Dorado Read Splitting Logic is broken (for RNA2)

Open patbohn opened this issue 1 year ago • 3 comments

When basecalling RNA2 data, dorado 0.5.3 performs automatic read splitting via an open pore detection algorithm.

This algorithm is broken: it detects not only actual open pores but also short aberrant signal glitches (single-event spikes that shoot up to >150 pA, followed by a subsequent event <50 pA) that are present in many reads (with direct RNA sequencing, at least with Kit 2). As a result, it produces aberrantly short reads.
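
To make the glitch pattern concrete, here is a minimal Python sketch that flags it in a current trace. The function name `find_glitches` and the example trace are hypothetical; the 150 pA and 50 pA cut-offs are taken from the description above, and the input is assumed to already be scaled to pA. This is not dorado's code, only an illustration of the pattern.

```python
def find_glitches(signal_pa, high=150.0, low=50.0):
    """Return indices i where signal_pa[i] > high and signal_pa[i + 1] < low.

    signal_pa: sequence of current values in pA (assumed already scaled).
    high/low: illustrative cut-offs from the issue text, not dorado settings.
    """
    return [
        i
        for i in range(len(signal_pa) - 1)
        if signal_pa[i] > high and signal_pa[i + 1] < low
    ]


# Made-up trace: one spike (163 pA) immediately followed by a dip (41 pA).
trace = [82.0, 79.5, 163.0, 41.0, 80.2, 84.1]
print(find_glitches(trace))
```

A sustained rise in current, as expected for a real open pore, is not flagged, because the value after the first high one is not low.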

I cannot fully confirm in how many cases the algorithm performs aberrant splitting, but the number seems significant, as 40% of reads are split at least once by this algorithm.

I would expect the open pore algorithm to be tolerant of the signal glitches that commonly occur with (some forms of) Nanopore sequencing while accurately detecting actual joined reads.

Steps to reproduce the issue:

Basecall RNA2 data with dorado 0.5.3, identify reads that were split by the pore splitting algorithm (read_id changed from its parent id, sp tag > 0), plot raw signal of parent reads.
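
A rough sketch of the "identify split reads" step, working on SAM text so that it needs no third-party library. It assumes the parent read id is stored in a `pi:Z` tag alongside the `sp` tag mentioned above; the helper name `split_points_by_parent` and the example records are made up for illustration.

```python
from collections import defaultdict


def split_points_by_parent(sam_lines):
    """Map parent read id -> sorted split points (sp > 0) of its child reads.

    Assumes the parent id is stored in a `pi` tag and the split start in `sp`.
    """
    parents = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete SAM record
            continue
        read_id = fields[0]
        tags = {}
        for field in fields[11:]:
            parts = field.split(":", 2)  # TAG:TYPE:VALUE
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        parent = tags.get("pi")
        sp = int(tags.get("sp", 0))
        # Criterion from the steps above: id differs from parent and sp > 0.
        if parent is not None and parent != read_id and sp > 0:
            parents[parent].append(sp)
    return {parent: sorted(points) for parent, points in parents.items()}


# Hypothetical unaligned records, for illustration only.
example = [
    "@HD\tVN:1.6\tSO:unknown",
    "child-a\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tpi:Z:parent-1\tsp:i:0",
    "child-b\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tpi:Z:parent-1\tsp:i:5120",
    "whole-read\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!",
]
print(split_points_by_parent(example))
```

The returned split points can then be drawn as vertical markers on the raw signal of each parent read.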

Here are two randomly chosen reads where the aberrant split points are marked in red:

image

It seems clear that for both reads there is not actually an open pore event, as those should be followed by the characteristic adapter - poly A signal.

patbohn • Mar 26 '24 08:03

Hi @patbohn - thanks for flagging this! Yes, this is indeed a lot of false positive splits. We will look into this right away.

tijyojwad • Mar 26 '24 13:03

@tijyojwad

A small bump, with further details on the potential root cause of these glitches: in this run we used adaptive sampling and submitted a lot of reject requests. The glitches appear to be correlated with these requests.

Notably, what is unusual about these glitches is that a high sample is always immediately followed by a low sample. This would indicate to me that it is probably not spill-over of signal between channels, but that a sampling tick got delayed, so the sample before the delayed tick accumulated more charge than expected, and the one after the delayed tick acquired less charge than expected.
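
As a back-of-the-envelope illustration of that hypothesis, the sketch below uses a simplified model in which each reported sample is the charge collected in its window divided by the nominal sampling period. That model, the function name `delayed_tick_samples`, and the numbers are assumptions for illustration, not a description of the actual instrument electronics.

```python
def delayed_tick_samples(current_pa, delay_fraction):
    """Return the two sample values around a delayed sampling tick.

    current_pa: true (constant) current in pA.
    delay_fraction: tick delay as a fraction of the nominal period (0 to 1).
    The window before the tick is stretched and the one after is shortened,
    so the first sample reads high and the second reads low.
    """
    first = current_pa * (1.0 + delay_fraction)
    second = current_pa * (1.0 - delay_fraction)
    return first, second


# Hypothetical numbers: 90 pA strand current, tick delayed by 70% of a period.
high, low = delayed_tick_samples(90.0, 0.7)
print(high, low, (high + low) / 2.0)
```

With these made-up numbers the pair comes out at roughly 153 pA and 27 pA, matching the high-then-low pattern, while the mean of the two samples stays at the true current of about 90 pA.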

I suppose one potential solution here would be to evaluate not only one sample to detect the open pore event, but also look at the sample immediately following - if those samples are on average e.g. <100 pA then it's not a true open pore event, but an artifact generated by a delayed sampling tick (e.g. due to an interrupt triggered by a reject command).
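
A minimal sketch of that two-sample check in Python. The name `is_open_pore`, the 150 pA trigger level (borrowed from the glitch description, not from dorado's source), and the example traces are assumptions; only the 100 pA mean cut-off comes from the suggestion above.

```python
def is_open_pore(signal_pa, i, spike_threshold=150.0, mean_threshold=100.0):
    """Decide whether sample i looks like a true open pore event.

    A single high sample is not enough: the mean of that sample and the one
    immediately after it must also reach mean_threshold. Thresholds are
    illustrative assumptions, not dorado's actual values.
    """
    if signal_pa[i] <= spike_threshold:
        return False
    if i + 1 >= len(signal_pa):
        # No following sample to check; fall back to the single-sample test.
        return True
    pair_mean = (signal_pa[i] + signal_pa[i + 1]) / 2.0
    return pair_mean >= mean_threshold


glitch = [80.0, 160.0, 30.0, 82.0]         # high sample, then a low one
open_pore = [80.0, 210.0, 205.0, 200.0]    # current stays high

print(is_open_pore(glitch, 1))     # pair mean is 95 pA, below the cut-off
print(is_open_pore(open_pore, 1))  # pair mean is 207.5 pA, above the cut-off
```

The cost of this check is one extra sample of latency per candidate, and the cut-offs would need tuning against real RNA2 signal.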

patbohn • Jun 12 '24 10:06

I have been looking into this. So far, I have not been able to find any artefacts like this in any of our RNA2 or RNA4 data. Based on what the artefacts you have found look like, I suspect that you are probably right about it being a glitch in the sampling, possibly due to the large number of reject requests.

Is this something you have seen multiple times, and/or on multiple instruments? It would be useful for us to know which instrument(s) you have observed this on. Also, are you exclusively working with RNA2, or also RNA4? If the latter, have you seen this with RNA4 as well?

I will speak to the people working on adaptive sampling, and see if they have observed any artefacts like this when rejecting large numbers of strands.

kdolan1973 • Jun 18 '24 13:06