duration information of base calls
New issue checks
- [x] I did not find an existing feature request.
Dorado subcommand
Basecaller
Feature request
Is it possible to find the raw signal duration of each base call? The move table is based on the signal segmentation, and there seems to be no obvious way to find out how dorado did that segmentation.
Hi @daniel-es6,
This information is contained in the ts (trimmed samples) and ns (number of samples) tags. See https://software-docs.nanoporetech.com/dorado/latest/basecaller/sam_spec/#read-tags:
The basecalled sequence corresponds to the interval signal[ts : ns]. The move table maps to the same interval. Note that ns reflects trimming (if any) from the rear of the signal.
Thanks for getting back, this was useful. Looking at a read with a move table, I found this: "ns:i:55330 ts:i:1714". Does it mean the raw signal samples from 1714 to 55330 in the pod5 file were used? Another related question: does each move always cover the same number of raw signal samples?
@daniel-es6,
Yes, that's exactly what that means.
The move table describes the signal to base mapping. The first element is the stride - the number of samples per entry. Each entry in the move table is either a 0 (no new base) or a 1 (new base). So for a move table that looks like:
mv:B:c,5,1,0,0,1,0,1
The first entry is 5, so each move entry covers 5 samples. There are three 1s, which should match the 3 bases in the sequence. A 1 marks the entry where a new base starts, and that base extends until the next 1. The first 1 is in the first entry and is followed by two 0s, so the first base corresponds to the first 15 samples. The second base (a 1 followed by one 0) corresponds to the next 10 samples, and the last base to the final 5 samples.
See https://github.com/nanoporetech/dorado/blob/release-v1.2/dorado/utils/sequence_utils.cpp#L251 for how dorado converts a move table into a map of signal points for each base.
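For what it's worth, here is a minimal Python sketch of that conversion, applied to the example move table above. It assumes that each 1 marks the entry where a base starts and that the base runs until the next 1 (or to the end of the trimmed signal for the last base), which is how I read the linked function. The helper names `parse_mv_tag` and `base_signal_ranges` are made up for illustration and are not part of dorado.

```python
def parse_mv_tag(tag: str) -> tuple[int, list[int]]:
    """Split a SAM 'mv:B:c,<stride>,<m0>,<m1>,...' tag into (stride, moves)."""
    prefix = "mv:B:c,"
    if not tag.startswith(prefix):
        raise ValueError(f"not a mv tag: {tag!r}")
    values = [int(v) for v in tag[len(prefix):].split(",")]
    return values[0], values[1:]


def base_signal_ranges(stride, moves, ts=0, trimmed_len=None):
    """Return one (start, end) sample range per base, end exclusive.

    Assumption: a 1 marks the entry where a base starts, and the base
    runs until the next 1 (or to trimmed_len for the last base).
    ts is added so the ranges index into the untrimmed raw signal.
    """
    if trimmed_len is None:
        # Fall back to the span covered by the move table itself.
        trimmed_len = len(moves) * stride
    starts = [i * stride for i, m in enumerate(moves) if m == 1]
    bounds = starts + [trimmed_len]
    return [(ts + a, ts + b) for a, b in zip(bounds, bounds[1:])]


stride, moves = parse_mv_tag("mv:B:c,5,1,0,0,1,0,1")
ranges = base_signal_ranges(stride, moves)
durations = [end - start for start, end in ranges]
print(ranges)     # [(0, 15), (15, 25), (25, 30)]
print(durations)  # [15, 10, 5]
```

Passing the read's ts value (for example `ts=1714`) shifts the ranges so that they index into the raw signal stored in the pod5 file instead of the trimmed interval.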