Poly(A) tail estimation post basecalling

Open danielcav opened this issue 4 months ago • 1 comment

New issue checks

[x] I did not find an existing feature request.

Summary

Feature request

Hello,

I would like to estimate poly(A) tail lengths from my nanopore data. I noticed that the basecaller has the --estimate-poly-a option, but currently this has to be specified at basecalling time. Since basecalling large datasets can take more than a day, it would be very useful to have the option to run poly(A) estimation after basecalling, without having to re-run the entire basecalling process.

Does this functionality already exist in Dorado?

If not, would it be feasible to add a standalone poly(A) estimation step that works on existing Dorado outputs (FASTQ/BAM/POD5)?

I have also looked into Nanopolish and Tailfindr, but both rely on FAST5 input, which is now obsolete, I guess. Are there any existing tools, or plans to adapt similar approaches, that can work directly with POD5 and Dorado outputs?

Thank you very much for your time and for developing Dorado!

Aug 25 '25 16:08 danielcav

Hi @danielcav,

Thanks for raising this. Dorado's poly(A) estimation relies on move table information, which at this point requires basecalling. We agree that a post-run method that reads the move table and sequence from the BAM record and the signal from the POD5 file would greatly improve throughput when reanalysing data, but it's not something we've worked on yet.
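For illustration only, here is a minimal sketch of the first step such a post-run method would need: turning a move table into per-base signal ranges. It is not Dorado's implementation and does not estimate tail lengths. It assumes the move table was emitted at basecalling time and is stored in the BAM `mv` tag as the stride followed by 0/1 moves, and that the `ts` tag gives the number of samples trimmed from the start of the signal. The function names `base_to_signal_ranges` and `signal_span` are hypothetical. For direct RNA reads the base order relative to the signal may be reversed, which this sketch does not handle.

```python
from typing import List, Tuple


def base_to_signal_ranges(mv_tag: List[int], ts: int = 0) -> List[Tuple[int, int]]:
    """Map each basecalled base to a [start, end) range of raw signal samples.

    mv_tag: values of the BAM `mv` tag (assumed layout: stride, then 0/1 moves).
    ts:     samples trimmed from the start of the signal (assumed `ts` tag).
    """
    if not mv_tag:
        raise ValueError("empty move table")
    stride, moves = mv_tag[0], mv_tag[1:]
    # A move of 1 marks the model step at which a new base begins.
    starts = [ts + i * stride for i, m in enumerate(moves) if m == 1]
    end_of_signal = ts + len(moves) * stride
    # Each base ends where the next one starts; the last ends with the table.
    ends = starts[1:] + [end_of_signal]
    return list(zip(starts, ends))


def signal_span(ranges: List[Tuple[int, int]], first: int, last: int) -> Tuple[int, int]:
    """Signal range covering bases first..last (inclusive, 0-based, signal order)."""
    return ranges[first][0], ranges[last][1]


# Toy example: stride 5, three bases, 10 samples trimmed from the start.
ranges = base_to_signal_ranges([5, 1, 0, 0, 1, 1, 0], ts=10)
print(ranges)
print(signal_span(ranges, 1, 2))
```

With ranges like these, the samples belonging to a candidate poly(A) region could be sliced out of the raw signal read from the POD5 file; how to turn that slice into a length estimate is left open here.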

Aug 26 '25 08:08 malton-ont