dorado aligner/polish: @RG header examples please

Open nextgenusfs opened this issue 6 months ago • 2 comments

We do our basecalling remotely, standardized across all datasets, and then store FASTQ files for downstream analysis. We have been using medaka for error correction, but it sounds like dorado polish is now mature enough to use for any de novo assemblies.

However, it looks like dorado aligner requires the basecaller BAM file for alignment. We don't hold onto this, nor do I plan to, so can you let me know specifically what I need to add to the BAM header so that it will work with dorado polish?

https://github.com/nanoporetech/dorado?tab=readme-ov-file#error-caught-exception-input-bam-file-has-no-basecaller-models-listed-in-the-header

Thanks!

nextgenusfs, May 22 '25 17:05

Hi @nextgenusfs ,

Here is an example aligned BAM file with an @RG line: https://github.com/nanoporetech/dorado/blob/release-v1.0/tests/data/polish/test-01-supertiny/calls_to_draft.bam

Make sure that the basecall_model matches your data, otherwise you may get the wrong model auto-selected.
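For anyone assembling such a header line by hand, here is a minimal Python sketch of what an @RG line carrying the model name might look like. It assumes the model is stored in the DS field as `basecall_model=<name>` next to a `runid=` entry; that layout, the helper name `build_rg_line`, the read group ID, and the model string are all illustrative assumptions, so compare the result against the header of the example BAM above (e.g. with `samtools view -H`) before relying on it.

```python
def build_rg_line(read_group_id: str, basecall_model: str, run_id: str = "unknown") -> str:
    """Return a tab-separated SAM @RG header line that names the basecaller model.

    Assumption: the model is read from the DS field as 'basecall_model=<name>'.
    """
    description = f"runid={run_id} basecall_model={basecall_model}"
    fields = ["@RG", f"ID:{read_group_id}", f"DS:{description}", "PL:ONT"]
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical values; substitute the model that actually basecalled your reads.
    print(build_rg_line("rg1", "dna_r10.4.1_e8.2_400bps_hac@v5.0.0"))
```

The printed line would then need to be merged into the BAM header by whatever reheadering step you already use; this sketch only formats the text.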

Also, your aligned BAM needs to be produced with dorado aligner - this is a requirement for polish.

Note that if your basecalled data contains move tables (mv tag) then dorado polish can use the move-aware models and yield better output quality. (The move tables need to be generated during the basecalling process.)
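To see whether a given set of reads still carries move tables, one rough check is to look for an `mv` optional tag on the SAM records. The sketch below works on plain SAM text (for example, lines piped from `samtools view`); the helper name `has_move_table` and the sample records are made up for illustration, and the check only looks at the tag name, not at the tag's contents.

```python
def has_move_table(sam_record: str) -> bool:
    """Return True if a SAM alignment line carries an 'mv' optional tag."""
    fields = sam_record.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    return any(field.startswith("mv:") for field in fields[11:])


if __name__ == "__main__":
    # Hypothetical unaligned records, one with and one without a move table.
    with_mv = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tmv:B:c,5,1,0,1,1"
    without_mv = "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tqs:f:12.5"
    print(has_move_table(with_mv), has_move_table(without_mv))
```

Reads stored only as FASTQ will normally not have this information unless the tags were explicitly carried over into the read headers.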

Hope this helps!

svc-jstone, Jun 02 '25 08:06

It's unsurprising how many of these posts I'm seeing about fastq support for polish. Fully agreed that fastq is the superior and more ubiquitous format (particularly in bacterial workflows, metagenomics, etc). Many cores default to fastq output as well for their data deliverable. fastq also enables things that bam makes very challenging, such as dynamically splitting and piping reads to different bins or barcodes, rather than relying on a slower and inflexible series of bespoke "bam-aware" tools and their dependencies. So for many reasons, it is indeed not reasonable to expect everyone to switch to bam.

@nextgenusfs if you haven't seen it already: https://github.com/nanoporetech/dorado/issues/1384

fastq is by far the dominant format in my circles (metagenomics). My own tools only take fastq/fasta. I do not have any need for htslib or the human-genome-centric tools, so forcing primary storage in a semi-opaque format that has no benefit over the dependency-free, flexible and easily-parsed (with any commandline tool on any OS) fastq format is a clear faux pas for many use-cases. bam is what some tools use for alignment. It's not a good medium for manipulation and transport of raw reads. There's a reason fastq still lingers as a dominant format for basecalled reads.

More discussion: https://github.com/nanoporetech/dorado/issues/1411

Okay, enough out of me for now. I don't want to be known as the FASTQ guy. 😄

GabeAl, Jul 16 '25 20:07