i miss guppy's barcoding_summary.txt (enhancement)

Open Puputnik opened this issue 1 year ago • 2 comments

Hi everyone,

I was an enthusiastic guppy user and, although I love dorado, I really miss the barcoding_summary.txt file that was produced during demultiplexing/barcode trimming (and, as far as I understand, there is no way to get something similar using dorado). It was very useful for several reasons: to easily check the trimmed sequences, to perform ad hoc filtering based on barcode scores and, most importantly, to select double-barcoded reads without having to re-run the demultiplexing (again, with ad hoc criteria based on barcode scoring).

I know that, at least for the last point, this can be done by using --no-trim during basecalling and then performing two separate demux runs (one of them with the --barcodes-both-ends flag), but it's definitely not handy, especially when looking for base modifications (for which I would have to fix the MM and ML tags with modkit, since I'm forced to trim the barcodes after basecalling).
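To make the two-run workaround concrete, here is a minimal Python sketch of the final selection step. It assumes the read-to-barcode assignments from each demux run have already been collected into dictionaries; the function name, the dictionary layout, and the "unclassified" label are illustrative assumptions, not dorado output formats.

```python
def select_double_barcoded(single_end, both_ends, unclassified="unclassified"):
    """Return reads that got the same barcode in both demux runs.

    single_end / both_ends: dicts mapping read_id -> barcode label
    (hypothetical layout; how they are filled depends on your pipeline).
    """
    selected = {}
    for read_id, barcode in both_ends.items():
        # Reads the both-ends run could not classify are not double barcoded.
        if barcode == unclassified:
            continue
        # Keep the read only if the single-end run agrees on the barcode.
        if single_end.get(read_id) == barcode:
            selected[read_id] = barcode
    return selected


# Toy data standing in for the results of the two demux runs.
single_end = {"read1": "barcode01", "read2": "barcode02", "read3": "barcode03"}
both_ends = {"read1": "barcode01", "read2": "unclassified", "read3": "barcode03"}
double = select_double_barcoded(single_end, both_ends)
```

This only shows the bookkeeping; it does not remove the need to run demux twice, which is the inconvenience described above.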

It would be great to have a flag that produces a similar summary file, built into dorado basecalling (when using --kit-name). That way, I could directly produce an aligned and trimmed BAM via dorado basecalling and keep flexibility about later filtering strategies. This would be a life saver (or at least a big time saver).

Thanks a lot for the hard work, and thanks in advance for the help!

Puputnik avatar Apr 11 '24 15:04 Puputnik

Hi @Puputnik,

Dorado 0.6.0 introduced the --emit-summary flag to dorado demux - it's not available directly on basecalling, but would this be sufficient?

malton-ont avatar Apr 12 '24 07:04 malton-ont

Hi @malton-ont,

Dorado's barcoding summary is insufficient for my group's needs, and we must continue to use guppy just for its superior summary.

When using the --emit-summary option, dorado demux prints a file with only three columns: filename (this is empty for me), read_id, and barcode.
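For reference, a three-column file like this can still be tallied per barcode. The sketch below assumes the file is tab-separated with a header row named filename, read_id, and barcode, as described above; the delimiter and exact header spelling are assumptions, so adjust them to the real file.

```python
import csv
import io
from collections import Counter


def count_reads_per_barcode(handle):
    """Count reads per barcode in a three-column, tab-separated summary."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["barcode"] for row in reader)


# Toy content with an empty filename column, mirroring the behaviour noted above.
example = (
    "filename\tread_id\tbarcode\n"
    "\tread1\tbarcode01\n"
    "\tread2\tbarcode01\n"
    "\tread3\tunclassified\n"
)
counts = count_reads_per_barcode(io.StringIO(example))
```

With a real file, pass an open file handle instead of the io.StringIO object.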

Guppy prints additional fields in its barcoding_summary.txt, many of which I rely on for troubleshooting and/or in our workflows:

  • read direction
  • front & rear barcode IDs
  • front & rear scores, for custom filtering. I understand scores will likely be different for dorado.
  • refseq (reference sequence guppy matched against)
  • foundseq (actual mask + barcode sequence found in the read)
  • foundseq length
  • foundseq index (position in read)

I have used all of the above. At a minimum, however, our workflows require the read direction, barcode IDs, foundseq length, and foundseq index. We'll continue to use guppy until dorado can match this output. I'm happy to get into detail about our use case if it would help.
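As an illustration of the kind of custom filtering mentioned above, here is a hedged Python sketch that keeps reads whose front and rear scores both pass a threshold. The column names (read_id, barcode_front_score, barcode_rear_score), the tab delimiter, and the threshold value are assumptions made for the example; check them against the header of your own barcoding_summary.txt.

```python
import csv
import io


def filter_by_scores(handle, min_score,
                     front_col="barcode_front_score",
                     rear_col="barcode_rear_score"):
    """Return read IDs whose front AND rear scores are >= min_score.

    Column names are assumed, not taken from any official specification.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["read_id"]
        for row in reader
        if float(row[front_col]) >= min_score
        and float(row[rear_col]) >= min_score
    ]


# Toy summary with made-up scores.
example = (
    "read_id\tbarcode_front_score\tbarcode_rear_score\n"
    "read1\t92.5\t88.0\n"
    "read2\t95.0\t41.2\n"
    "read3\t35.0\t90.1\n"
)
kept = filter_by_scores(io.StringIO(example), min_score=60.0)
```

The same pattern extends to the other fields listed above (read direction, foundseq length, foundseq index) by adding conditions on the corresponding columns.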

Thanks for your work on dorado and any consideration of this request!

aphorton avatar May 08 '24 20:05 aphorton