
How to correct super ultra-long reads?


Dear developers/users,

We use ONT ultra-long reads to assemble a species' super-repetitive region (estimated at around 20 Mb). The term ‘ultra-long read’ generally refers to reads with an N50 of around 100 kb, but we have found that it is impossible to assemble this region successfully using data with an N50 of 100 kb. Therefore, we decided to use reads with an N50 of 400-500 kb.

However, when we did error correction with dorado, we found that almost all reads > 200 kb were lost. Is it possible to do error correction while keeping the long reads?

Thanks in advance, Jung

Jung19911124 avatar Mar 30 '25 11:03 Jung19911124

Hi @Jung19911124 ,

However, when we did error correction with dorado, we found that almost all reads > 200 kb were lost.

Are those reads completely removed from the output, or just split into multiple pieces (e.g. do they have a :[number] appended to the end of the output header)?
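
For example, to check whether the long reads were split rather than dropped, something like this should work (assuming the corrected output is a FASTA named corrected.fasta; adjust the name to your run):

# corrected reads whose header carries a :N split suffix
grep '^>' corrected.fasta | grep -Ec ':[0-9]+$'
# corrected reads without a split suffix
grep '^>' corrected.fasta | grep -Evc ':[0-9]+$'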

There is nothing in dorado correct which would intentionally filter out super ultra-long reads, but perhaps some tweaks are needed, or there is a bug, so let's look into it.

What is the average coverage of your super ultra-long read dataset?
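
A rough estimate is total sequenced bases divided by genome size. For example, for an uncompressed 4-line FASTQ and a hypothetical 1 Gb genome (substitute your own estimate):

# sum the read lengths (every 2nd line of each 4-line FASTQ record), divide by genome size
awk 'NR % 4 == 2 { total += length($0) } END { printf "%.1fx\n", total / 1e9 }' reads.fastq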

Can you try running the overlap and correction steps separately? For example:

# step 1: compute the all-vs-all overlaps only and write them as PAF
dorado correct --to-paf reads.fastq > overlaps.paf
# step 2: run the correction inference using the precomputed overlaps
dorado correct --from-paf overlaps.paf reads.fastq > corrected.fasta

Do the super ultra-long reads have overlaps in the generated PAF file but no corrected sequences in the output? The target name column (column 6) of the PAF file is the important one, as it refers to the target read which will be corrected. If there is decent coverage of a target read but it did not produce a corrected sequence in the output, that is something I would like to look into.
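
For example, with standard command-line tools (READ_ID is a placeholder for a read you want to inspect):

# all overlaps whose target is the read of interest
awk -v tn="READ_ID" '$6 == tn' overlaps.paf > target_overlaps.paf
# overlap counts for the ten most-covered target reads
cut -f6 overlaps.paf | sort | uniq -c | sort -rn | head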

Could you by any chance extract PAF overlaps for one or more such target reads, and also extract the FASTQ reads referred to by those overlaps (both query and target names, just make sure you deduplicate them), so I can try to reproduce the issue locally?
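
Something along these lines should do it (seqkit is just one option for pulling records by ID; any equivalent tool works):

# deduplicated query and target names from the overlap subset
cut -f1,6 target_overlaps.paf | tr '\t' '\n' | sort -u > ids.txt
# extract the corresponding FASTQ records
seqkit grep -f ids.txt reads.fastq > subset.fastq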

Thanks in advance!

svc-jstone avatar Apr 03 '25 08:04 svc-jstone

Thank you for your kind reply. I'll try the method you suggested for now. Give me some time.

Here is a summary of the results of error correction using a portion of the data.

|        | No. of seqs | Sum of lengths (bp) | Min. length (bp) | Avg. length (bp) | Max. length (bp) | N50 (bp) |
|--------|-------------|---------------------|------------------|------------------|------------------|----------|
| Before | 2,681,436   | 9,864,973,163       | 5                | 3,679            | 1,863,916        | 101,585  |
| After  | 132,943     | 2,388,294,629       | 2                | 17,964.8         | 255,295          | 25,074   |

Best, Jung

Jung19911124 avatar Apr 03 '25 09:04 Jung19911124

I see what you mean that the corrected reads are not as long as the input ones (e.g. the difference in the maximum length column).

One potential issue is that such long reads can also be: (1) chimeric, e.g. with missing adapters, which means that error correction breaks them up based on alignment gaps; or (2) spanning a very repetitive region with no coverage (e.g. a low-complexity region where Minimap2 filters out all the reads covering this region). In both cases there is a coverage gap, and Dorado Correct (and HERRO) by design break reads where there is a coverage gap and produce multiple output pieces (with a :[number] appended to the end of the header).

Try looking for the header of that longest read in the output and checking how many pieces it has. If you want to dig deeper, you can also check the alignments to see if there is a coverage gap somewhere.
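
For example (LONGEST_ID is a placeholder for the original read name; the gap check assumes you kept overlaps.paf from the two-step run above):

# how many pieces the longest read produced in the corrected output
grep -c '^>LONGEST_ID' corrected.fasta
# sort the target-coordinate intervals of its overlaps and report uncovered gaps
awk -v tn="LONGEST_ID" '$6 == tn { print $8 "\t" $9 }' overlaps.paf | sort -n |
  awk 'NR == 1 { end = $2; next } $1 > end { print "gap: " end "-" $1 } $2 > end { end = $2 }'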

svc-jstone avatar Apr 03 '25 11:04 svc-jstone

This sounds like expected behaviour. If you have further information, please re-open the ticket.

malton-ont avatar Oct 13 '25 14:10 malton-ont