
The "true" assembly for the five real datasets

Open lh3 opened this issue 5 years ago • 1 comments

Thanks for the evaluation. This is very comprehensive. @ruanjue and I would like to investigate why wtdbg2 didn't work well. We wonder if you have the "true" assembly (i.e. the Unicycler hybrid assembly) for the five samples in the evaluation: SAMN10819801, SAMN10819805, SAMN10819807, SAMN10819813 and SAMN10819815. Thanks in advance!

By the way, you mentioned:

To encourage minimap2 to align through lower identity regions in the assembly (as opposed to stopping one alignment and starting another), we also used the -r 10000 -g 10000 options.

These two parameters only help with long gaps. To align through low-identity regions, it would be better to apply a large -z. When evaluating contiguity in the wtdbg2 manuscript, I was using

minimap2 --paf-no-hit -cxasm20 -r2k -z1000,500

You also mentioned:

Wtdbg2 assemblies often contain junky regions hundreds of base pairs in size.

Using a large -z will greatly help to align through such regions. I wonder how much contiguity is affected by such local issues. Of course, local issues are still issues, but knowing where the issues come from (assembly vs consensus) will help us to improve wtdbg2 further. Thank you.
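One rough way to see how fragmented the alignments are is to count, for each contig, how many alignment blocks it produces and what fraction of it is covered. The sketch below assumes standard 12-column PAF input, and it assumes that unmapped contigs written by --paf-no-hit carry `*` as the target name; `summarize_paf` is a hypothetical helper, not part of minimap2.

```python
from collections import defaultdict


def summarize_paf(lines):
    """Return {contig: (number_of_alignments, fraction_of_contig_aligned)}."""
    hits = defaultdict(list)   # contig name -> list of (query_start, query_end)
    qlens = {}                 # contig name -> contig length
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:   # skip anything that is not a full PAF record
            continue
        qname, qlen = fields[0], int(fields[1])
        qstart, qend, tname = int(fields[2]), int(fields[3]), fields[5]
        qlens[qname] = qlen
        if tname != "*":       # assumption: "*" marks an unmapped contig
            hits[qname].append((qstart, qend))

    summary = {}
    for qname, qlen in qlens.items():
        covered, end = 0, 0
        # Merge overlapping query intervals so bases are not counted twice.
        for start, stop in sorted(hits[qname]):
            start = max(start, end)
            if stop > start:
                covered += stop - start
                end = stop
        fraction = covered / qlen if qlen else 0.0
        summary[qname] = (len(hits[qname]), fraction)
    return summary


if __name__ == "__main__":
    example = [
        "ctg1\t1000\t0\t400\t+\tref\t5000\t0\t400\t390\t400\t60",
        "ctg1\t1000\t350\t1000\t+\tref\t5000\t600\t1250\t640\t650\t60",
        "ctg2\t500\t0\t0\t*\t*\t0\t0\t0\t0\t0\t0",
    ]
    for name, (n, frac) in summarize_paf(example).items():
        print(name, n, round(frac, 3))
```

A contig that is fully covered but split into many blocks points to local issues, whereas one with large uncovered stretches points elsewhere. Comparing the block counts at a small and a large -z would show how much of the fragmentation comes from low-identity regions.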

PS: -z controls the Z-drop heuristic, which is similar to BLAST's X-drop. Roughly speaking, minimap2 stops alignment when the alignment score drops more than -z along the 45º line in the DP matrix. The second parameter of -z is for inversion alignment, less relevant to alignment through low-identity regions.
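To make the idea concrete, here is a deliberately simplified sketch of a drop-off heuristic. It extends along a single diagonal with no gaps, so it is closer to a plain X-drop than to the real Z-drop in minimap2, which also accounts for gaps. The function name and the match/mismatch scores are illustrative assumptions, not minimap2's code or defaults.

```python
def extend_with_drop(query, target, z, match=1, mismatch=-4):
    """Gapless extension along one diagonal with a drop-off threshold.

    Returns (aligned_length, best_score). Extension stops as soon as the
    running score falls more than `z` below the best score seen so far.
    """
    score = 0
    best_score = 0
    best_len = 0
    for i, (q, t) in enumerate(zip(query, target), start=1):
        score += match if q == t else mismatch
        if score > best_score:
            best_score, best_len = score, i
        elif best_score - score > z:
            break  # score dropped too far: give up on this extension
    return best_len, best_score


if __name__ == "__main__":
    # 50 matching bases, a 20 bp "junky" region, then 100 matching bases.
    target = "A" * 170
    query = "A" * 50 + "C" * 20 + "A" * 100

    print(extend_with_drop(query, target, z=40))   # stops before the junk
    print(extend_with_drop(query, target, z=100))  # aligns through the junk
```

With the small threshold the alignment ends at the junky region and a second alignment would have to start after it. With the large threshold the extension survives the temporary score drop and continues to the end.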

PPS: I have just tried wtdbg2 on the 5 samples. Wtdbg2 largely gives one contig in 9 out of 10 cases (5 samples times 2 technologies). I speculate that its poor contiguity in the evaluation was more often caused by low-identity regions.

lh3 · May 31 '19 15:05

Dear Ryan R. Wick and Kathryn E. Holt,

Wtdbg2 aims to provide a quick solution for long-read assembly, and it often leaves the polishing step to other tools such as quiver or pilon. Recently, however, I have found that people are interested in its own consensus quality, so this is a good opportunity for wtdbg2 to improve it. I just wrote a script, best_sam_hits2longreads.pl, that selects the best alignments for wtpoa-cns. It works well and now produces much better consensus bases. Please see it at https://github.com/ruanjue/wtdbg2#getting-started.

Could you kindly send the reference sequences used in the evaluation of the real datasets SAMN10819801, SAMN10819805, SAMN10819807, SAMN10819813 and SAMN10819815? They will be important for @lh3 and me to find out what goes wrong in bacterial assembly with wtdbg2. Thanks in advance!

Jue

ruanjue · Jun 02 '19 03:06