Question about Hifiasm algorithm

Open cjain7 opened this issue 1 year ago • 2 comments

In the review article "Genome assembly in the telomere-to-telomere era", @lh3 and @richarddurbin have mentioned that:

When constructing an overlap graph, we discard a read contained in longer reads. This apparently straightforward step may lead to assembly gaps...To alleviate this problem, hifiasm tries to rescue a contained read if having the read would patch an assembly gap. This heuristic works in simple cases but is not always reliable.

Can you please give some intuition for why this method is not reliable? Is this method difficult to implement, or is there a fundamental issue with this approach? Knowing your insights would be helpful.

Thanks!

cjain7, Jan 02 '24 04:01

There is no universally optimal route to assembly of variable-length reads, even when they are error-free. Figures 3 and 4 in the manuscript illustrate some of the issues. Fixes in one direction lead to potential problems in another direction. The best heuristics depend on the distribution of read lengths, the distribution of repeat lengths in the genome and the distribution of coverage.

Richard

From: Chirag Jain
Date: Tuesday, 2 January 2024 at 04:31
To: chhylp123/hifiasm
Cc: Richard Durbin
Subject: [chhylp123/hifiasm] Question about Hifiasm algorithm (Issue #586)

richarddurbin, Jan 02 '24 13:01

Can you please give some intuition for why this method is not reliable? Is this method difficult to implement, or is there a fundamental issue with this approach?

Suppose there is an 80kb homozygous region. You have one 100kb read on the paternal haplotype and many ~20kb reads on the maternal haplotype. Most of the ~20kb reads would be contained in the 100kb read and therefore discarded. Hifiasm, in my understanding, only attempts to rescue one or a couple of reads. That would not work in this example, because it would need to build a path over the contained reads and rescue all of them. In addition, here we know there are only two haplotypes. In satellites there may be multiple repeat haplotypes, and rescuing contained reads will be even harder in that case.
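A toy sketch of this example may help; it is not hifiasm's code. It assumes reads are plain coordinate intervals, that a maternal read counts as contained in the paternal read only when it lies entirely inside the homozygous region, and that two reads can be joined only if they overlap by at least `min_overlap`. All names, coordinates and thresholds here (`Read`, `drop_contained`, `maternal_gaps`, the 10kb tiling step, the 5kb overlap) are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Read:
    name: str
    hap: str    # "pat" or "mat"
    start: int
    end: int

# Assumed layout: an 80kb homozygous region inside a 200kb locus.
HOM_START, HOM_END = 60_000, 140_000

def is_contained(short, long_):
    """True if `short` looks like a substring of the strictly longer `long_`."""
    if (short.end - short.start) >= (long_.end - long_.start):
        return False
    inside = long_.start <= short.start and short.end <= long_.end
    # Reads from different haplotypes only look identical inside the homozygous region.
    same_seq = short.hap == long_.hap or (HOM_START <= short.start and short.end <= HOM_END)
    return inside and same_seq

def drop_contained(reads):
    """Split reads into (kept, contained), as in standard overlap-graph cleaning."""
    kept, contained = [], []
    for r in reads:
        (contained if any(is_contained(r, o) for o in reads) else kept).append(r)
    return kept, contained

def maternal_gaps(reads, min_overlap=5_000):
    """Positions where maternal reads can no longer be chained by overlaps."""
    mat = sorted((r for r in reads if r.hap == "mat"), key=lambda r: r.start)
    gaps, reach = [], mat[0].end
    for r in mat[1:]:
        if r.start > reach - min_overlap:
            gaps.append((reach, r.start))
        reach = max(reach, r.end)
    return gaps

# One 100kb paternal read spanning the homozygous region, and 20kb maternal
# reads tiled every 10kb across the whole locus.
paternal = Read("pat_100kb", "pat", 50_000, 150_000)
maternal = [Read(f"mat_{s // 1000}k", "mat", s, s + 20_000) for s in range(0, 190_000, 10_000)]
reads = [paternal] + maternal

kept, contained = drop_contained(reads)
gaps_before = maternal_gaps(reads)                    # full read set
gaps_after = maternal_gaps(kept)                      # after discarding contained reads
gaps_one_rescued = maternal_gaps(kept + contained[:1])
gaps_all_rescued = maternal_gaps(kept + contained)

print(len(contained), "maternal reads discarded as contained")
print("maternal gaps after discarding:", gaps_after)
print("maternal gaps after rescuing one read:", gaps_one_rescued)
print("maternal gaps after rescuing all reads:", gaps_all_rescued)
```

Under these assumptions, seven maternal reads are discarded and the maternal haplotype is left with a single gap from 70kb to 130kb. Rescuing one read only shrinks the gap to 80kb to 130kb; it closes only when the whole chain of contained reads is restored, which is the path-building problem described above.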

lh3, Jan 02 '24 14:01