hifiasm

missing 8Mb sequences in the assembly

Open xzhoubayer opened this issue 2 months ago • 6 comments

We assembled 11 genomes of the same crop species with hifiasm version 0.19.8-r603. One of the 11, lineA, was an outlier in overall assembly size. In particular, 8 Mb at the 5’-end of one of the chromosomes was missing from the assembly when compared with the others in the collection. Given the size and genome content of the missing sequence, we believe its absence is a technical artifact rather than a biological difference.

A few simple lines of evidence support our hypothesis that the region is erroneously missing:

When the HiFi reads of lineA were mapped against the assembly of a closely related line ("lineB") using minimap2, they showed an even depth of coverage over every chromosome of lineB, including the region corresponding to the 8 Mb missing at the 5’-end of lineA chr4.
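For anyone who wants to reproduce this kind of coverage check, the following is a minimal sketch and not the exact procedure used above. It assumes the minimap2 alignments were written in PAF format and computes mean depth per fixed-size window on each target sequence. The function name `windowed_depth`, the window size, and the MAPQ filter are illustrative choices.

```python
from collections import defaultdict


def windowed_depth(paf_lines, window=1_000_000, min_mapq=0):
    """Return {(target, window_start): mean depth} from PAF records.

    Uses the standard PAF columns: target name (6), target length (7),
    target start (8), target end (9) and mapping quality (12).
    Targets that receive no alignments at all do not appear in the result.
    """
    aligned = defaultdict(int)   # (target, window index) -> aligned bases
    target_len = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated records
        tname = fields[5]
        tlen, tstart, tend = int(fields[6]), int(fields[7]), int(fields[8])
        if int(fields[11]) < min_mapq:
            continue
        target_len[tname] = tlen
        # Spread the aligned interval over every window it overlaps.
        w = tstart // window
        while w * window < tend:
            lo = max(tstart, w * window)
            hi = min(tend, (w + 1) * window)
            aligned[(tname, w)] += hi - lo
            w += 1

    depth = {}
    for tname, tlen in target_len.items():
        n_windows = (tlen + window - 1) // window
        for w in range(n_windows):
            start = w * window
            size = min(window, tlen - start)  # last window may be shorter
            depth[(tname, start)] = aligned[(tname, w)] / size
    return depth
```

Passing an open PAF file handle as `paf_lines` and printing the windows for the chromosome of interest would show whether depth at the 5’-end is in line with the rest of the chromosome.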

Moreover, more than 11,000 HiFi reads from the lineA library mapped to the 5’-end of lineB, including the 8 Mb region in question. Critically, none of these reads appeared in the assembly graph file (asm.bp.p_ctg.noseq.gfa). We found the same result with an older version of hifiasm (0.18.2).
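A check of this kind can be sketched as follows. This is an illustration under stated assumptions, not the script used above: it assumes the noseq GFA lists the reads placed in each contig on `A` lines, with the read name in the fifth tab-separated column. The helper names `reads_in_gfa` and `missing_reads` are hypothetical.

```python
def reads_in_gfa(gfa_lines):
    """Collect read names from 'A' lines of a GFA file.

    Assumption: the read name is the fifth tab-separated field of each
    'A' line (A, contig, contig start, strand, read name, ...).
    """
    names = set()
    for line in gfa_lines:
        if not line.startswith("A\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 4:
            names.add(fields[4])
    return names


def missing_reads(read_names, gfa_lines):
    """Return the subset of read_names that has no 'A' line in the GFA."""
    return set(read_names) - reads_in_gfa(gfa_lines)
```

Here `read_names` would be the names of the reads that mapped to the region in question (for example, taken from column 1 of the PAF output), and `gfa_lines` an open handle on asm.bp.p_ctg.noseq.gfa. If every read of interest comes back as missing, none of them was placed in a primary contig.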

As a troubleshooting measure, we assembled lineA from the same set of HiFi reads using Canu 2.2. In the Canu assembly, the 8 Mb region was recovered and assembled in the correct position.

xzhoubayer, Apr 29 '24 15:04