How can I recover sequences from collapsed homologous regions?
Hello,
Thank you for developing such a good tool for polyploid genome assembly. I am currently assembling a tetraploid plant genome (AABB) with two closely related subgenomes using HiFi and Hi-C data (without ONT). My goal is to obtain two complete haploid genomes (A1B1 and A2B2). Hifiasm works very well overall, but I noticed that some sequences in the unitigs are incomplete.
For example, sequences that should normally exist in four copies [A1, A2, B1, B2] end up with only two copies [A1, B1] or three copies [A1, A2, B1] in the final assembly results [p_utg], while the missing copies (collapsed regions) seem to be merged chimerically into these sequences (please see the figure I uploaded). This genome landscape was also reported in the newly assembled sweetpotato genome (Fig. 2c in https://doi.org/10.1038/s41477-025-02079-6).
I found that #21 provides an example of collapse, but I am not sure whether it applies to my case. Meanwhile, I am also working on a genome with whole-chromosome collapses, and according to the user manual, the collapse subprogram is not suitable for handling entire chromosome collapses.
Thank you for your patience. My question is whether Cphasing could help improve genome assembly in these two situations. In addition, could you provide some suggestions on how to recover sequences from a completely collapsed chromosome?
Thank you
Hi, sorry for the late reply.
The collapsed regions in this genome show a clear continuity between two parts located on different chromosomes. You can directly copy these two parts from the AGP file to generate a new chromosome in a new AGP file, and then manually curate in Juicebox.
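As a rough illustration of the copy step, here is a minimal sketch that takes AGP rows you have already copied by hand into a file and rewrites them as one new object with fresh coordinates and part numbers. The helper name `renumber_agp`, the object name `chr_dup`, and the file names `picked_rows.agp` and `original.agp` are hypothetical; the sketch only assumes the standard tab-delimited AGP column layout and is not part of CPhasing.

```shell
# Hypothetical helper (not a CPhasing command): read copied AGP rows on stdin
# and print them as a single new object named "$1".
# Assumes tab-delimited AGP: object, beg, end, part_number, type, ...
renumber_agp() {
    awk -F'\t' -v OFS='\t' -v name="$1" '
        /^#/ || NF < 5 { next }          # skip header comments and short lines
        {
            len = $3 - $2 + 1            # span of this row in the source object
            $1 = name                    # new object (chromosome) name
            $2 = pos + 1                 # new start coordinate
            $3 = pos + len               # new end coordinate
            $4 = ++part                  # new part number
            pos += len
            print
        }
    '
}

# Possible usage (file names are assumptions):
# { cat original.agp; renumber_agp chr_dup < picked_rows.agp; } > new.agp
```

The component columns (contig name, contig coordinates, orientation) are left untouched, so the copied rows still point at the original contigs until they are renamed in the next step.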
After duplicating this new collapsed chromosome, you can follow the steps below to generate new .hic and .assembly files:
# Rename duplicated contigs to new names
cphasing collapse agp-dup new.agp -o dup.agp
# Randomly assign links from raw contigs to duplicated contigs
# (to avoid another round of mapping)
cphasing collapse pairs-dup input.pairs.pqs duplicated.contigs.txt -o dup.pairs.pqs
# Then follow `to_hic.sh` in `4.scaffolding` to generate the new `.hic` and `.assembly` files
Note: The duplicated.contigs.txt file should contain two columns: the source contig and its duplicated name, e.g.:
utg1\tutg1_d2
utg2\tutg2_d2
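If typing this file by hand is tedious, a minimal sketch like the following could draft it from `new.agp`. It lists every contig that appears in more than one component row and pairs it with a `_d2` name, following the example above. The helper name `make_dup_list` is hypothetical, and the `_d2` suffix is an assumption: please check the names against what `agp-dup` actually writes to `dup.agp`, and note that a contig present three or more times would need additional rows.

```shell
# Hypothetical helper (not a CPhasing command): print "source<TAB>source_d2"
# for every contig used in more than one component (type W) row of an AGP file.
# The "_d2" suffix is assumed from the example above.
make_dup_list() {
    awk -F'\t' -v OFS='\t' '
        $0 !~ /^#/ && $5 == "W" { count[$6]++ }   # count contig occurrences
        END {
            for (c in count)
                if (count[c] > 1) print c, c "_d2"
        }
    ' "$1" | sort
}

# Possible usage:
# make_dup_list new.agp > duplicated.contigs.txt
```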
Hi,
Thank you for your reply, this software is truly a powerful toolbox. I will try the protocol you mentioned, but I am not sure if I understood the details correctly.
Since these chromosomes were created by HapHiC plus manual curation, I think I should:
- Convert the `.assembly` file to `.agp` format. Then detect the collapsed regions and copy/paste those regions, without any edits, to the end of the `.agp` file. This step will output `new.agp`.
- Run the commands you suggested to generate new `.hic` and `.assembly` files.
Is that correct? I’m sorry, but I’m not very familiar with these tools. In addition, how should I handle cases where entire chromosomes are collapsed? Could I apply the same protocol?
Many thanks