ALLHiC
haplotype assembly for a polyploid
Hi, this follows up on issue #80.
The problem was assigning contigs to the homologous chromosome groups. We are working on the assembly of a tetraploid potato. If we do not separate the contigs into chromosome groups, we end up with a final supercontig of 2.2 Gb and a total length of 2.4 Gb in groups.asm.fasta.
But when we divide the contigs into the 12 homologous chromosomes, each split in two, giving 24 groups (chri_1, chri_2) because our reference is a haplotype assembly of a diploid potato, the contig sizes are reasonable.
We followed the script for the polyploid sugarcane based on sorghum. We believe you were able to recover the haplotype assembly for each chromosome? In that case, you had only one folder for each chromosome, right?
What is your suggestion for obtaining a haplotype assembly of a tetraploid potato using a diploid reference assembly to separate the contigs? Is there any way to use ALLHiC without a reference?
Did you use purge_haplotigs or something similar after using allhic?
Looking forward to an answer. Best regards
Hi @phrh, I guess there might be a misunderstanding about autopolyploid assembly. The chromosome number of tetraploid potato is 48 (2n=4x=48), so 48 pseudo-chromosomes should be expected rather than 24. The sugarcane AP85-441 we were working on has 32 chromosomes (1n=4x=32), and we therefore separated four haplotypes. There is no need to use purge_haplotigs. We expect to assemble as many sequences as possible.
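The arithmetic behind those counts can be written out as a minimal sketch. The function name is hypothetical, and the base numbers (12 for potato, 8 for the sugarcane line) are derived here from the formulas quoted above rather than stated in the thread.

```python
def expected_pseudochromosomes(base_number: int, ploidy: int) -> int:
    """Haplotype-resolved pseudo-chromosomes = base chromosome number (x) times ploidy."""
    return base_number * ploidy


# Tetraploid potato: 2n = 4x = 48, so x = 12 and four haplotypes are expected per chromosome.
potato = expected_pseudochromosomes(base_number=12, ploidy=4)

# Sugarcane AP85-441: 1n = 4x = 32, so x = 8 (assumption derived from the formula above).
sugarcane = expected_pseudochromosomes(base_number=8, ploidy=4)

print(potato, sugarcane)
```

Splitting each of the 12 potato chromosomes in two, as in the original setup, yields only 24 groups, which is half of what a fully phased tetraploid assembly should contain.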
Hi, maybe I was not clear. (1) Is there any way to use ALLHiC without a reference? (2) What would you suggest as a reference for a tetraploid potato to divide the Hi-C contigs: a haplotype version of a diploid potato (with two folders for each chromosome)? We used this assembly, which we consider the best one for the diploid potato, but we are having trouble finding the four haplotypes.
For example, for chromosome 1 we tried k from 1 to 4. For k=4, we ran it inside each of the chr1_1 and chr1_2 folders. We also merged the contents of Chr1_1 and Chr1_2 into Chr1 and used k=4 there as well, but the resulting groups differ greatly in size.
k=4
Chr1_1 | Chr1_2 | Chr1
104,023,204 | 76,792,460 | 235,470,213
10,993,786 | 31,758,127 | 28,995,993
1,231,480 | 17,286,196 | 14,487,011
820,152 | 21,025,674 | 6,149,390
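To make the imbalance concrete, here is a small sketch that takes the group sizes from the table above and reports, for each folder, the total length and the share held by the largest group. The helper names are hypothetical, and the sizes are assumed to be in base pairs. With k=4, a balanced split would put roughly 25% in each group.

```python
# Group sizes copied from the k=4 table above (assumed to be base pairs).
group_sizes = {
    "Chr1_1": [104_023_204, 10_993_786, 1_231_480, 820_152],
    "Chr1_2": [76_792_460, 31_758_127, 17_286_196, 21_025_674],
    "Chr1": [235_470_213, 28_995_993, 14_487_011, 6_149_390],
}


def largest_group_share(sizes):
    """Fraction of the total length that falls in the largest group."""
    return max(sizes) / sum(sizes)


def summarize(groups):
    """Map each folder name to (total length, share of the largest group)."""
    return {name: (sum(sizes), largest_group_share(sizes)) for name, sizes in groups.items()}


summary = summarize(group_sizes)
for name, (total, share) in summary.items():
    print(f"{name}: total={total:,} largest group share={share:.1%}")
```

In all three folders the largest group holds more than half of the total length, which is the size difference described above.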
What was your experience with sugarcane?
Hi @phrh, (1) A reference genome is necessary to separate allelic contigs, so ALLHiC requires one for polyploid assembly and phasing. For a diploid genome, there is no need to use a reference. (2) The reference genome should be a monoploid assembly of a diploid or a polyploid. That means the reference contains only one set of haplotypes rather than a phased assembly (e.g. the haplotype-resolved version assembled by Zhou et al., 2020). Does that make sense?
Yes, thanks. We are now using a monoploid assembly.
However, we followed the issue https://github.com/tangerzhang/ALLHiC/issues/75 and decided to (1) use ALLHiC_corrector and (2) filter Hi-C reads using HiCExplorer. Can you tell me what ALLHiC_corrector does?
ALLHiC_corrector detects and corrects misjoined contigs based on Hi-C signals. It uses the core contig-correction algorithm from 3D-DNA but does not include the iterative scaffolding steps, which saves a lot of time for big genomes. In addition, ALLHiC_corrector takes BAM files (from bwa mem) as input and does not need Juicebox mapping. We think it might be convenient for ALLHiC users.