C-Phasing

Phasing and scaffolding polyploid genomes based on Pore-C, HiFi-C/CiFi, Ultra-long, or Hi-C data

Introduction

A major problem with Hi-C scaffolding of polyploid genomes is the large proportion of ambiguously mapped short reads, which leads to a high rate of switch errors and chimeric assemblies. Long-read-based chromosome conformation capture technologies, such as Pore-C and HiFi-C (CiFi), provide an effective way to overcome this problem. We developed a new pipeline, C-Phasing, that is specifically tailored to polyploid phasing and takes advantage of Pore-C or HiFi-C data. It also works with Hi-C data and with diploid genome assemblies.

The advantages of C-Phasing:

High speed.
High anchor rate of genome.
High accuracy of polyploid phasing.

[Figure: Summary of C-Phasing]

Installation

Via activate_cphasing (Recommended)

linux-64 (x86-64)

## Download C-Phasing and install all dependencies
git clone https://github.com/wangyibin/CPhasing.git

## Activate the environment (on first use it configures itself, so run it while the network is accessible)
source ./CPhasing/bin/activate_cphasing

## deactivate environment
exit

linux-aarch64: download from the GitHub release page

Via Anaconda

## Download C-Phasing and install all dependencies
git clone https://github.com/wangyibin/CPhasing.git
cd CPhasing
conda env create -f environment.yml
conda activate cphasing

## Add these commands to .bash_profile or .bashrc
export PATH=/path/to/CPhasing/bin:$PATH
export PYTHONPATH=/path/to/CPhasing:$PYTHONPATH
## The Hi-C pipeline requires GLIBCXX_3.4.29; if your system does not provide it, add this command to your environment (.bash_profile)
export LD_LIBRARY_PATH=/path/to/anaconda3/envs/cphasing/lib:$LD_LIBRARY_PATH
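
Whether a given libstdc++ already provides GLIBCXX_3.4.29 can be checked before editing LD_LIBRARY_PATH. The following is a minimal sketch, not part of C-Phasing: `has_glibcxx` is a hypothetical helper, and the library path is the placeholder from the export line above with the usual `libstdc++.so.6` file name appended.

```shell
# Hypothetical helper (not part of C-Phasing): succeed if the given
# libstdc++ file contains the requested GLIBCXX version tag.
has_glibcxx() {
    # $1: path to libstdc++.so.6, $2: version tag (defaults to GLIBCXX_3.4.29)
    [ -f "$1" ] && grep -qF "${2:-GLIBCXX_3.4.29}" "$1"
}

# Placeholder path; replace it with the location of your own environment.
lib=/path/to/anaconda3/envs/cphasing/lib/libstdc++.so.6
if has_glibcxx "$lib"; then
    echo "GLIBCXX_3.4.29 found in $lib"
fi
```

If the check succeeds for the conda environment but the Hi-C pipeline still fails to find the symbol, the LD_LIBRARY_PATH export is probably not in effect in the current shell.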

One command pipeline of C-Phasing

The -n 8:4 parameter in the following commands means assembling a tetraploid (4) with a basic chromosome number of 8. Setting -n 0:0 partitions automatically in both rounds; -n 8:0 and -n 0:4 are also supported, in which case only the value set to 0 is determined automatically.
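
To make the arithmetic behind -n explicit, here is a small sketch that splits a BASE:PLOIDY value and reports how many groups it implies. `describe_n` is a hypothetical helper written for illustration only; it does not call C-Phasing and only mirrors the description above.

```shell
# Hypothetical helper (not part of C-Phasing): explain a -n BASE:PLOIDY value.
describe_n() (
    spec=$1
    case $spec in
        *:*) ;;
        *) echo "expected BASE:PLOIDY, got '$spec'" >&2; exit 1 ;;
    esac
    base=${spec%%:*}      # basic chromosome number (text before the colon)
    ploidy=${spec#*:}     # ploidy (text after the colon)
    total=unknown
    # The total group count is only known when neither value is automatic (0).
    if [ "$base" -gt 0 ] && [ "$ploidy" -gt 0 ]; then
        total=$((base * ploidy))
    fi
    if [ "$base" -eq 0 ]; then base=auto; fi
    if [ "$ploidy" -eq 0 ]; then ploidy=auto; fi
    echo "base=$base ploidy=$ploidy total=$total"
)

describe_n 8:4   # prints: base=8 ploidy=4 total=32
describe_n 8:0   # prints: base=8 ploidy=auto total=unknown
```

For a tetraploid with 8 basic chromosomes, the expected result is therefore 8 x 4 = 32 chromosome-level groups.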

Start from Pore-C data

cphasing pipeline -f draft.asm.fasta -pcd sample.fastq.gz -t 10 -n 8:4

Start from multiple Pore-C datasets by specifying multiple -pcd parameters.

cphasing pipeline -f draft.asm.fasta -pcd sample1.fastq.gz -pcd sample2.fastq.gz -t 10 -n 8:4
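
When there are many read files, the repeated -pcd options can be generated rather than typed. The sketch below only prints the resulting command (a dry run); `build_pipeline_cmd` is a hypothetical helper, the -t 10 -n 8:4 values are copied from the example above, and file names are assumed to contain no spaces.

```shell
# Hypothetical helper (not part of C-Phasing): print a pipeline command with
# one -pcd option per read file. It does not run cphasing.
build_pipeline_cmd() (
    fasta=$1
    shift                               # the remaining arguments are read files
    cmd="cphasing pipeline -f $fasta"
    for fq in "$@"; do
        cmd="$cmd -pcd $fq"
    done
    echo "$cmd -t 10 -n 8:4"
)

build_pipeline_cmd draft.asm.fasta sample1.fastq.gz sample2.fastq.gz
```

Printing the command first makes it easy to inspect before running it or pasting it into a job script.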

[!NOTE] If you want to run on a cluster and distribute the mapping across multiple nodes, you can use cphasing mapper and cphasing-rs porec-merge to generate a merged porec.gz file and pass it with the -pct parameter.

Start from a Pore-C table (porec.gz) generated by cphasing mapper.

cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -n 8:4

Start from HiFi-C/CiFi data
Run pipeline or mapper with the --mm2-params "-x map-hifi" parameter. The output is similar to that of Pore-C data.

cphasing pipeline -f draft.asm.fasta -pcd hific.fastq.gz --mm2-params "-x map-hifi" -t 10 -n 8:4

[!NOTE] HiFi-C/CiFi mapping results are handled like Pore-C results: the output carries the porec.gz suffix and can be processed with porec-merge, porec-intersect, and similar commands.

Start from paired-end Hi-C data

cphasing pipeline -f draft.asm.fasta -hic1 Lib_R1.fastq.gz -hic2 Lib_R2.fastq.gz -t 10 -n 8:4

[!NOTE] If you want to run multiple samples, you can use cphasing hic mapper and cphasing-rs pairs-merge to generate a merged pairs.gz file and pass it with the -prs parameter.

[!NOTE] If the total length of your input genome is larger than 8 Gb, specify -hic-mapper-k 27 -hic-mapper-w 14 to avoid a chromap error.
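
The 8 Gb rule can be checked directly from the draft assembly. This is a rough sketch under stated assumptions: `genome_length` and `hic_mapper_opts` are hypothetical helpers, the FASTA file is uncompressed, and 8 Gb is taken to mean 8,000,000,000 bases.

```shell
# Hypothetical helper: total number of bases in an uncompressed FASTA file.
genome_length() {
    # Sum the length of every non-header line; %.0f avoids integer overflow
    # in awk implementations that use 32-bit integers for %d.
    awk '!/^>/ { total += length($0) } END { printf "%.0f\n", total }' "$1"
}

# Hypothetical helper: print the extra options from the note above when the
# genome exceeds the limit, and print nothing otherwise.
hic_mapper_opts() {
    len=$(genome_length "$1")
    limit=${2:-8000000000}              # assumed meaning of "8 Gb"
    if [ "$len" -gt "$limit" ]; then
        echo "-hic-mapper-k 27 -hic-mapper-w 14"
    fi
}
```

A possible use is to append the result to the pipeline command, for example `cphasing pipeline -f draft.asm.fasta -hic1 Lib_R1.fastq.gz -hic2 Lib_R2.fastq.gz -t 10 -n 8:4 $(hic_mapper_opts draft.asm.fasta)`, so that the options are added only when needed.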

Start from a 4DN pairs file

cphasing pipeline -f draft.asm.fasta -prs sample.pairs.gz -t 10 -n 8:4

Skip some steps

## skip steps 1.alleles and 2.prepare
cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -ss 1,2

Perform only specified steps

## run only step 3.hyperpartition
cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -s 3

Add the -hcr parameter to remove greedy contacts (a few regions that contact the whole genome) and improve phasing quality. Specifying --pattern is recommended to improve the identification of high-confidence regions.

cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -hcr -p AAGCTT

Curation by Juicebox

Generate the .assembly and .hic files (depends on 3d-dna)

cphasing pairs2mnd sample.pairs.gz -o sample.mnd.txt
cphasing utils agp2assembly groups.agp > groups.assembly
bash ~/software/3d-dna/visualize/run-assembly-visualizer.sh groups.assembly sample.mnd.txt

[!NOTE] If chimeric contigs were corrected, use groups.corrected.agp and generate a new corrected.pairs.gz with cphasing-rs pairs-break.

After curation

## convert assembly to agp
cphasing utils assembly2agp groups.review.assembly -n 8:4 
## or, for a haploid genome or a single homologous group
cphasing utils assembly2agp groups.review.assembly -n 8
## extract contigs from agp 
cphasing agp2fasta groups.review.agp draft.asm.fasta --contigs > contigs.fasta
## extract chromosome-level fasta from agp
cphasing agp2fasta groups.review.agp draft.asm.fasta > groups.review.asm.fasta
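
After conversion, it can be useful to check how many contigs and how many bases each scaffold in the AGP file contains. The sketch below assumes the standard tab-separated AGP layout (object in column 1, component type in column 5, component begin and end in columns 7 and 8); `agp_summary` is a hypothetical helper, not a C-Phasing command.

```shell
# Hypothetical helper: for each object (scaffold or chromosome) in an AGP file,
# print its name, the number of contigs, and the total contig length.
agp_summary() {
    awk -F '\t' '
        # Skip comment lines; component type "W" marks a sequence component,
        # while gap lines use other types (such as "U" or "N").
        !/^#/ && $5 == "W" { n[$1]++; len[$1] += $8 - $7 + 1 }
        END { for (c in n) printf "%s\t%d\t%.0f\n", c, n[c], len[c] }
    ' "$1" | sort
}
```

Running `agp_summary groups.review.agp` then gives one line per object, which can be compared against the expected chromosome count from -n.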

Rename

Rename and orient chromosomes according to a monoploid reference (or the genome of a closely related species).

cphasing rename -r mono.fasta -f draft.asm.fasta -a groups.review.agp -t 20

[!NOTE] To save time, only the first haplotype (g1) is aligned to the monoploid reference, because the orientation of the different haplotypes has already been made consistent in the scaffolding step. If it has not, set --unphased to align all haplotypes to the monoploid reference and adjust their orientation.

Pipeline of Ultra-long data [Optional]

C-Phasing can use ultra-long reads to correct chimeric contigs and identify high-confidence regions (HCRs) to aid assembly.

hitig tutorial

For more details, please check the documentation:
Documentation | Chinese documentation (中文文档)

Citation

If you use C-Phasing in your work, please cite:

C-Phasing

Yibin Wang, Ping Zhao, Xiaofei Zeng, Jiaxin Yu, Aoqian Dong, Yi Liu, Mengwei Jiang, Fang Wang, Xiao Chen, Shengcheng Zhang, Shuai Chen, Yuqing Gong, Yixing Zhang, Ruicai Long, Maojun Wang, Haibao Tang and Xingtan Zhang. Enhanced Pore-C with C-Phasing Enables Chromosomal-Scale, Haplotype-Resolved Assembly of Ultra-Complex Genomes, 05 November 2025, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-7343323/v1]

And HapHiC:

Xiaofei Zeng, Zili Yi, Xingtan Zhang, Yuhui Du, Yu Li, Zhiqing Zhou, Sijie Chen, Huijie Zhao, Sai Yang, Yibin Wang, Guoan Chen. (2024) Chromosome-level scaffolding of haplotype-resolved assemblies using Hi-C data without reference genomes. Nature Plants, 10:1184-1200. doi: https://doi.org/10.1038/s41477-024-01755-3

And ALLHiC:

Xingtan Zhang, Shengcheng Zhang, Qian Zhao, Ray Ming, Haibao Tang. (2019) Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nature Plants, 5:833-845. doi: https://doi.org/10.1038/s41477-019-0487-8

For Hi-C data

For Hi-C data, users may also consider using our alternative software, HapHiC (https://github.com/zengxiaofei/HapHiC), which is specifically designed for Hi-C data and has demonstrated strong performance across multiple projects.
