HapHiC
Pore-C data
Hello, @zengxiaofei
Can HapHiC take Pore-C data as input? If not, what adjustments do I need to make?
Thank you in advance for your response
Hi @socialhang,
I have a plan to support Pore-C data, but it is not a high priority at the moment. I haven't conducted any tests on Pore-C data yet, so I am unable to provide any useful information about it at this time.
Best regards, Xiaofei
Hi @socialhang,
I have conducted an experiment on the test data of https://github.com/epi2me-labs/wf-pore-c. You can try the following steps with your own FASTA and FASTQ files:
(1) Execute the wf-pore-c pipeline to align Pore-C data and create a mock paired-end BAM file:
source ~/software/miniconda3/bin/activate nextflow
# modify --fastq, --ref, --cutter, and --chunk_size
nextflow run epi2me-labs/wf-pore-c \
--fastq /work/software/genome/wf-pore-c-265a1a4/test_data/porec_test.concatemers.fastq --chunk_size 100 \
--ref /work/software/genome/wf-pore-c-265a1a4/test_data/porec_test.fasta \
--cutter NlaIII \
--paired_end_minimum_distance 100 --paired_end_maximum_distance 200 --hi_c --mcool --paired_end
conda deactivate
Then, you will obtain a paired-end BAM file named null.ns.bam in the output/paired_end directory.
(2) Rename the Hi-C reads in null.ns.bam using the custom script rename_reads_for_bam.py, and filter the Hi-C reads:
python3 rename_reads_for_bam.py null.ns.bam | samtools view - -@ 14 -S -h -b -F 3340 -o HiC.bam
/path/to/HapHiC/utils/filter_bam.py HiC.bam 1 --threads 14 | samtools view -b -@ 14 -o HiC.filtered.bam
Finally, you can use HiC.filtered.bam for the HapHiC scaffolding pipeline.
Here is the custom script: rename_reads_for_bam.py.gz
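Conceptually, the renaming can be sketched like this. This is a rough illustration only, not the actual script logic: it assumes read1 always precedes its mate in the name-sorted BAM and that appending a running pair index is all that is needed. For real runs, please use the attached rename_reads_for_bam.py:
# Sketch only: append a running pair index to each QNAME so every mock
# read pair gets a unique name; a new pair starts at each read1 (FLAG 0x40)
samtools view -h null.ns.bam \
    | awk -F'\t' -v OFS='\t' '/^@/ {print; next}
        {if (int($2 / 64) % 2 == 1) n++;
         $1 = $1 "_" n; print}' \
    | samtools view - -b -o renamed.bam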
Please be aware that this method has not been fully tested. I welcome your feedback on any attempts you make. This will be of great help. Once this method is validated, I will integrate it into HapHiC.
Best regards, Xiaofei
Thanks a lot!
I will try today!
Hello! @zengxiaofei
I'm back! I'm so sorry, the wf-pore-c software could not run on my server. So could I use falign instead of wf-pore-c to produce the BAM file?
Hi @socialhang,
You could try it. It seems that falign can produce pairwise SAM files that are compatible with SALSA2 via the sam2salsa2 command in u4falign.
Best, Xiaofei
OK!
Hello! @zengxiaofei
I produced a pairwise BAM file using falign, but some errors happened:
Traceback (most recent call last):
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 483, in <module>
main()
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 464, in main
inflation = haphic_cluster(args)
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 323, in haphic_cluster
HapHiC_cluster.run(args, log_file=LOG_FILE)
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_cluster.py", line 2361, in run
flank_link_matrix, frag_index_dict = dict_to_matrix(
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_cluster.py", line 283, in dict_to_matrix
shape = len(frag_set)
TypeError: object of type 'NoneType' has no len()
Traceback (most recent call last):
File "/root/software/HapHiC_1.0.2/haphic", line 96, in <module>
subprocess.run(commands, check=True)
File "/root/miniconda3/envs/haphic/lib/python3.10/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py', 'hifiasm_hic_ul_p_utg.fa', 'small.1.bam', '32', '--remove_allelic_links', '4', '--Nx', '100', '--gfa', 'hifiasm.asm.hic.p_utg.gfa', '--min_RE_sites', '1', '--min_links', '1', '--min_link_density', '0', '--threads', '40', '--max_ctg_len', '400000']' returned non-zero exit status 1.
I think it's because of my BAM format.
small.1.bam.gz
Can you help me point out the errors in my BAM file?
Thanks a lot!
Hi @socialhang,
Could you please paste or upload 01.cluster/HapHiC_cluster.log?
Best, Xiaofei
OK, wait a minute
Is the small.1.bam that you uploaded the same one you used for HapHiC scaffolding, or just a portion of it?
Since my BAM file is 400 GB in size, small.1.bam is just a small subset that I extracted from it. In addition, I have also run HapHiC on this subset, and the error it reports is exactly the same as with the full data.
Please use this script (modify_bam_for_falign.py.gz) to modify the BAM file:
python3 modify_bam_for_falign.py small.1.bam | samtools view -b -o new.bam
OK! Thanks! I will try!
Some errors happened:
2023-12-25 09:02:22 <HapHiC_pipeline.py> [main] Pipeline started, HapHiC version: 1.0.2 (update: 2023.12.16)
2023-12-25 09:02:22 <HapHiC_pipeline.py> [main] Python version: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
2023-12-25 09:02:22 <HapHiC_pipeline.py> [haphic_cluster] Step1: Execute preprocessing and Markov clustering for contigs...
2023-12-25 09:02:22 <HapHiC_cluster.py> [run] Program started, HapHiC version: 1.0.2 (update: 2023.12.16)
2023-12-25 09:02:22 <HapHiC_cluster.py> [run] Python version: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
2023-12-25 09:02:22 <HapHiC_cluster.py> [parse_fasta] Parsing input FASTA file...
2023-12-25 09:02:34 <HapHiC_cluster.py> [parse_gfa] Parsing input gfa file(s)...
2023-12-25 09:02:38 <HapHiC_cluster.py> [stat_fragments] Making some statistics of fragments (contigs / bins)
2023-12-25 09:02:38 <HapHiC_cluster.py> [stat_fragments] bin_size is calculated to be 2000000 bp
2023-12-25 09:02:41 <HapHiC_cluster.py> [parse_bam] Parsing input BAM file...
Traceback (most recent call last):
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 483, in <module>
main()
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 464, in main
inflation = haphic_cluster(args)
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 323, in haphic_cluster
HapHiC_cluster.run(args, log_file=LOG_FILE)
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_cluster.py", line 2304, in run
full_link_dict, flank_link_dict, HT_link_dict, clm_dict, frag_link_dict, ctg_coord_dict, ctg_pair_to_frag = parse_bam(
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_cluster.py", line 1481, in parse_bam
frag_len_i, frag_len_j = frag_len_dict[frag_i], frag_len_dict[frag_j]
KeyError: 'utg000028l_bin4'
Traceback (most recent call last):
File "/root/software/HapHiC_1.0.2/haphic", line 96, in <module>
subprocess.run(commands, check=True)
File "/root/miniconda3/envs/haphic/lib/python3.10/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py', 'hifiasm_hic_ul_p_utg.fa', 'ul9.HiC.filter.correct0.bam', '32', '--remove_allelic_links', '4', '--Nx', '100', '--gfa', 'hifiasm.asm.hic.p_utg.gfa', '--min_RE_sites', '1', '--min_links', '1', '--min_link_density', '0', '--threads', '40', '--max_ctg_len', '400000']' returned non-zero exit status 1.
@zengxiaofei
It seems that a position in the BAM file (the fourth column, pos, or the eighth column, mpos) exceeds the length of the contig.
The log shows that the bin_size was calculated to be 2 Mb, and the error occurred when the mapping position of a read in the BAM file fell within the range [6,000,001, 8,000,000] (bin ID utg000028l_bin4). However, based on the BAM file you uploaded, the contig utg000028l is only 4,798,928 bp in length.
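You can verify this with a quick check along the following lines (a sketch only; adjust the file name as needed):
# Extract contig names and lengths from the @SQ header lines
samtools view -H small.1.bam \
    | awk -F'\t' '$1 == "@SQ" {
          for (i = 2; i <= NF; i++) {
              if ($i ~ /^SN:/) name = substr($i, 4)
              if ($i ~ /^LN:/) len = substr($i, 4)
          }
          print name "\t" len
      }' > contig_lengths.txt
# Report alignments whose pos (column 4) or mpos (column 8) exceeds the contig length
samtools view small.1.bam \
    | awk -F'\t' 'NR == FNR {len[$1] = $2; next}
          $4 > len[$3] {print "pos out of range:", $1, $3, $4}
          $8 > len[$7 == "=" ? $3 : $7] {print "mpos out of range:", $1, $7, $8}' \
      contig_lengths.txt -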
Hi! @zengxiaofei
I've found the problem and updated the script: correct_sam.py.zip. Now the process runs normally, but homologous chromosomes are clustered together in the results, and I think the signal between the different haplotypes has not been filtered out cleanly. How can I adjust the parameters?
Could you please provide a full view of the Hi-C contact heatmap that includes scaffold boundaries (called superscaffold in Juicebox, and surrounded by blue outlines)?
Could you give me an email address? I will send you the Hi-C heatmap privately. @zengxiaofei
You could send email to xiaofei_zeng#whu.edu.cn (replace # with @).
@socialhang or @zengxiaofei, have you managed to adjust the parameters to improve your contact map and distinguish between the different haplotypes?
Hi @LikeLucasLab,
The primary challenge when working with Pore-C data lies in achieving accurate mapping. We conducted some tests using falign and adapted HapHiC to be compatible with this aligner through a Python script. Although Pore-C data performed comparably to NGS-based Hi-C data in scaffolding haplotype-collapsed assemblies, it exhibited much lower mapping accuracy when it came to distinguishing haplotypes, resulting in much stronger Hi-C signals between haplotigs. We tried different MAPQ filtering criteria, but the results did not improve significantly.
There may be some adjustments in HapHiC for Pore-C data in future updates. When using the current version of HapHiC to scaffold haplotype-phased assemblies, please always add the --remove_allelic_links parameter to remove Hi-C links between allelic contigs. A lower --nwindows or --concordance_ratio_cutoff value could also potentially improve the scaffolding results; see the sketch below.
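For example, a command along these lines (the file names and option values here are illustrative placeholders rather than tested recommendations):
# Illustrative only: 32 is the expected number of chromosome-level scaffolds,
# and 4 is the number of haplotypes passed to --remove_allelic_links, matching
# the commands earlier in this thread; try --nwindows and
# --concordance_ratio_cutoff values below their defaults
haphic pipeline asm.fa HiC.filtered.bam 32 \
    --remove_allelic_links 4 \
    --nwindows 30 \
    --concordance_ratio_cutoff 0.1 \
    --threads 40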
Hi @zengxiaofei, is it possible to improve mapping accuracy by correcting Pore-C reads with herro (https://github.com/lbcb-sci/herro)?
Hi @csxie-666,
As an update to this issue, I now recommend using minimap2 rather than falign for mapping Pore-C reads. HapHiC now supports the BAM file output by minimap2 via a Python script.
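A rough sketch of the mapping step (the file names are placeholders and the exact flags are illustrative assumptions, apart from -F 3340, which mirrors the filtering used earlier in this thread; see the updated README for the recommended practice):
# Sketch only: map Pore-C reads with minimap2's ONT preset, then apply the
# same SAM-flag filter (-F 3340) used earlier in this thread
minimap2 -t 28 -ax map-ont asm.fa porec_reads.fastq.gz \
    | samtools view - -b -F 3340 -o porec.bam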
According to the author's description, herro was trained exclusively on human data (https://github.com/lbcb-sci/herro/issues/23). I'm not sure whether it generalizes to other species and ploidy levels. I'm trying out another haplotype-aware ONT read correction tool called DeChat on ONT ultra-long reads. This tool has been validated on simulated autopolyploids. However, I have encountered a difficulty in running the program (refer to this issue: https://github.com/LuoGroup2023/DeChat/issues/3).
It is worth noting that Pore-C data may significantly differ from ONT whole-genome sequencing data in the context of read correction. This is because Pore-C reads consist of sequences originating from different loci, resulting in lower coverage between overlapping Pore-C reads compared to ONT whole-genome sequencing reads.
Best, Xiaofei
I have included the best practice for handling Pore-C reads in the updated README.md.