HapHiC
Pore-C data
Hello, @zengxiaofei
Can HapHiC take Pore-C data as input? If not, what adjustments do I need to make?
Thank you in advance for your response
Hi @socialhang,
I have a plan to support Pore-C data, but it is not a high priority at the moment. I haven't conducted any tests on Pore-C data yet, so I am unable to provide any useful information about it at this time.
Best regards, Xiaofei
Hi @socialhang,
I have conducted an experiment on the test data of https://github.com/epi2me-labs/wf-pore-c. You can try the following steps with your own FASTA and FASTQ files:
(1) Execute the wf-pore-c pipeline to align Pore-C data and create a mock paired-end BAM file:
source ~/software/miniconda3/bin/activate nextflow
# modify --fastq, --ref, --cutter, and --chunk_size
nextflow run epi2me-labs/wf-pore-c \
--fastq /work/software/genome/wf-pore-c-265a1a4/test_data/porec_test.concatemers.fastq --chunk_size 100 \
--ref /work/software/genome/wf-pore-c-265a1a4/test_data/porec_test.fasta \
--cutter NlaIII \
--paired_end_minimum_distance 100 --paired_end_maximum_distance 200 --hi_c --mcool --paired_end
conda deactivate
Then, you will obtain a paired-end BAM file named null.ns.bam in the output/paired_end directory.
(2) Rename the Hi-C reads in null.ns.bam using the custom script rename_reads_for_bam.py, and filter the Hi-C reads:
python3 rename_reads_for_bam.py null.ns.bam | samtools view - -@ 14 -S -h -b -F 3340 -o HiC.bam
/path/to/HapHiC/utils/filter_bam.py HiC.bam 1 --threads 14 | samtools view -b -@ 14 -o HiC.filtered.bam
Finally, you can use HiC.filtered.bam for the HapHiC scaffolding pipeline.
Here is the custom script: rename_reads_for_bam.py.gz
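Conceptually, the renaming can be sketched like this. This is a rough illustration only, not the actual script logic: it assumes read1 always precedes its mate in the name-sorted BAM and that appending a running pair index is all that is needed. For real runs, please use the attached rename_reads_for_bam.py:
# Sketch only: append a running pair index to each QNAME so every mock
# read pair gets a unique name; a new pair starts at each read1 (FLAG 0x40)
samtools view -h null.ns.bam \
    | awk -F'\t' -v OFS='\t' '/^@/ {print; next}
        {if (int($2 / 64) % 2 == 1) n++;
         $1 = $1 "_" n; print}' \
    | samtools view - -b -o renamed.bam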
Please be aware that this method has not been fully tested. I welcome your feedback on any attempts you make. This will be of great help. Once this method is validated, I will integrate it into HapHiC.
Best regards, Xiaofei
Thanks a lot!
I will try today!
Hello! @zengxiaofei
I'm back! I'm so sorry, the wf-pore-c software could not run on my server. So could I use falign instead of wf-pore-c to produce the BAM file?
Hi @socialhang,
You could try it. It seems that falign can produce pairwise SAM files that are compatible with SALSA2 via the sam2salsa2 command in u4falign.
Best, Xiaofei
OK!
Hello! @zengxiaofei
I produced a pairwise BAM file using falign, but some errors happened:
Traceback (most recent call last):
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 483, in <module>
main()
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 464, in main
inflation = haphic_cluster(args)
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 323, in haphic_cluster
HapHiC_cluster.run(args, log_file=LOG_FILE)
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_cluster.py", line 2361, in run
flank_link_matrix, frag_index_dict = dict_to_matrix(
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_cluster.py", line 283, in dict_to_matrix
shape = len(frag_set)
TypeError: object of type 'NoneType' has no len()
Traceback (most recent call last):
File "/root/software/HapHiC_1.0.2/haphic", line 96, in <module>
subprocess.run(commands, check=True)
File "/root/miniconda3/envs/haphic/lib/python3.10/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py', 'hifiasm_hic_ul_p_utg.fa', 'small.1.bam', '32', '--remove_allelic_links', '4', '--Nx', '100', '--gfa', 'hifiasm.asm.hic.p_utg.gfa', '--min_RE_sites', '1', '--min_links', '1', '--min_link_density', '0', '--threads', '40', '--max_ctg_len', '400000']' returned non-zero exit status 1.
I think it's because of my BAM format.
small.1.bam.gz
Can you help me point out the errors in my BAM file?
Thanks a lot!
Hi @socialhang,
Could you please paste or upload 01.cluster/HapHiC_cluster.log?
Best, Xiaofei
OK, wait a minute
Is the small.1.bam that you uploaded the same one you used for HapHiC scaffolding, or just a portion of it?
Since my BAM file is 400 GB in size, small.1.bam is just a small subset that I extracted from it. In addition, I have also run HapHiC on this subset, and the error it reports is exactly the same as with the full data.
Please use this script (modify_bam_for_falign.py.gz) to modify the BAM file:
python3 modify_bam_for_falign.py small.1.bam | samtools view -b -o new.bam
OK! Thanks! I will try!
Some errors happened:
2023-12-25 09:02:22 <HapHiC_pipeline.py> [main] Pipeline started, HapHiC version: 1.0.2 (update: 2023.12.16)
2023-12-25 09:02:22 <HapHiC_pipeline.py> [main] Python version: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
2023-12-25 09:02:22 <HapHiC_pipeline.py> [haphic_cluster] Step1: Execute preprocessing and Markov clustering for contigs...
2023-12-25 09:02:22 <HapHiC_cluster.py> [run] Program started, HapHiC version: 1.0.2 (update: 2023.12.16)
2023-12-25 09:02:22 <HapHiC_cluster.py> [run] Python version: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
2023-12-25 09:02:22 <HapHiC_cluster.py> [parse_fasta] Parsing input FASTA file...
2023-12-25 09:02:34 <HapHiC_cluster.py> [parse_gfa] Parsing input gfa file(s)...
2023-12-25 09:02:38 <HapHiC_cluster.py> [stat_fragments] Making some statistics of fragments (contigs / bins)
2023-12-25 09:02:38 <HapHiC_cluster.py> [stat_fragments] bin_size is calculated to be 2000000 bp
2023-12-25 09:02:41 <HapHiC_cluster.py> [parse_bam] Parsing input BAM file...
Traceback (most recent call last):
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 483, in <module>
main()
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 464, in main
inflation = haphic_cluster(args)
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 323, in haphic_cluster
HapHiC_cluster.run(args, log_file=LOG_FILE)
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_cluster.py", line 2304, in run
full_link_dict, flank_link_dict, HT_link_dict, clm_dict, frag_link_dict, ctg_coord_dict, ctg_pair_to_frag = parse_bam(
File "/root/software/HapHiC_1.0.2/scripts/HapHiC_cluster.py", line 1481, in parse_bam
frag_len_i, frag_len_j = frag_len_dict[frag_i], frag_len_dict[frag_j]
KeyError: 'utg000028l_bin4'
Traceback (most recent call last):
File "/root/software/HapHiC_1.0.2/haphic", line 96, in <module>
subprocess.run(commands, check=True)
File "/root/miniconda3/envs/haphic/lib/python3.10/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py', 'hifiasm_hic_ul_p_utg.fa', 'ul9.HiC.filter.correct0.bam', '32', '--remove_allelic_links', '4', '--Nx', '100', '--gfa', 'hifiasm.asm.hic.p_utg.gfa', '--min_RE_sites', '1', '--min_links', '1', '--min_link_density', '0', '--threads', '40', '--max_ctg_len', '400000']' returned non-zero exit status 1.
@zengxiaofei
It seems that a position in the BAM file (the fourth column, pos, or the eighth column, mpos) exceeds the length of the contig.
The log shows that the bin_size was calculated to be 2 Mb, and the error occurred when the mapping position of a read in the BAM file fell within the range [6,000,001, 8,000,000] (bin ID utg000028l_bin4). However, based on the BAM file you uploaded, the contig utg000028l is only 4,798,928 bp in length.
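You can verify this with a quick check along the following lines (a sketch only; adjust the file name as needed):
# Extract contig names and lengths from the @SQ header lines
samtools view -H small.1.bam \
    | awk -F'\t' '$1 == "@SQ" {
          for (i = 2; i <= NF; i++) {
              if ($i ~ /^SN:/) name = substr($i, 4)
              if ($i ~ /^LN:/) len = substr($i, 4)
          }
          print name "\t" len
      }' > contig_lengths.txt
# Report alignments whose pos (column 4) or mpos (column 8) exceeds the contig length
samtools view small.1.bam \
    | awk -F'\t' 'NR == FNR {len[$1] = $2; next}
          $4 > len[$3] {print "pos out of range:", $1, $3, $4}
          $8 > len[$7 == "=" ? $3 : $7] {print "mpos out of range:", $1, $7, $8}' \
      contig_lengths.txt -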
Hi! @zengxiaofei
I've found the problem and updated the script: correct_sam.py.zip. Now the process runs normally, but homologous chromosomes are clustered together in the results, and I think the signal between the different haplotypes has not been filtered out cleanly. How can I adjust the parameters?
Could you please provide a full view of the Hi-C contact heatmap that includes scaffold boundaries (called superscaffold in Juicebox, and surrounded by blue outlines)?
Could you give me an email address? I will send you the Hi-C heatmap privately. @zengxiaofei
You could send email to xiaofei_zeng#whu.edu.cn (replace # with @).
@socialhang or @zengxiaofei, have you managed to adjust the parameters to improve your contact map and distinguish between the different haplotypes?
Hi @LikeLucasLab,
The primary challenge when working with Pore-C data lies in achieving accurate mapping. We conducted some tests using falign and adapted HapHiC to be compatible with this aligner through a Python script. Although Pore-C data performed comparably to NGS-based Hi-C data in scaffolding haplotype-collapsed assemblies, it exhibited much lower mapping accuracy when it came to distinguishing haplotypes, resulting in much stronger Hi-C signals between haplotigs. We tried different MAPQ filtering criteria, but the results did not improve significantly.
There may be some adjustments in HapHiC for Pore-C data in future updates. When using the current version of HapHiC to scaffold haplotype-phased assemblies, please always add the --remove_allelic_links parameter to remove Hi-C links between allelic contigs. A lower --nwindows or --concordance_ratio_cutoff value could also potentially improve the scaffolding results; see the sketch below.
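For example, a command along these lines (the file names and option values here are illustrative placeholders rather than tested recommendations):
# Illustrative only: 32 is the expected number of chromosome-level scaffolds,
# and 4 is the number of haplotypes passed to --remove_allelic_links, matching
# the commands earlier in this thread; try --nwindows and
# --concordance_ratio_cutoff values below their defaults
haphic pipeline asm.fa HiC.filtered.bam 32 \
    --remove_allelic_links 4 \
    --nwindows 30 \
    --concordance_ratio_cutoff 0.1 \
    --threads 40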
Hi @zengxiaofei, is it possible to improve mapping accuracy by correcting Pore-C reads with herro (https://github.com/lbcb-sci/herro)?
Hi @csxie-666,
As an update to this issue, I now recommend using minimap2 rather than falign for mapping Pore-C reads. HapHiC now supports the BAM file output by minimap2 via a Python script.
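A rough sketch of the mapping step (the file names are placeholders and the exact flags are illustrative assumptions, apart from -F 3340, which mirrors the filtering used earlier in this thread; see the updated README for the recommended practice):
# Sketch only: map Pore-C reads with minimap2's ONT preset, then apply the
# same SAM-flag filter (-F 3340) used earlier in this thread
minimap2 -t 28 -ax map-ont asm.fa porec_reads.fastq.gz \
    | samtools view - -b -F 3340 -o porec.bam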
According to the author's description, herro was trained exclusively on human data (https://github.com/lbcb-sci/herro/issues/23). I'm not sure whether it generalizes to other species and ploidy levels. I'm trying out another haplotype-aware ONT read correction tool called DeChat on ONT ultra-long reads. This tool has been validated on simulated autopolyploids. However, I have encountered a difficulty in running the program (refer to this issue: https://github.com/LuoGroup2023/DeChat/issues/3).
It is worth noting that Pore-C data may significantly differ from ONT whole-genome sequencing data in the context of read correction. This is because Pore-C reads consist of sequences originating from different loci, resulting in lower coverage between overlapping Pore-C reads compared to ONT whole-genome sequencing reads.
Best, Xiaofei
I have included the best practice for handling Pore-C reads in the updated README.md.