ALLHiC run partition.pl error

Hi, when I ran the Perl script partition.pl following the tutorial, I got these errors:

Use of uninitialized value $ctg in hash element at ~/01_software/ALLHiC/scripts/partition.pl line 66, <IN> line 11.
Use of uninitialized value $ctg in hash element at ~/01_software/ALLHiC/scripts/partition.pl line 66, <IN> line 14.
Use of uninitialized value $ctg in hash element at ~01_software/ALLHiC/scripts/partition.pl line 66, <IN> line 22.
....
Use of uninitialized value $ctg in hash element at ~/01_software/ALLHiC/scripts/partition.pl line 66, <IN> line 45577.
[faidx] Could not build fai index wrk_dir/13728379/seq.fasta.fai
[main_samview] fail to read the header from "wrk_dir/13728379/sample.clean.sam".

I don't know how to solve this. I would greatly appreciate any advice on the issue. Best wishes!

Jun 19 '21 14:06 linshengnan09

Hi @linshengnan09, it seems that the contig names are inconsistent between the allele table and input.fasta. Could you tell us how you generated the allele table? I would recommend a GMAP-based approach to generate the Allele.ctg.table. After that, you can use the improved partition script (https://github.com/tangerzhang/ALLHiC/blob/master/scripts/partition_gmap.py) to split homologous groups.
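To confirm this kind of mismatch before rerunning the partition step, a quick check along the following lines may help. This is only a minimal sketch, not part of ALLHiC: the function names `fasta_ids` and `missing_contigs` are made up for illustration, and it assumes the allele table is whitespace-separated with two leading columns (chromosome and position) followed by contig names. Adjust `first_ctg_col` if your table is laid out differently.

```python
import io


def fasta_ids(handle):
    """Collect sequence IDs (the first word after '>') from a FASTA stream."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def missing_contigs(table_handle, known_ids, first_ctg_col=2):
    """Return contig names used in the allele table but absent from the FASTA.

    Assumption: columns before `first_ctg_col` are not contig names
    (here: chromosome and position); every later column is a contig name.
    """
    missing = set()
    for line in table_handle:
        fields = line.split()
        for name in fields[first_ctg_col:]:
            if name not in known_ids:
                missing.add(name)
    return sorted(missing)


if __name__ == "__main__":
    # Tiny in-memory stand-ins for input.fasta and Allele.ctg.table.
    fasta = io.StringIO(">ctg1 len=10\nACGT\n>ctg2\nACGT\n")
    table = io.StringIO("Chr01\t100\tctg1\tctg2\nChr01\t200\tctg1\tctg3\n")
    print(missing_contigs(table, fasta_ids(fasta)))
```

With real data, open the two files with `open(...)` and pass the file handles instead of the `io.StringIO` objects. Any name that is reported points to a contig that the table refers to but the FASTA does not contain, for example because of a suffix or a renamed header.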

Jun 20 '21 04:06 tangerzhang

I generated the allele table using the GMAP-based approach as follows:

gmap_build -D . -d DB target.genome
gmap -D . -d DB -t 12 -f 2 -n 2 reference.cds.fasta > gmap.gff3
awk '$3 == "gene"' reference.gff | awk 'BEGIN{FS="\t|=|;";OFS="\t"}{print $1,$4,$5,$12".1"}' > target.bed
perl gmap2AlleleTableBED.modify.pl target.bed

As you suggested, I used the improved partition script (https://github.com/tangerzhang/ALLHiC/blob/master/scripts/partition_gmap.py) to split homologous groups, and it ran well.

But I have two other questions.
1) In your example of scaffolding an autopolyploid sugarcane genome, the step that separates homologous groups to reduce scaffolding complexity uses Allele.gene.table, while the step that builds superscaffolds with the ALLHiC pipeline uses Allele.ctg.table. They are the same table, right?
2) When I use the script partition_gmap.pl to generate the wrk_dir directory, each chromosome gets its own folder containing a list file and a sequence file. How should I run the prune step? Should I combine the seq.fasta files of all chromosomes?
Thanks!

Jun 21 '21 13:06 linshengnan09

Hi @linshengnan09, I apologize for the confusion. Allele.gene.table is not the same as Allele.ctg.table, and we will no longer use Allele.gene.table in our next release. I have also corrected the misleading parameters in the partition_gmap.py and partition_gmap.pl scripts. For the second question, ALLHiC_prune and the subsequent scaffolding steps can be run individually in each folder with the same Allele.ctg.table. After that, you can use dot-plot analysis to check the phasing results, as we did in the sugarcane project (Figure 1 in Zhang et al., Nature Genetics, 2018).
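Running prune folder by folder could be scripted roughly as below. Treat this as a hedged sketch rather than the official procedure: `prune_commands` is a hypothetical helper, the `-i`/`-b`/`-r` options follow the ALLHiC_prune usage as I recall it from the README and should be checked against the usage message of your installed version, and the per-folder file names (`seq.fasta`, and especially `sample.clean.bam`) are assumptions about what your partition step produced.

```python
from pathlib import Path
import subprocess
import tempfile


def prune_commands(wrk_dir, allele_table,
                   bam_name="sample.clean.bam", fasta_name="seq.fasta"):
    """Build one ALLHiC_prune command per chromosome folder under wrk_dir.

    Returns a list of (folder, argv) pairs. Folders without `fasta_name`
    are skipped. The same allele table is shared by every folder, so its
    path is made absolute; BAM and FASTA are given relative to the folder.
    """
    table = str(Path(allele_table).resolve())
    commands = []
    for folder in sorted(p for p in Path(wrk_dir).iterdir() if p.is_dir()):
        if not (folder / fasta_name).exists():
            continue
        # Assumed flags: -i allele table, -b BAM file, -r reference FASTA.
        argv = ["ALLHiC_prune", "-i", table, "-b", bam_name, "-r", fasta_name]
        commands.append((folder, argv))
    return commands


def run_all(commands):
    """Run each command inside its own folder, stopping at the first failure."""
    for folder, argv in commands:
        subprocess.run(argv, cwd=folder, check=True)


if __name__ == "__main__":
    # Dry run on a throwaway directory: print the commands, do not execute.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("Chr01", "Chr02"):
            (Path(tmp) / name).mkdir()
            (Path(tmp) / name / "seq.fasta").write_text(">ctg1\nACGT\n")
        for folder, argv in prune_commands(tmp, "Allele.ctg.table"):
            print(folder.name, " ".join(argv))
```

Printing the commands first and only then calling `run_all` keeps a mistake in a file name from launching a long job in every folder.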

Jun 22 '21 02:06 tangerzhang

Hi @tangerzhang, as you suggested, I ran the scaffolding steps in each folder individually. The Chromosomes1 folder alone has generated a 14G removedb_Allele.txt, a 217G removedb_nonBest.txt and an 884G log.txt, and it has not finished yet. Is that normal? Thanks!

Jun 22 '21 14:06 linshengnan09

Prune.tar.gz

Hi @linshengnan09, yes, it is normal. The prune step may take considerable time and disk space when the BAM file is big. However, during the development of ALLHiC2, we have tried to speed up the prune step by using htslib. The attachment is an improved version of ALLHiC_prune2. Could you please test it and let us know if there is any problem with this version? Installation is simple: just type make. Thanks!

Jun 23 '21 01:06 tangerzhang

@linshengnan09 BTW, are you working on a simple diploid genome or a complex polyploid genome that needs a haplotype-resolved assembly? If it is a simple diploid genome, such as rice or tomato, the prune step should be omitted.

Jun 23 '21 01:06 tangerzhang

Hi @tangerzhang, I would be honored to try it. The genome is diploid with 81% repeat sequences, the genome size is ~1G, and the heterozygosity rate is about 0.36%. I think it is a complex genome. I have tried the 3d-DNA pipeline and the ALLHiC pipeline for a simple diploid genome, but they didn't work well.

Jun 23 '21 02:06 linshengnan09

Actually, the heterozygosity rate is only 0.36%, so you do not need a phased assembly. In other words, prune will not be helpful. Would you like to try a wrapper script for the ALLHiC pipeline (https://github.com/tangerzhang/ALLHiC/blob/master/bin/ALLHiC_pip.sh), which includes the read mapping, contig correction, partition, optimize and build functions? This pipeline is suitable only for diploid genomes that do not need phasing.

Jun 23 '21 02:06 tangerzhang

May I ask you further questions by email? Thanks!

Jun 23 '21 02:06 linshengnan09

Sure, no problem. My email address is: [email protected]

Jun 23 '21 03:06 tangerzhang
