ALLHiC Why Partition contigs into only one group ?

trafficstars

Hi Dr.zhang Thanks your great work in ALLHIC ! I am following the step in ALLHiC: scaffolding of a simple diploid genome , and mapping my reads by using the code in https://github.com/tangerzhang/ALLHiC/wiki#running-allhic. After many times, it stills remain uncorrect group number :

ALLHiC_partition -b allhic_clean.bam -r merged_out.fasta -e AAGCTT -k 39
Extract function: calculate an empirical distribution of Hi-C link size based on intra-contig links
CMD: allhic extract allhic_clean.bam merged_out.fasta --RE AAGCTT
18:20:14 writeRE | NOTICE  RE counts in 132 contigs (total: 418042, avg 1 per 2694 bp) written to `allhic_clean.counts_AAGCTT.txt`
18:20:14 extractContigLinks | NOTICE  Parse bamfile `allhic_clean.bam`
18:22:31 extractContigLinks | NOTICE  Extracted 101 intra-contig link groups (total = 23799910)
18:22:49 extractContigLinks | NOTICE  Extracted 5304 inter-contig groups to `allhic_clean.clm` (total = 56210700, maxLinks = 621880, minLinks = 3)
18:22:50 fitPowerLaw | NOTICE  Power law Y = 0.0007 * X ^ -0.9082
18:22:50 writeDistribution | NOTICE  Link size distribution written to `allhic_clean.distribution.txt`
18:22:50 readClmLines | NOTICE  Parse clmfile `allhic_clean.clm`
18:22:54 calcInterContigs | NOTICE  Contig pair analyses written to `allhic_clean.pairs.txt`
18:22:54 Run | NOTICE  Success
Partition contigs based on prunning bam file
CMD: allhic partition allhic_clean.counts_AAGCTT.txt allhic_clean.pairs.txt 39 --minREs 25
18:22:54 ReadCSVLines | NOTICE  Parse csvfile `allhic_clean.counts_AAGCTT.txt`
18:22:54 readRE | NOTICE  Loaded 132 contig RE lengths for normalization from `allhic_clean.counts_AAGCTT.txt`
18:22:54 skipContigsWithFewREs | NOTICE  skipContigsWithFewREs with MinREs = 25 (RE = clean.counts)
Contig #0 (ctg000000_87047) has 2 RE sites -> MARKED SHORT
Contig #3 (ctg000120_360759.1) has 3 RE sites -> MARKED SHORT
Contig #10 (ctg000260_231929) has 20 RE sites -> MARKED SHORT
Contig #13 (ctg000390_209632) has 17 RE sites -> MARKED SHORT
Contig #15 (ctg000480_218654.1) has 7 RE sites -> MARKED SHORT
Contig #16 (ctg000570_166868) has 15 RE sites -> MARKED SHORT
Contig #17 (ctg000580_268020) has 21 RE sites -> MARKED SHORT
Contig #20 (ctg000640_183875) has 21 RE sites -> MARKED SHORT
Contig #21 (ctg000650_96535) has 9 RE sites -> MARKED SHORT
Contig #22 (ctg000660_161107.1) has 1 RE sites -> MARKED SHORT
Contig #24 (ctg000680_114460) has 3 RE sites -> MARKED SHORT
Contig #29 (ctg000750_219888) has 9 RE sites -> MARKED SHORT
Contig #30 (ctg000760_138168) has 2 RE sites -> MARKED SHORT
Contig #35 (ctg000900_151556) has 22 RE sites -> MARKED SHORT
Contig #36 (ctg000950_133862.1) has 3 RE sites -> MARKED SHORT
Contig #37 (ctg000960_183813) has 7 RE sites -> MARKED SHORT
Contig #39 (ctg001030_62866) has 5 RE sites -> MARKED SHORT
Contig #43 (ctg001190_28810) has 6 RE sites -> MARKED SHORT
Contig #45 (ctg001250_34616) has 7 RE sites -> MARKED SHORT
Contig #48 (ctg001420_232259) has 17 RE sites -> MARKED SHORT
Contig #56 (ctg001600_69605) has 18 RE sites -> MARKED SHORT
Contig #57 (ctg001610_249986.1) has 20 RE sites -> MARKED SHORT
Contig #61 (ctg001680_82745) has 1 RE sites -> MARKED SHORT
Contig #62 (ctg001690_174113) has 2 RE sites -> MARKED SHORT
Contig #63 (ctg001700_219832) has 1 RE sites -> MARKED SHORT
Contig #65 (ctg001770_88761) has 8 RE sites -> MARKED SHORT
Contig #67 (ctg000030_375493) has 6 RE sites -> MARKED SHORT
Contig #77 (ctg000300_247109) has 1 RE sites -> MARKED SHORT
Contig #78 (ctg000370_1125237) has 22 RE sites -> MARKED SHORT
Contig #88 (ctg000840_228019) has 1 RE sites -> MARKED SHORT
Contig #89 (ctg000850_142256) has 14 RE sites -> MARKED SHORT
Contig #90 (ctg000880_246392) has 1 RE sites -> MARKED SHORT
Contig #91 (ctg000930_161761) has 14 RE sites -> MARKED SHORT
Contig #99 (ctg001270_192646) has 2 RE sites -> MARKED SHORT
Contig #103 (ctg001570_357441) has 3 RE sites -> MARKED SHORT
Contig #118 (ptg000109l) has 10 RE sites -> MARKED SHORT
Contig #120 (ptg000124l) has 14 RE sites -> MARKED SHORT
Contig #128 (ptg000371l) has 3 RE sites -> MARKED SHORT
Contig #130 (ptg000458l) has 1 RE sites -> MARKED SHORT
18:22:54 skipContigsWithFewREs | NOTICE  Marked 39 contigs (avg 8.7 RE sites, len 270937) since they contain too few REs (MinREs = 25)
18:22:54 ReadCSVLines | NOTICE  Parse csvfile `allhic_clean.pairs.txt`
18:22:54 skipRepeats | NOTICE  skipRepeats with multiplicity = 2
Contig #6 (ctg000210_3318479.1) has 2.6x the average number of Hi-C links -> MARKED REPETITIVE
Contig #14 (ctg000400_3549130.1) has 2.1x the average number of Hi-C links -> MARKED REPETITIVE
Contig #49 (ctg001430_1803570.1) has 2.1x the average number of Hi-C links -> MARKED REPETITIVE
Contig #52 (ctg001510_182939) has 2.8x the average number of Hi-C links -> MARKED REPETITIVE
Contig #65 (ctg001770_88761) has 2.0x the average number of Hi-C links -> MARKED REPETITIVE
Contig #75 (ctg000220_2289644) has 3.0x the average number of Hi-C links -> MARKED REPETITIVE
Contig #78 (ctg000370_1125237) has 2.2x the average number of Hi-C links -> MARKED REPETITIVE
Contig #88 (ctg000840_228019) has 3.1x the average number of Hi-C links -> MARKED REPETITIVE
Contig #89 (ctg000850_142256) has 2.8x the average number of Hi-C links -> MARKED REPETITIVE
Contig #91 (ctg000930_161761) has 2.0x the average number of Hi-C links -> MARKED REPETITIVE
Contig #103 (ctg001570_357441) has 2.4x the average number of Hi-C links -> MARKED REPETITIVE
Contig #105 (ctg001710_930594) has 2.3x the average number of Hi-C links -> MARKED REPETITIVE
18:22:54 skipRepeats | NOTICE  Marked 12 contigs (avg len 1309990) as repetitive (MaxLinkDensity = 2)
18:22:54 Cluster | NOTICE  Clustering starts with 132 (87 informative) contigs with target of 39 clusters
18:22:54 Cluster | NOTICE  Merge #50: Clusters  82 + 98 -> 181, Linkage = 1.1223572e+07
18:22:54 Cluster | NOTICE  No more merges to do since the queue is empty
18:22:54 setClusters | NOTICE  setClusters summary (NonInformativeRatio = 3): nPassRatio = 45, nFailRatio = 0, nFailCluster=0
18:22:54 printClusters | NOTICE  Write 1 partitions to `allhic_clean.clusters.txt`
18:22:54 writeRE | NOTICE  RE counts in 132 contigs (total: 418042, avg 1 per 2694 bp) written to `allhic_clean.counts_AAGCTT.39g1.txt`
18:22:54 Run | NOTICE  Success
$ ll

-rw-rw-r-- 1 poultrylab1 poultrylab1 3.6K Jan 14 18:22 allhic_clean.counts_AAGCTT.39g1.txt
-rw-rw-r-- 1 poultrylab1 poultrylab1 2.2K Jan 14 18:22 allhic_clean.clusters.txt
-rw-rw-r-- 1 poultrylab1 poultrylab1 273K Jan 14 18:22 allhic_clean.pairs.txt
-rw-rw-r-- 1 poultrylab1 poultrylab1  11K Jan 14 18:22 allhic_clean.distribution.txt
-rw-rw-r-- 1 poultrylab1 poultrylab1 486M Jan 14 18:22 allhic_clean.clm
-rw-rw-r-- 1 poultrylab1 poultrylab1 3.6K Jan 14 18:20 allhic_clean.counts_AAGCTT.txt
drwxrwxr-x 5 poultrylab1 poultrylab1  164 Jan 14 18:11 map
-rw-rw-r-- 1 poultrylab1 poultrylab1 6.7G Jan 14 18:09 allhic_clean.bam
-rw-rw-r-- 1 poultrylab1 poultrylab1 9.4M Jan 13 04:51 merged_out.fasta.near_AAGCTT.500.bed
-rw-rw-r-- 1 poultrylab1 poultrylab1 3.5M Jan 13 04:51 merged_out.fasta.pos_of_AAGCTT.txt
-rw-rw-r-- 1 poultrylab1 poultrylab1 1.4K Jan 13 04:09 map.sh
drwxrwxr-x 2 poultrylab1 poultrylab1   53 Jan  7 19:48 partion
-rw-rw-r-- 1 poultrylab1 poultrylab1 538M Jan  7 01:18 merged_out.fasta.sa
-rw-rw-r-- 1 poultrylab1 poultrylab1   17 Jan  7 01:12 merged_out.fasta.amb
-rw-rw-r-- 1 poultrylab1 poultrylab1 5.9K Jan  7 01:12 merged_out.fasta.ann
-rw-rw-r-- 1 poultrylab1 poultrylab1 269M Jan  7 01:12 merged_out.fasta.pac
-rw-rw-r-- 1 poultrylab1 poultrylab1 1.1G Jan  7 01:12 merged_out.fasta.bwt
-rw-rw-r-- 1 poultrylab1 poultrylab1 6.4K Jan  7 01:11 merged_out.fasta.fai
lrwxrwxrwx 1 poultrylab1 poultrylab1   42 Jan  7 00:59 merged_out.fasta -> ../../6.scaffolding/third/merged_out.fasta

Three repeat samples were mapping by Aligning Hi-C reads to the draft assembly & Filtering SAM file in https://github.com/tangerzhang/ALLHiC/wiki#running-allhic, then I merged three xxx.clean.bam into allhic_clean.bam by samtools merge. It reported No @HD tag found.

I am not sure where I use the wrong way, but I think something must go wrong. Such as, my HiC data is PE150, it seems bwa aln not the best way to map.

Thanks in advance !!! Waiting for your reply~

Jan 14 '21 03:01 Johnsonzcode

Hi @Johnsonzcode This situation happens when the initial assembly has many chimeric contigs. You can first correct the contigs using 3D-DNA or ALLHiC_corrector. If you are using ALLHiC_corrector, the following steps may be useful:

mapping Hi-C reads to the initial contig assembly (bwa mem is prefered)
sort and index the bam file
run ALLHiC_corrector (requires pysam and numpy installed)

ALLHiC_corrector.py -m mapping.sorted.bam -r seq.fasta -o seq.corrected.fasta -t 48

run ALLHiC scaffolding following the guideline: https://github.com/tangerzhang/ALLHiC/wiki/ALLHiC:-scaffolding-of-a-simple-diploid-genome

Jan 14 '21 06:01 tangerzhang

Thank your for your quick reply! I am not sure about Filtering SAM file in https://github.com/tangerzhang/ALLHiC/wiki, which tell me skip this step if you are using bwa mem for alignment :

PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
(*)filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam

but if I skip this step, wheresample.clean.sam in samtools view should be repalced by sample.bwa_mem.REduced.paired_only.bam or what ? I am new about this.

Jan 15 '21 05:01 Johnsonzcode

Yes, you are right. If you skip the filtering step, you can use sample.bwa_mem.REduced.paired_only.bam instead. Alternatively, if you are working on a genome with a high level of repetitive sequences, you may need to select uniquely mapped reads via samtools view -q 40

Jan 17 '21 16:01 tangerzhang

Thank you sincerely !

Jan 18 '21 00:01 Johnsonzcode

Sorry for reopening it. When I try to correct contigs, mapping HiC reads to draft assembly confused me, in which bwa mem:

I am not sure using bwa-mem with paird-end mode or single-end mode, I choose single-end mode , am I right ?
But it results two sam or bam file , should I merge them with samtools ?
Then follow your code : https://github.com/tangerzhang/ALLHiC/issues/75#issuecomment-759946162 , right ?

Feb 18 '21 07:02 Johnsonzcode

Sorry for reopening it. When I try to correct contigs, mapping HiC reads to draft assembly confused me, in which bwa mem:

I am not sure using bwa-mem with paird-end mode or single-end mode, I choose single-end mode , am I right ?

But it results two sam or bam file , should I merge them with samtools ?

Then follow your code : #75 (comment) , right ?

bwa-mem with paired-end mode should be used.

Feb 18 '21 08:02 tangerzhang

Thank you for your help !

And I have used HiC-Pro, which use bowtie2 with single-end mode to map HiC-reads, perhaps it will generate more valid reads by cutting reads. I am curious about why ALLHIC just use bwa-mem with piared-end mode ?

Happy new year!

Feb 18 '21 08:02 Johnsonzcode

ALLHiC_correct takes bwa-mem with paired-end reads alignment but we have not tested it on bowtie2 with single-end mode. Please let us know if it works on bowtie2 mapping. Otherwise, we can fix this bug and make it compatible with bowtie2 mapping.

Feb 19 '21 12:02 tangerzhang

Ok, I will try. I rerun bwa-mem to get bam file as your suggestion, after correcting my draft, ALLHiC_partition step shows log like log.txt, and still only got 1 group.

Feb 19 '21 13:02 Johnsonzcode

@Johnsonzcode After correction with 3D-DNA or ALLHiC_corrector, you should rebuild index with output from corrector.

Feb 19 '21 23:02 wyim-pgl

@wyim-pgl Thank you, I guess you mean samtools faidx and bwa index my correction assembly ? Indeed, I forgot to index last time. But still got 1 group after indexing it. Here is my code :

draft=hybrid_m_all.fasta
bwa index $draft && echo "bwa index done"
samtools faidx $draft && echo "samtools faidx done"

for i in `ls sample1/*_R1.fastq.gz`;
do
ii=`basename $i`
mkdir ${ii/%_R1.fastq.gz}
bwa mem -t 150 $draft $i ${i/%_R1.fastq.gz/_R2.fastq.gz} > ${ii/%_R1.fastq.gz}/${ii/%_R1.fastq.gz}.sam && echo "mapping done"
PreprocessSAMs.pl ${ii/%_R1.fastq.gz}/${ii/%_R1.fastq.gz}.sam $draft HINDIII && echo "preprocessing done"

samtools view -@ 150 -bt ${draft}.fai ${ii/%_R1.fastq.gz}/${ii/%_R1.fastq.gz}.REduced.paired_only.bam > ${ii/%_R1.fastq.gz}/${ii/%_R1.fastq.gz}.clean.bam && echo "cleanning down"
done
samtools sort -o RHC01168_L8_sort.bam -@ 12 RHC01168_L8.bam && samtools index RHC01168_L8_sort.bam

ALLHiC_corrector.py -m RHC01168_L8_sort.bam -r hybrid_m_all.fasta -o hybrid_m_all_cor.fasta -t 48
samtools faidx hybrid_m_all_cor.fasta && bwa index hybrid_m_all_cor.fasta
ALLHiC_partition -b RHC01168_L8.clean.bam -r hybrid_m_all_cor.fasta -e AAGCTT -k 41

Here I got :

-rw-rw-r-- 1 278 Feb 20 17:37 RHC01168_L8.clean.counts_AAGCTT.41g1.txt
-rw-rw-r-- 1 192 Feb 20 17:37 RHC01168_L8.clean.clusters.txt
-rw-rw-r-- 1 0 Feb 20 17:35 RHC01168_L8.clean.clm
-rw-rw-r-- 1 187K Feb 20 17:35 RHC01168_L8.clean.counts_AAGCTT.txt
-rw-rw-r-- 1  607M Feb 20 17:34 hybrid_m_all_cor.fasta.sa
-rw-rw-r-- 1 6.1K Feb 20 17:25 hybrid_m_all_cor.fasta.amb
-rw-rw-r-- 1  271K Feb 20 17:25 hybrid_m_all_cor.fasta.ann
-rw-rw-r-- 1 304M Feb 20 17:25 hybrid_m_all_cor.fasta.pac
-rw-rw-r-- 11.2G Feb 20 17:25 hybrid_m_all_cor.fasta.bwt
-rw-rw-r-- 1 272K Feb 20 17:09 hybrid_m_all_cor.fasta.fai
-rw-rw-r-- 1 23K Feb 20 05:10 RHC01168_L8.clean.pairs.txt
-rw-rw-r-- 1  11K Feb 20 05:10 RHC01168_L8.clean.distribution.txt
-rw-rw-r-- 1 347K Feb 20 04:59 log.txt
-rw------- 1  80K Feb 20 03:57 nohup.out
-rw-rw-r-- 1 1.2G Feb 20 03:57 hybrid_m_all_cor.fasta
-rw-rw-r-- 1 4.1M Feb 20 03:00 RHC01168_L8_sort.bam.bai
-rw-rw-r-- 1 88G Feb 20 02:55 RHC01168_L8_sort.bam
-rw-rw-r-- 1 45G Feb 19 20:13 RHC01168_L8.clean.bam
-rw-rw-r-- 1 457 Feb 19 20:10 RHC01168_L8.REduced.paired_only.flagstat
-rw-rw-r-- 1 45G Feb 19 20:00 RHC01168_L8.REduced.paired_only.bam
-rw-rw-r-- 1 45G Feb 19 18:07 RHC01168_L8.REduced.bam
-rw-rw-r-- 1 128G Feb 19 15:12 RHC01168_L8.bam

Feb 20 '21 02:02 Johnsonzcode

hi, Johnsonzcode. What is the size of your genome? How long did 'bwa mem' run?

Feb 24 '21 13:02 Ahahaha3

Hi Ahahaha3, my genome is about 1.2Gb, and it is time-consuming, I don't remember exactly.

Feb 24 '21 13:02 Johnsonzcode

@Johnsonzcode with this output hybrid_m_all_cor.fasta, you need to do bwa index, bwa mem and run ALLHiC again.

Feb 25 '21 19:02 wyim-pgl

@wyim-pgl Do you mean remap the HiC reads to hybrid_m_all_cor.fasta ?

Feb 26 '21 02:02 Johnsonzcode

yes

Feb 26 '21 02:02 wyim-pgl

@wyim-pgl Thank you so much! I got HiC heatmap of my assembly. But I don't know how to interpret it. Is there some wiki to learn about HiC heatmap and assess my assembly. 500K_all_chrs.pdf 500K_Whole_genome.pdf

Feb 27 '21 07:02 Johnsonzcode

Hi Dr.zhang Thanks your great work in ALLHIC ! I got the similar error when running ALLHiC-pip, after two round of mapping, I got only 1 group in ALLHiC-partition even set k=10. The error.log shows "Cluster | NOTICE ESC[0m No more merges to do since the queue is empty", but the group just contains 193 contigs of 1651 contigs .

Jan 28 '22 03:01 lizhao007

ALLHiC ALLHiC copied to clipboard

Why Partition contigs into only one group ?

ALLHiC
ALLHiC copied to clipboard