ALLHiC icon indicating copy to clipboard operation
ALLHiC copied to clipboard

Why Partition contigs into only one group ?

Open Johnsonzcode opened this issue 4 years ago • 18 comments

Hi Dr.zhang Thanks your great work in ALLHIC ! I am following the step in ALLHiC: scaffolding of a simple diploid genome, and mapping my reads by using the code in https://github.com/tangerzhang/ALLHiC/wiki#running-allhic. After many times, it stills remain uncorrect group number :

ALLHiC_partition -b allhic_clean.bam -r merged_out.fasta -e AAGCTT -k 39
Extract function: calculate an empirical distribution of Hi-C link size based on intra-contig links
CMD: allhic extract allhic_clean.bam merged_out.fasta --RE AAGCTT
18:20:14 writeRE | NOTICE  RE counts in 132 contigs (total: 418042, avg 1 per 2694 bp) written to `allhic_clean.counts_AAGCTT.txt`
18:20:14 extractContigLinks | NOTICE  Parse bamfile `allhic_clean.bam`
18:22:31 extractContigLinks | NOTICE  Extracted 101 intra-contig link groups (total = 23799910)
18:22:49 extractContigLinks | NOTICE  Extracted 5304 inter-contig groups to `allhic_clean.clm` (total = 56210700, maxLinks = 621880, minLinks = 3)
18:22:50 fitPowerLaw | NOTICE  Power law Y = 0.0007 * X ^ -0.9082
18:22:50 writeDistribution | NOTICE  Link size distribution written to `allhic_clean.distribution.txt`
18:22:50 readClmLines | NOTICE  Parse clmfile `allhic_clean.clm`
18:22:54 calcInterContigs | NOTICE  Contig pair analyses written to `allhic_clean.pairs.txt`
18:22:54 Run | NOTICE  Success
Partition contigs based on prunning bam file
CMD: allhic partition allhic_clean.counts_AAGCTT.txt allhic_clean.pairs.txt 39 --minREs 25
18:22:54 ReadCSVLines | NOTICE  Parse csvfile `allhic_clean.counts_AAGCTT.txt`
18:22:54 readRE | NOTICE  Loaded 132 contig RE lengths for normalization from `allhic_clean.counts_AAGCTT.txt`
18:22:54 skipContigsWithFewREs | NOTICE  skipContigsWithFewREs with MinREs = 25 (RE = clean.counts)
Contig #0 (ctg000000_87047) has 2 RE sites -> MARKED SHORT
Contig #3 (ctg000120_360759.1) has 3 RE sites -> MARKED SHORT
Contig #10 (ctg000260_231929) has 20 RE sites -> MARKED SHORT
Contig #13 (ctg000390_209632) has 17 RE sites -> MARKED SHORT
Contig #15 (ctg000480_218654.1) has 7 RE sites -> MARKED SHORT
Contig #16 (ctg000570_166868) has 15 RE sites -> MARKED SHORT
Contig #17 (ctg000580_268020) has 21 RE sites -> MARKED SHORT
Contig #20 (ctg000640_183875) has 21 RE sites -> MARKED SHORT
Contig #21 (ctg000650_96535) has 9 RE sites -> MARKED SHORT
Contig #22 (ctg000660_161107.1) has 1 RE sites -> MARKED SHORT
Contig #24 (ctg000680_114460) has 3 RE sites -> MARKED SHORT
Contig #29 (ctg000750_219888) has 9 RE sites -> MARKED SHORT
Contig #30 (ctg000760_138168) has 2 RE sites -> MARKED SHORT
Contig #35 (ctg000900_151556) has 22 RE sites -> MARKED SHORT
Contig #36 (ctg000950_133862.1) has 3 RE sites -> MARKED SHORT
Contig #37 (ctg000960_183813) has 7 RE sites -> MARKED SHORT
Contig #39 (ctg001030_62866) has 5 RE sites -> MARKED SHORT
Contig #43 (ctg001190_28810) has 6 RE sites -> MARKED SHORT
Contig #45 (ctg001250_34616) has 7 RE sites -> MARKED SHORT
Contig #48 (ctg001420_232259) has 17 RE sites -> MARKED SHORT
Contig #56 (ctg001600_69605) has 18 RE sites -> MARKED SHORT
Contig #57 (ctg001610_249986.1) has 20 RE sites -> MARKED SHORT
Contig #61 (ctg001680_82745) has 1 RE sites -> MARKED SHORT
Contig #62 (ctg001690_174113) has 2 RE sites -> MARKED SHORT
Contig #63 (ctg001700_219832) has 1 RE sites -> MARKED SHORT
Contig #65 (ctg001770_88761) has 8 RE sites -> MARKED SHORT
Contig #67 (ctg000030_375493) has 6 RE sites -> MARKED SHORT
Contig #77 (ctg000300_247109) has 1 RE sites -> MARKED SHORT
Contig #78 (ctg000370_1125237) has 22 RE sites -> MARKED SHORT
Contig #88 (ctg000840_228019) has 1 RE sites -> MARKED SHORT
Contig #89 (ctg000850_142256) has 14 RE sites -> MARKED SHORT
Contig #90 (ctg000880_246392) has 1 RE sites -> MARKED SHORT
Contig #91 (ctg000930_161761) has 14 RE sites -> MARKED SHORT
Contig #99 (ctg001270_192646) has 2 RE sites -> MARKED SHORT
Contig #103 (ctg001570_357441) has 3 RE sites -> MARKED SHORT
Contig #118 (ptg000109l) has 10 RE sites -> MARKED SHORT
Contig #120 (ptg000124l) has 14 RE sites -> MARKED SHORT
Contig #128 (ptg000371l) has 3 RE sites -> MARKED SHORT
Contig #130 (ptg000458l) has 1 RE sites -> MARKED SHORT
18:22:54 skipContigsWithFewREs | NOTICE  Marked 39 contigs (avg 8.7 RE sites, len 270937) since they contain too few REs (MinREs = 25)
18:22:54 ReadCSVLines | NOTICE  Parse csvfile `allhic_clean.pairs.txt`
18:22:54 skipRepeats | NOTICE  skipRepeats with multiplicity = 2
Contig #6 (ctg000210_3318479.1) has 2.6x the average number of Hi-C links -> MARKED REPETITIVE
Contig #14 (ctg000400_3549130.1) has 2.1x the average number of Hi-C links -> MARKED REPETITIVE
Contig #49 (ctg001430_1803570.1) has 2.1x the average number of Hi-C links -> MARKED REPETITIVE
Contig #52 (ctg001510_182939) has 2.8x the average number of Hi-C links -> MARKED REPETITIVE
Contig #65 (ctg001770_88761) has 2.0x the average number of Hi-C links -> MARKED REPETITIVE
Contig #75 (ctg000220_2289644) has 3.0x the average number of Hi-C links -> MARKED REPETITIVE
Contig #78 (ctg000370_1125237) has 2.2x the average number of Hi-C links -> MARKED REPETITIVE
Contig #88 (ctg000840_228019) has 3.1x the average number of Hi-C links -> MARKED REPETITIVE
Contig #89 (ctg000850_142256) has 2.8x the average number of Hi-C links -> MARKED REPETITIVE
Contig #91 (ctg000930_161761) has 2.0x the average number of Hi-C links -> MARKED REPETITIVE
Contig #103 (ctg001570_357441) has 2.4x the average number of Hi-C links -> MARKED REPETITIVE
Contig #105 (ctg001710_930594) has 2.3x the average number of Hi-C links -> MARKED REPETITIVE
18:22:54 skipRepeats | NOTICE  Marked 12 contigs (avg len 1309990) as repetitive (MaxLinkDensity = 2)
18:22:54 Cluster | NOTICE  Clustering starts with 132 (87 informative) contigs with target of 39 clusters
18:22:54 Cluster | NOTICE  Merge #50: Clusters  82 + 98 -> 181, Linkage = 1.1223572e+07
18:22:54 Cluster | NOTICE  No more merges to do since the queue is empty
18:22:54 setClusters | NOTICE  setClusters summary (NonInformativeRatio = 3): nPassRatio = 45, nFailRatio = 0, nFailCluster=0
18:22:54 printClusters | NOTICE  Write 1 partitions to `allhic_clean.clusters.txt`
18:22:54 writeRE | NOTICE  RE counts in 132 contigs (total: 418042, avg 1 per 2694 bp) written to `allhic_clean.counts_AAGCTT.39g1.txt`
18:22:54 Run | NOTICE  Success
$ ll

-rw-rw-r-- 1 poultrylab1 poultrylab1 3.6K Jan 14 18:22 allhic_clean.counts_AAGCTT.39g1.txt
-rw-rw-r-- 1 poultrylab1 poultrylab1 2.2K Jan 14 18:22 allhic_clean.clusters.txt
-rw-rw-r-- 1 poultrylab1 poultrylab1 273K Jan 14 18:22 allhic_clean.pairs.txt
-rw-rw-r-- 1 poultrylab1 poultrylab1  11K Jan 14 18:22 allhic_clean.distribution.txt
-rw-rw-r-- 1 poultrylab1 poultrylab1 486M Jan 14 18:22 allhic_clean.clm
-rw-rw-r-- 1 poultrylab1 poultrylab1 3.6K Jan 14 18:20 allhic_clean.counts_AAGCTT.txt
drwxrwxr-x 5 poultrylab1 poultrylab1  164 Jan 14 18:11 map
-rw-rw-r-- 1 poultrylab1 poultrylab1 6.7G Jan 14 18:09 allhic_clean.bam
-rw-rw-r-- 1 poultrylab1 poultrylab1 9.4M Jan 13 04:51 merged_out.fasta.near_AAGCTT.500.bed
-rw-rw-r-- 1 poultrylab1 poultrylab1 3.5M Jan 13 04:51 merged_out.fasta.pos_of_AAGCTT.txt
-rw-rw-r-- 1 poultrylab1 poultrylab1 1.4K Jan 13 04:09 map.sh
drwxrwxr-x 2 poultrylab1 poultrylab1   53 Jan  7 19:48 partion
-rw-rw-r-- 1 poultrylab1 poultrylab1 538M Jan  7 01:18 merged_out.fasta.sa
-rw-rw-r-- 1 poultrylab1 poultrylab1   17 Jan  7 01:12 merged_out.fasta.amb
-rw-rw-r-- 1 poultrylab1 poultrylab1 5.9K Jan  7 01:12 merged_out.fasta.ann
-rw-rw-r-- 1 poultrylab1 poultrylab1 269M Jan  7 01:12 merged_out.fasta.pac
-rw-rw-r-- 1 poultrylab1 poultrylab1 1.1G Jan  7 01:12 merged_out.fasta.bwt
-rw-rw-r-- 1 poultrylab1 poultrylab1 6.4K Jan  7 01:11 merged_out.fasta.fai
lrwxrwxrwx 1 poultrylab1 poultrylab1   42 Jan  7 00:59 merged_out.fasta -> ../../6.scaffolding/third/merged_out.fasta

Three repeat samples were mapping by Aligning Hi-C reads to the draft assembly & Filtering SAM file in https://github.com/tangerzhang/ALLHiC/wiki#running-allhic, then I merged three xxx.clean.bam into allhic_clean.bam by samtools merge. It reported No @HD tag found.

I am not sure where I use the wrong way, but I think something must go wrong. Such as, my HiC data is PE150, it seems bwa aln not the best way to map.

Thanks in advance !!! Waiting for your reply~

Johnsonzcode avatar Jan 14 '21 03:01 Johnsonzcode

Hi @Johnsonzcode This situation happens when the initial assembly has many chimeric contigs. You can first correct the contigs using 3D-DNA or ALLHiC_corrector. If you are using ALLHiC_corrector, the following steps may be useful:

  1. mapping Hi-C reads to the initial contig assembly (bwa mem is prefered)
  2. sort and index the bam file
  3. run ALLHiC_corrector (requires pysam and numpy installed)
ALLHiC_corrector.py -m mapping.sorted.bam -r seq.fasta -o seq.corrected.fasta -t 48
  1. run ALLHiC scaffolding following the guideline: https://github.com/tangerzhang/ALLHiC/wiki/ALLHiC:-scaffolding-of-a-simple-diploid-genome

tangerzhang avatar Jan 14 '21 06:01 tangerzhang

Thank your for your quick reply! I am not sure about Filtering SAM file in https://github.com/tangerzhang/ALLHiC/wiki, which tell me skip this step if you are using bwa mem for alignment :

PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
(*)filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam

but if I skip this step, wheresample.clean.sam in samtools view should be repalced by sample.bwa_mem.REduced.paired_only.bam or what ? I am new about this.

Johnsonzcode avatar Jan 15 '21 05:01 Johnsonzcode

Yes, you are right. If you skip the filtering step, you can use sample.bwa_mem.REduced.paired_only.bam instead. Alternatively, if you are working on a genome with a high level of repetitive sequences, you may need to select uniquely mapped reads via samtools view -q 40

tangerzhang avatar Jan 17 '21 16:01 tangerzhang

Thank you sincerely !

Johnsonzcode avatar Jan 18 '21 00:01 Johnsonzcode

Sorry for reopening it. When I try to correct contigs, mapping HiC reads to draft assembly confused me, in which bwa mem:

  1. I am not sure using bwa-mem with paird-end mode or single-end mode, I choose single-end mode , am I right ?
  2. But it results two sam or bam file , should I merge them with samtools ?
  3. Then follow your code : https://github.com/tangerzhang/ALLHiC/issues/75#issuecomment-759946162 , right ?

Johnsonzcode avatar Feb 18 '21 07:02 Johnsonzcode

Sorry for reopening it. When I try to correct contigs, mapping HiC reads to draft assembly confused me, in which bwa mem:

  1. I am not sure using bwa-mem with paird-end mode or single-end mode, I choose single-end mode , am I right ?
  2. But it results two sam or bam file , should I merge them with samtools ?
  3. Then follow your code : #75 (comment) , right ?

bwa-mem with paired-end mode should be used.

tangerzhang avatar Feb 18 '21 08:02 tangerzhang

Thank you for your help !

And I have used HiC-Pro, which use bowtie2 with single-end mode to map HiC-reads, perhaps it will generate more valid reads by cutting reads. I am curious about why ALLHIC just use bwa-mem with piared-end mode ?

Happy new year!

Johnsonzcode avatar Feb 18 '21 08:02 Johnsonzcode

ALLHiC_correct takes bwa-mem with paired-end reads alignment but we have not tested it on bowtie2 with single-end mode. Please let us know if it works on bowtie2 mapping. Otherwise, we can fix this bug and make it compatible with bowtie2 mapping.

tangerzhang avatar Feb 19 '21 12:02 tangerzhang

Ok, I will try. I rerun bwa-mem to get bam file as your suggestion, after correcting my draft, ALLHiC_partition step shows log like log.txt, and still only got 1 group.

Johnsonzcode avatar Feb 19 '21 13:02 Johnsonzcode

@Johnsonzcode After correction with 3D-DNA or ALLHiC_corrector, you should rebuild index with output from corrector.

wyim-pgl avatar Feb 19 '21 23:02 wyim-pgl

@wyim-pgl Thank you, I guess you mean samtools faidx and bwa index my correction assembly ? Indeed, I forgot to index last time. But still got 1 group after indexing it. Here is my code :

draft=hybrid_m_all.fasta
bwa index $draft && echo "bwa index done"
samtools faidx $draft && echo "samtools faidx done"

for i in `ls sample1/*_R1.fastq.gz`;
do
ii=`basename $i`
mkdir ${ii/%_R1.fastq.gz}
bwa mem -t 150 $draft $i ${i/%_R1.fastq.gz/_R2.fastq.gz} > ${ii/%_R1.fastq.gz}/${ii/%_R1.fastq.gz}.sam && echo "mapping done"
PreprocessSAMs.pl ${ii/%_R1.fastq.gz}/${ii/%_R1.fastq.gz}.sam $draft HINDIII && echo "preprocessing done"

samtools view -@ 150 -bt ${draft}.fai ${ii/%_R1.fastq.gz}/${ii/%_R1.fastq.gz}.REduced.paired_only.bam > ${ii/%_R1.fastq.gz}/${ii/%_R1.fastq.gz}.clean.bam && echo "cleanning down"
done
samtools sort -o RHC01168_L8_sort.bam -@ 12 RHC01168_L8.bam && samtools index RHC01168_L8_sort.bam

ALLHiC_corrector.py -m RHC01168_L8_sort.bam -r hybrid_m_all.fasta -o hybrid_m_all_cor.fasta -t 48
samtools faidx hybrid_m_all_cor.fasta && bwa index hybrid_m_all_cor.fasta
ALLHiC_partition -b RHC01168_L8.clean.bam -r hybrid_m_all_cor.fasta -e AAGCTT -k 41

Here I got :

-rw-rw-r-- 1 278 Feb 20 17:37 RHC01168_L8.clean.counts_AAGCTT.41g1.txt
-rw-rw-r-- 1 192 Feb 20 17:37 RHC01168_L8.clean.clusters.txt
-rw-rw-r-- 1 0 Feb 20 17:35 RHC01168_L8.clean.clm
-rw-rw-r-- 1 187K Feb 20 17:35 RHC01168_L8.clean.counts_AAGCTT.txt
-rw-rw-r-- 1  607M Feb 20 17:34 hybrid_m_all_cor.fasta.sa
-rw-rw-r-- 1 6.1K Feb 20 17:25 hybrid_m_all_cor.fasta.amb
-rw-rw-r-- 1  271K Feb 20 17:25 hybrid_m_all_cor.fasta.ann
-rw-rw-r-- 1 304M Feb 20 17:25 hybrid_m_all_cor.fasta.pac
-rw-rw-r-- 11.2G Feb 20 17:25 hybrid_m_all_cor.fasta.bwt
-rw-rw-r-- 1 272K Feb 20 17:09 hybrid_m_all_cor.fasta.fai
-rw-rw-r-- 1 23K Feb 20 05:10 RHC01168_L8.clean.pairs.txt
-rw-rw-r-- 1  11K Feb 20 05:10 RHC01168_L8.clean.distribution.txt
-rw-rw-r-- 1 347K Feb 20 04:59 log.txt
-rw------- 1  80K Feb 20 03:57 nohup.out
-rw-rw-r-- 1 1.2G Feb 20 03:57 hybrid_m_all_cor.fasta
-rw-rw-r-- 1 4.1M Feb 20 03:00 RHC01168_L8_sort.bam.bai
-rw-rw-r-- 1 88G Feb 20 02:55 RHC01168_L8_sort.bam
-rw-rw-r-- 1 45G Feb 19 20:13 RHC01168_L8.clean.bam
-rw-rw-r-- 1 457 Feb 19 20:10 RHC01168_L8.REduced.paired_only.flagstat
-rw-rw-r-- 1 45G Feb 19 20:00 RHC01168_L8.REduced.paired_only.bam
-rw-rw-r-- 1 45G Feb 19 18:07 RHC01168_L8.REduced.bam
-rw-rw-r-- 1 128G Feb 19 15:12 RHC01168_L8.bam

Johnsonzcode avatar Feb 20 '21 02:02 Johnsonzcode

hi, Johnsonzcode. What is the size of your genome? How long did 'bwa mem' run?

Ahahaha3 avatar Feb 24 '21 13:02 Ahahaha3

Hi Ahahaha3, my genome is about 1.2Gb, and it is time-consuming, I don't remember exactly.

Johnsonzcode avatar Feb 24 '21 13:02 Johnsonzcode

@Johnsonzcode with this output hybrid_m_all_cor.fasta, you need to do bwa index, bwa mem and run ALLHiC again.

wyim-pgl avatar Feb 25 '21 19:02 wyim-pgl

@wyim-pgl Do you mean remap the HiC reads to hybrid_m_all_cor.fasta ?

Johnsonzcode avatar Feb 26 '21 02:02 Johnsonzcode

yes

wyim-pgl avatar Feb 26 '21 02:02 wyim-pgl

@wyim-pgl Thank you so much! I got HiC heatmap of my assembly. But I don't know how to interpret it. Is there some wiki to learn about HiC heatmap and assess my assembly. 500K_all_chrs.pdf 500K_Whole_genome.pdf

Johnsonzcode avatar Feb 27 '21 07:02 Johnsonzcode

Hi Dr.zhang Thanks your great work in ALLHIC ! I got the similar error when running ALLHiC-pip, after two round of mapping, I got only 1 group in ALLHiC-partition even set k=10. The error.log shows "Cluster | NOTICE ESC[0m No more merges to do since the queue is empty", but the group just contains 193 contigs of 1651 contigs .

lizhao007 avatar Jan 28 '22 03:01 lizhao007