HiCExplorer
HiCExplorer copied to clipboard
hicConvertFormat 2D-text to cool generates spurious pairs
- HiCExplorer version == 3.7.2 (installed using conda)
- Python version == 3.9.13
Hi there,
I'm observing a weird behavior of the HicConvertFormat function, when I'm using a 2d-text bed file as input. Here it is the command:
hicConvertFormat -m input.bed -o output.cool --resolutions 10000 --chromosomeSizes hg19_chr_sizes.txt --inputFormat 2D-text --outputFormat cool
1. input .bed file characteristics
The input bed file contains 36526 number of connected pairs and it is formatted as follows : chr_a | start_a | end_a | chr_b | start_b | end_b | contacts
This is the output of head input.bed:
chr1 710000 720000 chr1 750000 760000 17 chr1 940000 950000 chr1 990000 1000000 15 chr1 940000 950000 chr1 1180000 1190000 10 chr1 940000 950000 chr1 1250000 1260000 11
2. output .cool file characteristics
The conversion proceeds without errors, but when I inspect the data, I obtain a dataframe of 68640 connected pairs, in which some spurious pairs are incorporated:
c = cooler.Cooler(filepath)
c.pixels(join=True)[:]
chrom1 | start1 | end1 | chrom2 | start2 | end2 | count chr1 710000 720000 chr1 750000 760000 17.0 chr1 720000 730000 chr1 760000 770000 17.0 ← spurious pair chr1 940000 950000 chr1 990000 1000000 15.0 chr1 940000 950000 chr1 1180000 1190000 10.0 chr1 940000 950000 chr1 1250000 1260000 11.0
3. set of bins
In order to understand if some pairs were also lost during the conversion I generated two set of bins: bed_bins ad cool_bins and I compared them:
3.1 bed_bins
df_bed['start']=df_bed['chr1']+'-'+df_bed['s1'].astype(str)+'-'+df_bed['e1'].astype(str)
df_bed['end']=df_bed['chr2']+'-'+df_bed['s2'].astype(str)+'-'+df_bed['e2'].astype(str)
bed_bins=set(df_bed['start'])|set(df_bed['end'])
len(bed_bins)
26760
3.2 cool_bins
df_cool=c.pixels(join=True)[:]
df_cool['start']=df_cool['chrom1'].astype(str)+'-'+df_cool['start1'].astype(str)+'-'+df_cool['end1'].astype(str)
df_cool['end']=df_cool['chrom2'].astype(str)+'-'+df_cool['start2'].astype(str)+'-'+df_cool['end2'].astype(str)
cool_bins=set(df_cool['start'])|set(df_cool['end'])
len(cool_bins)
41159
3.3 comparison between sets
len(bed_bins-cool_bins)
0 ← 0 bin lost
len(cool_bins-bed_bins)
14399 ← newly generated bins
Any help you can provide would be greatly appreciated! Thanks a lot.
Hi,
I think the issue is that the software consider that the interval:
chr1 710000 720000 both match to the bin chr1 710000 720000 and the bin chr1 720000 730000, therefore it duplicates it.
Indeed trying with you first contacts, using hicConvertFormat I got:
$ cooler dump --join output.cool
chr1 710000 720000 chr1 750000 760000 17
chr1 720000 730000 chr1 760000 770000 17
chr1 940000 950000 chr1 990000 1000000 15
chr1 940000 950000 chr1 1180000 1190000 10
chr1 940000 950000 chr1 1250000 1260000 11
chr1 950000 960000 chr1 1000000 1010000 15
chr1 950000 960000 chr1 1190000 1200000 10
chr1 950000 960000 chr1 1260000 1270000 11
A solution waiting for this bug to be fixed is to use cooler:
$ cooler load -f bg2 hg19_chr_sizes.txt:10000 input.bed test.cool
$ cooler dump --join test.cool
chr1 710000 720000 chr1 750000 760000 17
chr1 940000 950000 chr1 990000 1000000 15
chr1 940000 950000 chr1 1180000 1190000 10
chr1 940000 950000 chr1 1250000 1260000 11
Hi, thank you so much, I would definitely give it a try!
@joachimwolff The issue is here: https://github.com/deeptools/HiCExplorer/blob/20be92139dcfc6b4b215926b7fd8f4c7f6f48165/hicexplorer/hicConvertFormat.py#L210-L213 I think you were not expecting to find 2 bins for a region and you may consider using end-1.