hicMergeMatrixBins changes the chr size

Open cgirardot opened this issue 2 years ago • 12 comments

I just noticed that (raw) matrices merged with hicMergeMatrixBins (v 3.7.2) have altered chr size. For example:

The initial 1K bin matrix has:

 hicInfo -m HiC_2-4h_R1_all-rep-merged.1Kb-bin-matrix.h5
# Matrix information file. Created with HiCExplorer's hicInfo version 3.7.2
File:	HiC_2-4h_R1_all-rep-merged.1Kb-bin-matrix.h5
Size:	144,916
Bin_length:	1000
Sum of matrix:	211759338.0
Chromosomes:length: chr2L: 23513712 bp; chr2R: 25286936 bp; chr3L: 28110227 bp; chr3R: 32079331 bp; chr4: 1348131 bp; chrM: 19524 bp; <CUT for clarity>

After hicMergeMatrixBins-3.7.2 -o HiC_2-4h_R1_10K_3.7.2.h5 -nb 10 -m HiC_2-4h_R1_all-rep-merged.1Kb-bin-matrix.h5 :

hicInfo -m HiC_2-4h_R1_10K_3.7.2.h5 
# Matrix information file. Created with HiCExplorer's hicInfo version 3.7.2
File:	HiC_2-4h_R1_10K_3.7.2.h5
Size:	14,143
Bin_length:	10000
Sum of matrix:	210935523.0
Chromosomes:length: chr2L: 23510000 bp; chr2R: 25286936 bp; chr3L: 28110000 bp; chr3R: 32079331 bp; chr4: 1348131 bp; chrM: 19524 bp; <CUT for clarity>

Notice the chr2L, chr3L. I fear this can lead to issues later
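For anyone wanting to spot such changes quickly, here is a minimal Python sketch that parses the two "Chromosomes:length:" lines pasted above and reports which chromosomes differ. The helper name parse_chrom_lengths is hypothetical (not part of HiCExplorer), and the sketch assumes the line format is exactly what hicInfo printed above.

```python
import re

def parse_chrom_lengths(line):
    """Parse a hicInfo 'Chromosomes:length:' line into {chrom: length_bp}.

    Hypothetical helper; assumes entries look like 'chr2L: 23513712 bp;'.
    """
    return {name: int(length)
            for name, length in re.findall(r"([^\s;:]+): (\d+) bp", line)}

before = parse_chrom_lengths(
    "Chromosomes:length: chr2L: 23513712 bp; chr2R: 25286936 bp; "
    "chr3L: 28110227 bp; chr3R: 32079331 bp; chr4: 1348131 bp; chrM: 19524 bp;"
)
after = parse_chrom_lengths(
    "Chromosomes:length: chr2L: 23510000 bp; chr2R: 25286936 bp; "
    "chr3L: 28110000 bp; chr3R: 32079331 bp; chr4: 1348131 bp; chrM: 19524 bp;"
)

# Keep only the chromosomes whose reported length changed after merging.
changed = {c: (before[c], after.get(c)) for c in before if before[c] != after.get(c)}
print(changed)
# {'chr2L': (23513712, 23510000), 'chr3L': (28110227, 28110000)}
```

Only chr2L and chr3L differ, and both now end on a multiple of 10,000.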

cgirardot avatar Mar 21 '22 13:03 cgirardot

I confirm that this issue gives rise to problems when comparing matrices (hicCompareMatrices) that have been generated by bin merging.

(version 3.7.2)

sebastian-gregoricchio avatar Apr 24 '22 09:04 sebastian-gregoricchio

@sebastian-gregoricchio I suspected as much (I am pretty sure I had this problem before). How did you solve this?

cgirardot avatar Apr 25 '22 11:04 cgirardot

@cgirardot Actually I had to regenerate the matrix from the beginning, directly at the desired resolution. Before, I was generating a 5kb matrix and then merging the bins to get 20kb, 40kb, 100kb matrices. I wanted to subtract matrices from 2 different conditions at 40kb and it was not working. So I generated the 40kb matrices directly and then it worked.

sebastian-gregoricchio avatar Apr 25 '22 14:04 sebastian-gregoricchio

@sebastian-gregoricchio I see. Thx. I would have tried maybe to dump it and re-create it with cooler.

cgirardot avatar Apr 25 '22 15:04 cgirardot

I believe that comes from some rounding in the last bin. I do not get how you had issues downstream though. Can you elaborate a bit on that?

LeilyR avatar Apr 25 '22 15:04 LeilyR

Sorry, I don't have a concrete example to provide. It is weird anyway that some chromosomes are rounded and not others (see initial post). I think this should not happen and should be fixed if possible.

cgirardot avatar Apr 25 '22 15:04 cgirardot

I labeled it, so we will have a deeper look at it.

LeilyR avatar Apr 25 '22 15:04 LeilyR

I believe that comes from some rounding in the last bin. I do not get how you had issues downstream though. Can you elaborate a bit on that?

Actually I do not have an exact message right now, but I did the following steps:

  • Make a cool file at resolution h5 (starting from 5kb)
  • Merge 2, 4, 8, 20 bins to get the 10kb, 20kb, 40kb, 100kb resolutions
  • Summed the matrices of the same resolution by condition (in my case Tumor samples and normal samples)
  • Normalization and correction
  • Then I wanted to make a FoldChange or Difference matrix of Tumor/Normal or Tumor-Normal

When doing the last step for the 40kb resolution matrices (obtained by bin merging), hicCompareMatrices returned an error like: "The size of the chromosomes in file A differs from chr sizes in file B."

When I instead build the matrix directly at 40kb resolution from step 1, everything works fine.

sebastian-gregoricchio avatar Apr 25 '22 15:04 sebastian-gregoricchio

this sounds very familiar. I might even have mentioned this in a previous issue.

cgirardot avatar Apr 25 '22 15:04 cgirardot

I am also getting the same error:

The two matrices have different chromosome order. Use the tool hicAdjustMatrix to change the order.
Merge1.cool: odict_keys(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT'])
Merge4.cool: odict_keys(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT'])

What is the solution for this?

KaurKaram avatar Mar 02 '23 21:03 KaurKaram

Hi, if you are working with cool files, my solution would be to recreate the cool file. Here are the lines:

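binsize=40  # target resolution in kb
chromSizeFile="yourgenome.fa.fai"
inputMatrix="whatever.cool"
outputMatrix="whatever_fixed.cool"
# Dump the pixels in bg2 format and reload them against the full chromosome sizes
cooler dump --join "${inputMatrix}" | cooler load --format bg2 "${chromSizeFile}:${binsize}000" - "${outputMatrix}"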

lldelisle avatar Mar 03 '23 10:03 lldelisle

Hi everyone,

It seems that the chromosome length changes after lowering the resolution (merging bins) because of these lines (248-249) in HicMergeMatrixBins.py:

    if count < num_bins / 2:
        log.debug("{} has few bins ({}). Skipping it\n".format(prev_ref, count))

This takes effect at the last bins of a given chromosome. If the number of remaining bins is lower than half the desired number of bins to merge, those bins are simply discarded and the end of the chromosome becomes the end of the last merged bin.

For example, let's say you have a 1kb matrix and you want to obtain a 25kb resolution one.

  • Your chromosome is 15,008,000 bp long.
  • Your number of bins num_bins = 25.

The iterations will perform binning/merging for the first 600 x 25kb bins (= 15,000,000 bp). Once it reaches this point, it will try to merge the remaining 8 kb (so 8 bins, since the initial resolution is 1kb), but since 8 is < num_bins/2 (= 25/2 = 12.5), it will discard the remaining bins. In the end, your chromosome length will be the end of the last bin you have, i.e. 15,000,000.
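The arithmetic above can be sketched in a few lines of Python. This is a simplified model of the discard rule as described in this thread, not the actual HiCExplorer code; the function name merged_chrom_end is hypothetical, and it assumes fixed-size input bins with a possibly shorter last bin.

```python
import math

def merged_chrom_end(chrom_len, bin_size, num_bins):
    """Chromosome end after merging, under the 'count < num_bins / 2' rule.

    Simplified model (hypothetical helper, not HiCExplorer's implementation).
    """
    n_bins = math.ceil(chrom_len / bin_size)           # input bins on this chromosome
    full_groups, leftover = divmod(n_bins, num_bins)   # complete merges + trailing bins
    if leftover == 0 or leftover >= num_bins / 2:
        return chrom_len                               # trailing bins are kept
    # Trailing bins are discarded; returns 0 if the whole chromosome is dropped.
    return full_groups * num_bins * bin_size

# Worked example from this post: 1kb bins merged by 25.
print(merged_chrom_end(15_008_000, 1000, 25))   # 15000000

# Values from the initial post: 1kb bins merged by 10.
print(merged_chrom_end(23_513_712, 1000, 10))   # chr2L: 23510000 (4 leftover bins dropped)
print(merged_chrom_end(25_286_936, 1000, 10))   # chr2R: 25286936 (7 leftover bins kept)
```

Under this model, the chromosomes from the initial post behave as reported: chr2L and chr3L have fewer than 5 leftover bins and get truncated, while the others keep their original length.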

@LeilyR @lldelisle Let's say one wants to keep all the bins, do you think it's safe to bypass this discarding filter and use all the remaining bins even if it's less than num_bins/2 ? If so could it be made as an option --keepLastBin ?
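To make the proposal concrete, here is a rough Python sketch of what such an option could do on the bin intervals of a single chromosome. The function merge_bins and its keep_last_bin parameter are hypothetical illustrations; --keepLastBin is only a suggestion in this post, and this is not how HiCExplorer is implemented.

```python
def merge_bins(bins, num_bins, keep_last_bin=False):
    """Merge consecutive (start, end) bins of one chromosome in groups of num_bins.

    Hypothetical sketch: with keep_last_bin=False, a trailing group with fewer
    than num_bins / 2 bins is skipped (the behaviour described above);
    with keep_last_bin=True it is kept as a shorter final bin.
    """
    merged = []
    for i in range(0, len(bins), num_bins):
        group = bins[i:i + num_bins]   # only the last group can be short
        if len(group) < num_bins / 2 and not keep_last_bin:
            continue
        merged.append((group[0][0], group[-1][1]))
    return merged

# Toy chromosome of 58,000 bp in 1kb bins, merged by 25 (8 bins left over).
bins_1kb = [(start, start + 1000) for start in range(0, 58_000, 1000)]

print(merge_bins(bins_1kb, 25))
# [(0, 25000), (25000, 50000)]
print(merge_bins(bins_1kb, 25, keep_last_bin=True))
# [(0, 25000), (25000, 50000), (50000, 58000)]
```

With the flag on, the last merged bin is shorter than the others but the chromosome end stays at its original position.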

Best,

A

u-n-i-v-e-r-z avatar Apr 25 '23 11:04 u-n-i-v-e-r-z