cooler balance drops many interactions

Open conchoecia opened this issue 3 years ago • 2 comments

I have a cool file that appears to be fine - there are reads in pretty much every bin. Here it is as a ginteractions file:

NC_042997.1     0       1000000 NC_042997.1     0       1000000 63334
NC_042997.1     0       1000000 NC_042997.1     1000000 2000000 6563
NC_042997.1     0       1000000 NC_042997.1     2000000 3000000 3641
NC_042997.1     0       1000000 NC_042997.1     3000000 4000000 2913
NC_042997.1     0       1000000 NC_042997.1     4000000 5000000 2443
NC_042997.1     0       1000000 NC_042997.1     5000000 6000000 2192
NC_042997.1     0       1000000 NC_042997.1     6000000 7000000 1757
NC_042997.1     0       1000000 NC_042997.1     7000000 8000000 1460
NC_042997.1     0       1000000 NC_042997.1     8000000 9000000 1311
NC_042997.1     0       1000000 NC_042997.1     9000000 10000000        1348
NC_042997.1     0       1000000 NC_042997.1     10000000        11000000        706
NC_042997.1     0       1000000 NC_042997.1     11000000        12000000        517
NC_042997.1     0       1000000 NC_042997.1     12000000        13000000        485
.
.
.
et cetera

And here's an image after converting to mcool:

[Image: Screen Shot 2020-09-12 at 10 26 58 AM]

When I run cooler balance --force, the output has a lot of bins dropped. Note how the start of the file is now at 11 Mb:

NC_042997.1     11000000        12000000        NC_042997.1     11000000        12000000        0.12287012107273232
NC_042997.1     11000000        12000000        NC_042997.1     14000000        15000000        0.009179471469380696
NC_042997.1     11000000        12000000        NC_042997.1     16000000        17000000        0.0022613667508674393
NC_042997.1     11000000        12000000        NC_042997.1     17000000        18000000        0.0017934892086176287
NC_042997.1     11000000        12000000        NC_042997.1     19000000        20000000        0.0015706112594678617
NC_042997.1     11000000        12000000        NC_042997.1     20000000        21000000        0.0016067216658818219
NC_042997.1     11000000        12000000        NC_042997.1     22000000        23000000        0.0026443289772690387
NC_042997.1     11000000        12000000        NC_042997.1     23000000        24000000        0.0017619224122503929
NC_042997.1     11000000        12000000        NC_042997.1     32000000        33000000        0.001725912256081153
NC_042997.1     11000000        12000000        NC_042997.1     33000000        34000000        0.0015621264682953127
NC_042997.1     11000000        12000000        NC_042997.1     39000000        40000000        0.0016796723318025406
NC_042997.1     11000000        12000000        NC_042997.1     40000000        41000000        0.0010349507443594623
NC_042997.1     11000000        12000000        NC_042997.1     42000000        43000000        0.0013117895675163343
NC_042997.1     11000000        12000000        NC_042997.1     43000000        44000000        0.0015743365358010337
NC_042997.1     11000000        12000000        NC_042997.1     44000000        45000000        0.0014398805814632809
NC_042997.1     11000000        12000000        NC_042997.1     49000000        50000000        0.0014603177220181285
NC_042997.1     11000000        12000000        NC_042997.1     51000000        52000000        0.0018351809986778198
.
.
.
et cetera

And here is the balanced matrix after converting to an mcool and visualizing.

[Image: Screen Shot 2020-09-12 at 10 27 13 AM]

Do you have any idea what could be going on? Seems like something is not working the way it should. Thank you!

conchoecia, Sep 12 '20 17:09

cooler balance tries to ensure that the balancing algorithm converges by filtering out (ignoring) "misbehaving" bins, i.e. bins with some sort of coverage issue: (1) too few interactions (controlled by the --min-count parameter), (2) not enough non-zero pixels (controlled by the --min-nnz parameter), or (3) coverage that deviates too much from the rest (controlled by the --mad-max parameter). See https://cooler.readthedocs.io/en/latest/cli.html#cooler-balance

(1) and (2) are related, of course, but --min-nnz lets you avoid extreme situations where, e.g., there are a handful of super-bright pixels (so --min-count is satisfied) but all the others are zero. Judging by your raw heatmap, it does not look like you have that problem.

(3) may be tricky to understand at first, but it seems to be exactly what you need to adjust. This filter first calculates a sort of "average bin coverage" per chromosome (the median, to be exact), and then checks whether the coverage of an individual bin deviates too much from that "average". "Too much" in this context is measured in MADs - median absolute deviations - aka the median deviation from the median (argh ...). Anyhow, the default --mad-max 5 is perhaps too stringent for your data - you could try something like --mad-max 10, or more ...
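
To make the MAD idea concrete, here is a minimal sketch in plain NumPy. It is not cooler's actual code: the function name mad_max_mask is made up for illustration, it works on one flat array of marginals rather than per chromosome, and it assumes the filter is applied to log-transformed marginals and only drops bins on the low-coverage side, which is my reading of the docs.

```python
import numpy as np

def mad_max_mask(marginals, mad_max=5.0):
    """Return a boolean mask of bins to keep (True = keep).

    Hypothetical helper, NOT part of cooler. Assumption: the filter
    works on log marginals and only removes low-coverage bins.
    """
    marginals = np.asarray(marginals, dtype=float)
    # Only non-zero bins contribute to the median and the MAD.
    log_marg = np.log(marginals[marginals > 0])
    median = np.median(log_marg)
    # MAD: median of absolute deviations from the median.
    mad = np.median(np.abs(log_marg - median))
    # Bins below this cutoff (back in linear scale) are dropped.
    cutoff = np.exp(median - mad_max * mad)
    return marginals >= cutoff

# Toy coverage values: five similar bins and one poorly covered bin.
coverage = [100, 110, 90, 105, 95, 20]
print(mad_max_mask(coverage, mad_max=5))
print(mad_max_mask(coverage, mad_max=25))
```

On this toy input the last bin is dropped at mad_max=5 and kept at mad_max=25, which is the effect of loosening --mad-max: the tighter the bulk of the distribution, the smaller the MAD and the more easily a bin falls below the cutoff.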

Also, you might want to explore a bit further why the coverage in your data has such a wide distribution. Is there a biological reason for that? What organism is this?

PS. The words "coverage" and "marginals" are used interchangeably in this context; they are roughly the sum of interactions along a row (or column) of the heatmap.

PPS. There is some code for calculating raw coverage from a binned cooler at https://github.com/mirnylab/cooltools/blob/master/cooltools/coverage.py if you wish to explore that further ... Or you can just calculate the row sums of the raw heatmap if the data is small enough to fit in memory.
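
For the "just sum the rows" route, a rough sketch follows. It assumes the pixels are given as upper-triangle (bin1, bin2, count) triples, like the ginteractions rows above once the coordinates are turned into bin indices; coverage_from_pixels is a made-up name, not a cooler or cooltools function.

```python
import numpy as np

def coverage_from_pixels(pixels, n_bins):
    """Sum raw counts per bin from upper-triangle pixels.

    Hypothetical helper for illustration. Each pixel is a
    (bin1, bin2, count) triple with bin1 <= bin2.
    """
    cov = np.zeros(n_bins)
    for i, j, count in pixels:
        cov[i] += count
        # Off-diagonal pixels also contribute to the mirrored row.
        if i != j:
            cov[j] += count
    return cov

# First four rows of the raw ginteractions output at 1 Mb resolution,
# with start coordinates converted to bin indices (start // 1000000).
pixels = [(0, 0, 63334), (0, 1, 6563), (0, 2, 3641), (0, 3, 2913)]
print(coverage_from_pixels(pixels, n_bins=4))
```

With only these four pixels the numbers are of course partial; feed in all the pixels of a chromosome to get the real marginals, then look at their distribution (a histogram of the log values is usually the most telling).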

sergpolly, Sep 12 '20 19:09

Thanks for your response, Sergey. These are a few Hi-C libraries on SRA from Octopus sinensis. No one has published anything about A/B compartments or TADs in spiralians before, so I was taking a look. It could be that the library has too many short inserts, so that contact frequency decays very rapidly as distance increases. That would explain the wide distribution of coverage.

Edit/Update:

After plotting the z-score of the bins, this looks unlike any dataset that I've ever seen. I think this problem can be fixed by increasing the bin size.

[Image: renamed 1000000 dist]

conchoecia, Sep 12 '20 23:09

Marking as resolved. Please re-open if you are still encountering issues.

nvictus, Jan 24 '24 16:01