cooler
cooler balance drops many interactions
I have a cool file that appears to be fine - there are reads in pretty much every bin. Here it is as a ginteractions file:
NC_042997.1 0 1000000 NC_042997.1 0 1000000 63334
NC_042997.1 0 1000000 NC_042997.1 1000000 2000000 6563
NC_042997.1 0 1000000 NC_042997.1 2000000 3000000 3641
NC_042997.1 0 1000000 NC_042997.1 3000000 4000000 2913
NC_042997.1 0 1000000 NC_042997.1 4000000 5000000 2443
NC_042997.1 0 1000000 NC_042997.1 5000000 6000000 2192
NC_042997.1 0 1000000 NC_042997.1 6000000 7000000 1757
NC_042997.1 0 1000000 NC_042997.1 7000000 8000000 1460
NC_042997.1 0 1000000 NC_042997.1 8000000 9000000 1311
NC_042997.1 0 1000000 NC_042997.1 9000000 10000000 1348
NC_042997.1 0 1000000 NC_042997.1 10000000 11000000 706
NC_042997.1 0 1000000 NC_042997.1 11000000 12000000 517
NC_042997.1 0 1000000 NC_042997.1 12000000 13000000 485
.
.
.
et cetera
And here's an image after converting to mcool:
![Screen Shot 2020-09-12 at 10 26 58 AM](https://user-images.githubusercontent.com/3123273/93001248-92266f00-f4e2-11ea-8f9a-34408cf285ac.png)
When I run `cooler balance --force`, the output has A LOT of bins dropped. Note how the start of the file is now at 11 Mb:
NC_042997.1 11000000 12000000 NC_042997.1 11000000 12000000 0.12287012107273232
NC_042997.1 11000000 12000000 NC_042997.1 14000000 15000000 0.009179471469380696
NC_042997.1 11000000 12000000 NC_042997.1 16000000 17000000 0.0022613667508674393
NC_042997.1 11000000 12000000 NC_042997.1 17000000 18000000 0.0017934892086176287
NC_042997.1 11000000 12000000 NC_042997.1 19000000 20000000 0.0015706112594678617
NC_042997.1 11000000 12000000 NC_042997.1 20000000 21000000 0.0016067216658818219
NC_042997.1 11000000 12000000 NC_042997.1 22000000 23000000 0.0026443289772690387
NC_042997.1 11000000 12000000 NC_042997.1 23000000 24000000 0.0017619224122503929
NC_042997.1 11000000 12000000 NC_042997.1 32000000 33000000 0.001725912256081153
NC_042997.1 11000000 12000000 NC_042997.1 33000000 34000000 0.0015621264682953127
NC_042997.1 11000000 12000000 NC_042997.1 39000000 40000000 0.0016796723318025406
NC_042997.1 11000000 12000000 NC_042997.1 40000000 41000000 0.0010349507443594623
NC_042997.1 11000000 12000000 NC_042997.1 42000000 43000000 0.0013117895675163343
NC_042997.1 11000000 12000000 NC_042997.1 43000000 44000000 0.0015743365358010337
NC_042997.1 11000000 12000000 NC_042997.1 44000000 45000000 0.0014398805814632809
NC_042997.1 11000000 12000000 NC_042997.1 49000000 50000000 0.0014603177220181285
NC_042997.1 11000000 12000000 NC_042997.1 51000000 52000000 0.0018351809986778198
.
.
.
et cetera
And here is the balanced matrix after converting to an mcool and visualizing.
![Screen Shot 2020-09-12 at 10 27 13 AM](https://user-images.githubusercontent.com/3123273/93001287-cd28a280-f4e2-11ea-923f-4d9d338b61f9.png)
Do you have any idea what could be going on? Seems like something is not working the way it should. Thank you!
`cooler balance` tries to ensure that the balancing algorithm converges by filtering out (ignoring) "misbehaving" bins, i.e. bins with some sort of coverage issue: (1) too few interactions (controlled by the `--min-count` parameter), (2) not enough non-zero pixels (controlled by the `--min-nnz` parameter), or (3) coverage that deviates too much from the rest (controlled by the `--mad-max` parameter). See https://cooler.readthedocs.io/en/latest/cli.html#cooler-balance
(1) and (2) are related, of course, but `--min-nnz` lets you avoid extreme situations where, e.g., there are a handful of super-bright pixels (so `--min-count` is satisfied) but all the others are zero. Judging by your raw heatmap, it does not look like you have that problem.
(3) may be tricky to understand at first, but it seems to be exactly what you need to adjust. This filter first calculates a sort of "average bin coverage" per chromosome (the median, to be exact), and then checks whether the coverage of an individual bin deviates too much from that "average". "Too much" in this context is measured in MADs, median absolute deviations, aka the median deviation from the median (argh ...). Anyhow, the default `--mad-max 5` is perhaps too stringent for your data; you could try something like `--mad-max 10`, or more ...
Also, you might want to explore a bit further why the coverage in your data has such a wide distribution. Is there a biological reason for that? What organism is this?
PS. The words "coverage" and "marginals" are used interchangeably in this context; they are, roughly, the sum of interactions along a row (or column) of the heatmap.
PPS. There is some code for calculating raw coverage from a binned cooler at https://github.com/mirnylab/cooltools/blob/master/cooltools/coverage.py if you wish to explore that further. Or you can just calculate the row sums of the raw heatmap if the data is small enough to fit in memory.
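For the "just sum the rows" route, a rough sketch could look like the following. It assumes the pixels have already been read into `(bin1, bin2, count)` tuples holding only the upper triangle of the symmetric matrix (as in the ginteractions dump above); the function name `bin_coverage` and the `ignore_diags` argument are hypothetical, the latter meant to mimic the idea behind `cooler balance --ignore-diags`.

```python
from collections import defaultdict


def bin_coverage(pixels, ignore_diags=0):
    """Sum interaction counts per bin from upper-triangle pixels.

    pixels: iterable of (bin1_index, bin2_index, count) with bin1 <= bin2.
    ignore_diags: skip pixels on the first `ignore_diags` diagonals
    (hypothetical helper argument, for illustration only).
    """
    cov = defaultdict(int)
    for b1, b2, count in pixels:
        if abs(b2 - b1) < ignore_diags:
            continue
        cov[b1] += count
        # The matrix is symmetric, so an off-diagonal pixel also
        # contributes to the row of its second bin.
        if b2 != b1:
            cov[b2] += count
    return dict(cov)


# First three pixels from the raw dump above, at 1 Mb resolution
# (bin index = start coordinate / 1,000,000).
pixels = [(0, 0, 63334), (0, 1, 6563), (0, 2, 3641)]
raw_cov = bin_coverage(pixels)
far_cov = bin_coverage(pixels, ignore_diags=2)
```

Comparing the result with and without the near-diagonal pixels gives a quick feel for how much of each bin's coverage comes from very short-range contacts.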
Thanks for your response, Sergey. These are a few Hi-C libraries on SRA from Octopus sinensis. No one has published anything about AB-compartments or TADs in spiralians before, so I was taking a look. It could be that the library has too many short inserts and the log decay as distance increases is very rapid. That would explain the wide distribution of coverage.
Edit/Update:
After plotting the z-score of the bins, this looks unlike any dataset that I've ever seen. I think this problem can be fixed by increasing the bin size.
Marking as resolved. Please re-open if you are still encountering issues.