cooltools
cooltools copied to clipboard
Downsampling and cis counts
Hi, A part of this bug has already been reported here, I have the same issue on cooltools 0.5.4 version.
Also while writing this issue, I realized I had maybe too many questions for one post, sorry about that.
I have some cool files I want to downsample to a given cis contact count for all my replicates and different conditions to have the same number of cis contacts, so I can compare them (first of all, am I right on the usage of downsampling as a way to normalize sample coverage between conditions?)
1/ get the cis contact count
With cooltools.coverage() or with expected_cis()
cooltools.coverage(some_cool_file, store=True, ignore_diags=0)
some_cool_file.info['cis']
# >>> 309426839
cooltools.coverage(some_cool_file, store=True)
some_cool_file.info['cis']
# >>> 210709980 # because --ignore-diags = 2 in cooler balance
np.sum(some_cool_file.bins()[:]['cov_cis_raw'])
# >>> 618853678
np.sum(some_cool_file.bins()[:]['cov_cis_raw'])/2
# >>> 309426839.0
expected_temp = cooltools.expected_cis(some_cool_file, clr_weight_name = None, ignore_diags=0, view_df=gnm_arms, chunksize=1000000, nproc = nproc, smooth=False, aggregate_smoothed=False)
np.sum(expected_temp['count.sum'])
# >>> 309426839.0
Counts doubled in cov_cis_raw as already reported here
2 / Downsampling
I have unbalanced cool file as
some_cool_file.bins()[:]
chrom start end cov_cis_raw cov_tot_raw
0 chr1 0 10000 0 0
1 chr1 10000 20000 0 0
2 chr1 20000 30000 0 0
3 chr1 30000 40000 0 0
4 chr1 40000 50000 0 0
... ... ... ... ... ...
272561 chrY 91720000 91730000 0 0
272562 chrY 91730000 91740000 0 0
272563 chrY 91740000 91744698 0 0
272564 chrM 0 10000 258 3197
272565 chrM 10000 16299 270 3362
[272566 rows x 5 columns]
That I want to downsample:
cooltools.sample(clr = some_cool_file, out_clr_path = dnsmpl_cool_file, cis_count = 250000000)
Traceback (most recent call last):
File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/cooltools/api/coverage.py", line 117, in coverage
else clr._load_attrs(clr.root.rstrip("/") + "/bins/weight")["ignore_diags"]
File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/cooler/api.py", line 116, in _load_attrs
return dict(grp[path].attrs)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/h5py/_hl/group.py", line 357, in __getitem__
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'weight' doesn't exist)"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/cooltools/api/sample.py", line 102, in sample
cis_total = clr.info.get("cis", np.sum(coverage(clr)[0] // 2, dtype=int))
File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/cooltools/api/coverage.py", line 120, in coverage
raise ValueError(
ValueError: Please, specify ignore_diags and/or IC balance this cooler! Cannot access the value used in IC balancing.
- Is it possible to sample contact on raw matrices? What is the point of doing downsampling after balancing?
Also
ValueError: Please, specify ignore_diags and/or IC balance this cooler! Cannot access the value used in IC balancing. '
But ignore_diags is not an argument from sample()
cooltools.sample(clr = cooler_list[ndx], out_clr_path = dnsmpl_cool_file, cis_count = cis_contact_trgt, ignore_diags=0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: sample() got an unexpected keyword argument 'ignore_diags'
- To bypass the balancing in sample(), can I duplicate 'cov_cis_raw' columns in .bins() and call it 'weight' so it fakes balance cool file and downsample on raw cis contacts? And then I'll need to balance it anyway after the downsampling.
- Usually, is downsampling done on full matrix or do you ignore first(s) diagonal for some reason?
- additional question:
store (bool, optional) – If True, store the results in the input cooler file when finished. Does it mean that the result is stored in the python variable or directly modified in the cool file? I am not sure to understand whether the matrices are loaded and only loaded version is modified or if the original file it-self is modified.
Thanks for the support, David
# session_info.show()
# -----
# bbi 0.3.5
# bioframe 0.4.1
# click 8.1.4
# cooler 0.9.2
# coolpuppy NA
# cooltools 0.5.4
# cytoolz 0.12.1
# ipywidgets 8.0.7
# matplotlib 3.7.2
# mpl_toolkits NA
# numpy 1.24.4
# pandas 1.5.3
# pysam 0.21.0
# scipy 1.11.1
# seaborn 0.12.2
# session_info 1.0.0
# -----
# Python 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0]
# Linux-5.3.18-150300.59.60-default-x86_64-with-glibc2.31
# -----
# Session information updated at 2023-07-28 12:37
These are good questions, although there is a lot to unpack here. I think the most important idea I can give you is to calculate coverage with ignore_diags=0
, I think it should solve all your problems...
The values are indeed stored in the file, and .sample()
uses the cis count when it is available in the file. When not, it calculates the coverage with default arguments (! perhaps not ideal and should be changed to ignore_diags=0 !) and uses that.
discussion 2023/10/9:
perhaps we should deprecate the usage of stored cis counts in sample()
, as it is not obvious how to link this stored value with the number of diagonals ignored in previous coverage()
calculation.
https://github.com/open2c/cooltools/blob/0a0d8417099f182e1e24b3897fdf41de6b08844a/cooltools/api/sample.py#L100