bigtools
Allow `bin_size` as a parameter in `values`
`values` only supports `bins`, which can give "weird" results when you request more bins than there are bases in the range.
It is sometimes convenient to specify the size of the bins instead (e.g. 10 bp).
In [56]: bw.values("chr2L" , 1000, 1010)
Out[56]:
array([0.70637 , 0.70637 , 0.84764397, 0.84764397, 0.84764397,
0.84764397, 0.84764397, 0.84764397, 0.98891902, 1.13019001])
In [57]: bw.values("chr2L" , 1000, 1010, bins=5)
Out[57]: array([0.70637 , 0.84764397, 0.84764397, 0.84764397, 1.05955452])
In [58]: bw.values("chr2L" , 1000, 1010, bins=6)
Out[58]:
array([0.70637 , 0.84764397, 0.84764397, 0.84764397, 0.84764397,
1.13019001])
In [59]: bw.values("chr2L" , 1000, 1010, bins=20)
Out[59]:
array([ nan, 0.70637 , nan, 0. , nan,
0.84764397, nan, 0.84764397, nan, 0.84764397,
nan, 0.84764397, nan, 0.84764397, nan,
0. , nan, 0. , nan, 0. ])
In [60]: bw.values("chr2L" , 1000, 1010, bins=15)
Out[60]:
array([ nan, 0.70637 , 0. , nan, 0.84764397,
0.84764397, nan, 0.84764397, 0.84764397, nan,
0.84764397, 0. , nan, nan, 0. ])
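Until something like `bin_size` exists, one workaround is to fetch the per-base values and aggregate them in NumPy. This is only a sketch: `mean_by_bin_size` is a hypothetical helper, not part of bigtools, and it works on a plain array such as the one returned by the first call above.

```python
import numpy as np

def mean_by_bin_size(per_base, bin_size):
    """Average per-base values in consecutive bins of `bin_size` bases.

    Hypothetical helper, not a bigtools API. The last bin is shorter
    when len(per_base) is not a multiple of bin_size.
    """
    per_base = np.asarray(per_base, dtype=float)
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    if per_base.size == 0:
        return np.empty(0)
    # Offsets of the first base of every bin.
    starts = np.arange(0, per_base.size, bin_size)
    # Sum each slice per_base[starts[i]:starts[i + 1]] (last one runs to the end).
    sums = np.add.reduceat(per_base, starts)
    # Number of bases that went into each sum.
    counts = np.diff(np.append(starts, per_base.size))
    return sums / counts

# Per-base values copied from Out[56] above.
per_base = [0.70637, 0.70637, 0.84764397, 0.84764397, 0.84764397,
            0.84764397, 0.84764397, 0.84764397, 0.98891902, 1.13019001]
print(mean_by_bin_size(per_base, 2))
```

With `bin_size=2` this gives the same five numbers as `bins=5` in Out[57], up to float rounding.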
Adding `bin_size` is fine, but we should also just make `bins > end - start` work "as expected".
Could do this one of two ways:
- Assuming "stepped" values, which is basically how it's done today
- Do a linear interpolation between values
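To make the two options concrete, here is a rough NumPy sketch of what each would return when more bins are requested than there are bases. `upsample` is a hypothetical name and the bin-center convention is my assumption, not how bigtools computes anything today.

```python
import numpy as np

def upsample(per_base, n_bins, mode="step"):
    """Resample per-base values onto n_bins equal-width bins.

    Hypothetical illustration only. Each output bin is evaluated at its
    center, expressed in base offsets from the start of the range.
    """
    per_base = np.asarray(per_base, dtype=float)
    n = per_base.size
    # Center of every requested bin, in base coordinates [0, n).
    centers = (np.arange(n_bins) + 0.5) * n / n_bins
    if mode == "step":
        # Stepped: each bin takes the value of the base it falls in.
        idx = np.minimum(centers.astype(int), n - 1)
        return per_base[idx]
    if mode == "linear":
        # Linear: interpolate between base centers; the edges are clamped.
        base_centers = np.arange(n) + 0.5
        return np.interp(centers, base_centers, per_base)
    raise ValueError(f"unknown mode: {mode!r}")

print(upsample([1.0, 2.0], 4, mode="step"))    # [1. 1. 2. 2.]
print(upsample([1.0, 2.0], 4, mode="linear"))  # [1.   1.25 1.75 2.  ]
```

Either way the output has no `nan` gaps, which is the main thing that looks wrong in the `bins=20` and `bins=15` results.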
This raises the question of whether to support a full histogram interface, where you can define your bins by a sequence of edge points.
By analogy to numpy.histogram, though I'm not necessarily recommending this API:
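# n_bins equal-width bins over (start, end)
bw.hist(chrom, range=(start, end), bins=n_bins)
# explicit edges: n_bins bins need n_bins + 1 edge points
binedges = np.linspace(start, end, n_bins + 1, dtype=int)
bw.hist(chrom, bins=binedges)
# fixed-width bins; end + 1 keeps `end` as the last edge when bin_size divides the range
binedges = np.arange(start, end + 1, bin_size)
bw.hist(chrom, bins=binedges)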
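For reference, the edge-based semantics can be prototyped on top of per-base values without touching the library. `mean_by_edges` below is a hypothetical stand-in for the proposed `bw.hist`, assuming integer, non-decreasing edges that lie inside the fetched range.

```python
import numpy as np

def mean_by_edges(per_base, start, edges):
    """Mean of per-base values in each half-open bin [edges[i], edges[i + 1]).

    Hypothetical sketch: `per_base` holds one value per base beginning at
    genomic position `start`. Empty bins are reported as nan.
    """
    per_base = np.asarray(per_base, dtype=float)
    # Convert genomic coordinates to offsets into per_base.
    offsets = np.asarray(edges, dtype=int) - start
    out = np.full(len(offsets) - 1, np.nan)
    for i, (lo, hi) in enumerate(zip(offsets[:-1], offsets[1:])):
        if hi > lo:
            out[i] = per_base[lo:hi].mean()
    return out

start, end = 1000, 1010
per_base = np.arange(10, dtype=float)  # stand-in for bw.values(chrom, start, end)
print(mean_by_edges(per_base, start, [1000, 1002, 1010]))  # [0.5 5.5]
```

Unequal bin widths come for free with this interface, which is the main thing it adds over `bins` and `bin_size`.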