Allow `bin_size` as parameter in `values`

Open ghuls opened this issue 11 months ago • 3 comments

`values` only supports `bins`, which can give "weird" results when you request more bins than there are bases in the range.

It is sometimes convenient to specify the size of the bins instead (e.g. 10 bp).

In [56]: bw.values("chr2L" , 1000, 1010)
Out[56]: 
array([0.70637   , 0.70637   , 0.84764397, 0.84764397, 0.84764397,
       0.84764397, 0.84764397, 0.84764397, 0.98891902, 1.13019001])

In [57]: bw.values("chr2L" , 1000, 1010, bins=5)
Out[57]: array([0.70637   , 0.84764397, 0.84764397, 0.84764397, 1.05955452])

In [58]: bw.values("chr2L" , 1000, 1010, bins=6)
Out[58]: 
array([0.70637   , 0.84764397, 0.84764397, 0.84764397, 0.84764397,
       1.13019001])

In [59]: bw.values("chr2L" , 1000, 1010, bins=20)
Out[59]: 
array([       nan, 0.70637   ,        nan, 0.        ,        nan,
       0.84764397,        nan, 0.84764397,        nan, 0.84764397,
              nan, 0.84764397,        nan, 0.84764397,        nan,
       0.        ,        nan, 0.        ,        nan, 0.        ])

In [60]: bw.values("chr2L" , 1000, 1010, bins=15)
Out[60]: 
array([       nan, 0.70637   , 0.        ,        nan, 0.84764397,
       0.84764397,        nan, 0.84764397, 0.84764397,        nan,
       0.84764397, 0.        ,        nan,        nan, 0.        ])
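Until a `bin_size` parameter exists, one workaround is to convert a bin size into a bin count before calling `values`. The sketch below is only an illustration: `values_by_bin_size` is a hypothetical helper, not part of the library, and it assumes `bw` is any object whose `values(chrom, start, end, bins=...)` call behaves like the session above. It refuses ranges that are not an exact multiple of `bin_size`, because an even split into `bins` would then no longer give bins of the requested width.

```python
def values_by_bin_size(bw, chrom, start, end, bin_size):
    """Hypothetical helper: call bw.values() with bins derived from bin_size.

    Assumes bw.values(chrom, start, end, bins=n) splits [start, end)
    into n equal-width bins, as in the session above.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be a positive integer")
    length = end - start
    n_bins, remainder = divmod(length, bin_size)
    if n_bins == 0 or remainder != 0:
        # An even split would not produce bins of exactly bin_size bases.
        raise ValueError(
            f"range length {length} is not a positive multiple of bin_size {bin_size}"
        )
    return bw.values(chrom, start, end, bins=n_bins)
```

With the range from the session above, `values_by_bin_size(bw, "chr2L", 1000, 1010, 2)` would issue the same call as `In [57]` (`bins=5`).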

ghuls avatar Jan 22 '25 17:01 ghuls

Adding `bin_size` is fine, but we should also just make `bins > end - start` work "as expected".

jackh726 avatar Jan 23 '25 20:01 jackh726

Could do this one of two ways:

  1. Assuming "stepped" values, which is basically how it's done today
  2. Do a linear interpolation between values
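To make the difference between the two options concrete, here is a rough pure-Python sketch of upsampling per-base values into more bins than there are bases. It is not how the library does it; the function names, and the choice to sample at bin midpoints and to treat each base's value as sitting at the base's center, are assumptions made for illustration.

```python
import math


def upsample_stepped(values, n_bins):
    """Each output bin takes the value of the base containing its midpoint."""
    n = len(values)
    out = []
    for i in range(n_bins):
        pos = (i + 0.5) * n / n_bins  # bin midpoint in base coordinates
        out.append(values[min(int(pos), n - 1)])
    return out


def upsample_linear(values, n_bins):
    """Interpolate linearly between base centers (located at j + 0.5)."""
    n = len(values)
    out = []
    for i in range(n_bins):
        pos = (i + 0.5) * n / n_bins
        # Clamp to the first/last base center so the edges are not extrapolated.
        x = min(max(pos, 0.5), n - 0.5) - 0.5
        lo = int(math.floor(x))
        hi = min(lo + 1, n - 1)
        t = x - lo
        out.append(values[lo] * (1 - t) + values[hi] * t)
    return out
```

For two bases with values `[0.0, 1.0]` and four bins, the stepped version repeats each value twice, while the linear version ramps between the two base centers.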

jackh726 avatar Jan 23 '25 20:01 jackh726

This raises the question of whether to support a full histogram interface, where you can define your bins by a sequence of edge points.

By analogy to numpy.histogram, though I'm not necessarily recommending this API:

bw.hist(chrom, range=(start, end), bins=n_bins)

binedges = np.linspace(start, end, n_bins + 1, dtype=int)  # n_bins bins need n_bins + 1 edges
bw.hist(chrom, bins=binedges)

binedges = np.arange(start, end, bin_size)
bw.hist(chrom, bins=binedges)
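As a sketch of how the edge sequences for such an interface could be built, the two helpers below compute integer bin edges from either a bin count or a bin size. The `bw.hist` method above is only a proposal in this thread, so nothing here calls it; the helper names are hypothetical, and clipping the last bin to `end` when the range is not a multiple of `bin_size` is one possible convention, not a settled one.

```python
def edges_from_n_bins(start, end, n_bins):
    """Return n_bins + 1 integer edges spanning [start, end]."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    length = end - start
    return [start + (length * i) // n_bins for i in range(n_bins + 1)]


def edges_from_bin_size(start, end, bin_size):
    """Return edges spaced bin_size apart; the last bin is clipped to end."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    edges = list(range(start, end, bin_size))
    edges.append(end)  # close the final (possibly shorter) bin
    return edges
```

Either list could then be passed as the `bins` argument of an edge-based interface, with consecutive pairs of edges defining each bin.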

nvictus avatar Jan 23 '25 20:01 nvictus