Initializing histograms from arrays of bin values

Open alexander-held opened this issue 4 years ago • 14 comments

Is it possible to set the bin contents (and variances, for storage=bh.storage.Weight()) of a bh.Histogram() directly without having to .fill() in events?

I assume that aghast could be used to create a histogram from this information and then convert it. Is there a more direct way?

For context: I am considering using bh.Histogram or possibly the hist version as a container for histograms in another library. For this I would like to have the flexibility to create histograms in multiple ways.

alexander-held avatar Jul 15 '20 16:07 alexander-held

Two ways:

import boost_histogram as bh
import numpy as np

h = bh.Histogram(bh.axis.Regular(10, 0, 1), storage=bh.storage.Weight())
values = np.ones(10)
variances = np.ones(10) * 0.1

Using the shortcut for Weight Histograms:

h[...] = np.stack([values, variances], axis=-1)

For an AxBxC histogram, you need an AxBxCx2 array. You can optionally include the flow bins, and those will get set too.

Using the view method:

h.view().value = values
h.view().variance = variances

You can pass flow=True to .view() if you want to set the flow bins too.

henryiii avatar Jul 15 '20 17:07 henryiii

@alexander-held I use this feature myself. I fit invariant-mass distributions in eta and pt of some particle. Then I store the fitted particle yield and its uncertainty squared as "values" and "variances" in a histogram with Weight() storage. This way, the yields are not only nicely organized, but I can also add yields from fits of subsections of the data simply by adding the histograms.

HDembinski avatar Jul 16 '20 12:07 HDembinski

Thanks for the quick reply! There is no way to instantiate a Histogram and fill the value and variance at the same time, right? Something like this:

h = bh.Histogram(bh.axis.Regular(10,0,1), storage=bh.storage.Weight(), values=values, variances=variances)

This might be convenient to have (maybe something for hist?).

alexander-held avatar Jul 16 '20 16:07 alexander-held

No, but long constructors with many keywords are not a good design. It is not significantly more efficient to do this instead of assigning.

HDembinski avatar Jul 16 '20 17:07 HDembinski

Comment from the :peanuts: gallery: if you are going to have many end-users all extending the class to add a sub-set of the keyword arguments you can end up with a mess of slightly different sub-classes floating around the world. It could be better to standardize on how to manage that bit of complexity. You could do something like

def __init__(self, ..., **kwargs):
     # current init
     for k, v in kwargs.items():
         setattr(self, k, v)

It is not much code and makes the API a bit more ergonomic for your users.
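
Spelled out as a runnable sketch, with a hypothetical Container class standing in for the real one (the name and its attributes are made up for illustration):

```python
class Container:
    """Hypothetical class, used only to illustrate the **kwargs pattern."""

    def __init__(self, name, **kwargs):
        # Stands in for the "current init".
        self.name = name
        self.values = None
        self.variances = None
        # Forward every remaining keyword to an attribute of the same name.
        for key, value in kwargs.items():
            setattr(self, key, value)


c = Container("yields", values=[1.0, 2.0], variances=[0.1, 0.2])
print(c.values, c.variances)
```

One trade-off of this pattern is that a misspelled keyword silently becomes a new attribute instead of raising an error.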

tacaswell avatar Jul 16 '20 18:07 tacaswell

I was surprised that the default bh.storage.Double() storage changes the API and I could not use h.view().value anymore to set bin contents. h[...] = [...] still works. Does this mean I should prefer h[...] = ... over using .view()?

alexander-held avatar Jul 17 '20 13:07 alexander-held

.view() returns a view of the storage. It's always an ndarray, though for accumulator storages it also behaves like a RecArray/RecordArray (I forget the naming scheme), which allows attribute access into the dtype fields. That's why .view().value works. If you want to do the same thing with a simpler dtype, you have to set the contents (this is a Python limitation: you can't assign to a function call):

h.view()[...] = np.array(...)

This is actually good, because it emphasizes that you are changing the contents rather than changing the object. You can do the same thing with .value as well, I believe: view().value[...].

Note: I am horribly mixing meanings above. The first ... is the literal Python Ellipsis object, while the second one means "whatever you want to put in the array".

For a general way to do this regardless of the backend storage, see #423 - but that won't work very well nor is it designed for setting values on Profiles. Maybe we should expose and provide nice constructors or shortcuts for the AccumulatorViews?

henryiii avatar Jul 17 '20 17:07 henryiii

PS: of course, you can use h[...] =, and it works well, though you have to build in an extra dimension for the weighted storage (or you can use the actual dtype), while the simpler storages have a simple dtype.

henryiii avatar Jul 17 '20 17:07 henryiii

Thank you @tacaswell for chiming in and making a valid point. My background is in C++ and there constructors with many arguments are frowned upon, hence my reaction...

HDembinski avatar Jul 18 '20 12:07 HDembinski

If we ever add the ability to use existing memory allocated in Python as the Storage's memory, this would suddenly make histograms initialized this way more efficient (and would have a side effect that the array passed in would start being the one that changes).

henryiii avatar Jul 19 '20 04:07 henryiii

Note: .view() does what it says it does - it returns a view of the underlying data. If it's non-simple, the view is non-simple, as it should be.

I've proposed an API for accessing the values() and variances() (if present) of all storages in #423 - it comes up in the context of making a standard API for "PlottableHistogram"s, which would be really nice if Scikit-HEP histograms could follow. It would also be useful though, as in your case, in making a standard way to get the values/variances regardless of the Histogram storage.

henryiii avatar Jul 19 '20 04:07 henryiii

I'm intending to implement this in Hist first (it's already available for pandas data frame storages, actually, but that one can't be upstreamed to boost-histogram because it depends on named axes). I'd like a classmethod, like the one used for that feature, but in this case having the other args (axes, storage, etc.) is useful, so it is probably better as a keyword-only argument instead of forwarding a lot of stuff.

henryiii avatar Mar 09 '21 19:03 henryiii

Unless I'm mistaken, neither h[...] = np.stack([values, variances], axis=-1) nor h.view().value = values is mentioned in the documentation. These are very useful, so it would be great if you could add them.

dcervenkov avatar Jun 16 '21 10:06 dcervenkov

Update: this is supported in Hist (as you can see from the linked merged PR above). I'd be fine to upstream it (as with any non-dependency addition to Hist) if @HDembinski would like a data= argument in bh.Histogram. It would not reuse memory currently, but if we were able to add that later, you'd immediately get the benefit of it without changing your code.

henryiii avatar Sep 15 '21 15:09 henryiii