boost-histogram icon indicating copy to clipboard operation
boost-histogram copied to clipboard

Multi weight storage

Open Superharz opened this issue 7 months ago • 12 comments

This PR adds a multi weight storage type (called MultiWeight) to boost-histogram. It addresses #83 and is bases on a prototype provided by @HDembinski in https://github.com/boostorg/histogram/issues/211.

It allows one to create a histogram that can store multiple independent weights per bin (the number of weights per bin has to be specified when creating the histogram).

Example

Create and fill a 1 dim histogram that stores 3 weights per bin:

import boost_histogram as bh
import numpy as np
x = np.array([1, 2])
weights = np.array([
        [1, 2, 3],
        [4, 5, 6]
    ])
h = bh.Histogram(bh.axis.Regular(5, 0, 5), storage = bh.storage.MultiWeight(3))
h.fill(x, sample = weights)

h[1] = [1, 2, 3]
h[2] = [4, 5, 6]

Status of this PR

The MultiWeight histograms can be created in python and they can be filled. Pickle also works.

The buffer and view for the histograms has yet to be implemented.

Superharz avatar May 26 '25 11:05 Superharz

Some numbers comparing a 8*8 2D MultiWeight histogram with 20k weights vs 20k normal (single weights) histograms: Filling both with 100k random values takes about 1min 26s for a loop over all 20k single weights histograms vs 1.71s to fill the MultiWeight histogram with the same data. All single weight histograms together take up about 51 MB of RAM on my machine vs the one MultiWeight histogram only taking up about 15.3 MB

Superharz avatar May 26 '25 14:05 Superharz

h.values() does not work yet. However, the important part (for me) is to fill the histograms not to retrieve the data from them. At the end, the data can be converted into a numpy array by a small loop over the histogram:

For a 8*8 2D histogram h with 20k weights:

a = np.zeros((8,8, 20000))
for i in range(8):
    for j in range(8):
        a[i,j] = np.array(h[i,j])

Superharz avatar May 26 '25 16:05 Superharz

Wow, that's quite an impressive patch. I really appreciate the benchmarks, which nicely confirm the expected benefits. I hope you have more time to implement the changes.

I suggest we work on the implementation here and backport it to boostorg/histogram later. In the end, both libraries should be in sync.

HDembinski avatar May 27 '25 08:05 HDembinski

Hi @HDembinski . Thank you for reviewing this. How would you like me to address your comments? Should I do one commit per comment or fix everything and do one commit then?

Superharz avatar May 27 '25 08:05 Superharz

Within a PR you don't need to do clean commits, I won't look at the commit history. Feel free to fix multiple issues in one commit.

HDembinski avatar May 27 '25 08:05 HDembinski

By the way, once this feature is done, I suggest you present this at the next PyHEP https://indico.cern.ch/event/1515852/ or the one next year, depending on how quickly we get this done. It is a major feature, and you deserve recognition for implementing this. I left science, so I don't go to any of these workshops anymore.

HDembinski avatar May 27 '25 08:05 HDembinski

Hi @HDembinski, I addressed the first batch of your comments. For the remaining ones I need some input from your side on how to continue :)

Superharz avatar May 27 '25 15:05 Superharz

h.fill(x, sample = weights)

I didn't notice before, but you use sample to pass the weights, that's breaking interface assumptions. Weights should be passed with the weight keyword. I see that I did that, too, in my demo. I guess it was the easiest hack to make it work, but that's breaking all kinds of interface contracts.

HDembinski avatar Jun 22 '25 12:06 HDembinski

h.fill(x, sample = weights)

I didn't notice before, but you use sample to pass the weights, that's breaking interface assumptions. Weights should be passed with the weight keyword.

I completely agree. This was taken over from your original prototype and it worked, therefore I kept it.

Superharz avatar Jun 22 '25 12:06 Superharz

However, it is kinda nice that it basically works with sample, so maybe it is enough to change the name of the storage to multi_count, so avoid having name "weight" in the name which gives the wrong idea.

HDembinski avatar Jun 22 '25 12:06 HDembinski

@Superharz I considering moving development of this feature back to boostorg/histogram. I find it easier to develop C++ code there. I will merge your changes here to that branch.

HDembinski avatar Jun 22 '25 12:06 HDembinski

Upstream: https://github.com/boostorg/histogram/pull/411. I assume we can work on this, and move to upstream whenever that goes in and gets released. I've fixed up the clang-tidy recommendations, and started adding some of the missing things. I added .nelem, and added it to the repr. It also needs to support .view, .values, etc.

This is far enough from being ready that I'm not going to try to get it in for the next release, but it can probably be in the following one.

henryiii avatar Jul 31 '25 18:07 henryiii