Merging samples for model prediction
Discussed in https://github.com/scikit-hep/cabinetry/discussions/501
Originally posted by meschrei February 12, 2025
Hi, I just wanted to ask if there is an easy way to get the summed yields and their uncertainties for a subset of samples in cabinetry. Suppose you have a couple of samples for your signal or background that you want to be treated individually by the fit, but in the end you only care about the total background or signal yields. I saw that you can easily get the total, but that would contain signal as well as background in this example. Summing the yields is relatively easy, but getting the correct uncertainties is probably a bit more tricky.
So, is there a way to combine a subset of samples after the fit procedure in this way?
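For context, here is a minimal sketch of what I mean, assuming the standard cabinetry workflow (the workspace `spec` and the sample names are placeholders):

```python
import numpy as np

import cabinetry

# standard workflow: build model, fit, get post-fit prediction
# ("spec" is assumed to be a pyhf workspace specification loaded elsewhere)
model, data = cabinetry.model_utils.model_and_data(spec)
fit_results = cabinetry.fit.fit(model, data)
prediction = cabinetry.model_utils.prediction(model, fit_results=fit_results)

# summing yields over a subset of samples per channel is straightforward ...
bkg_indices = [
    i for i, s in enumerate(model.config.samples) if s in ["bkg_1", "bkg_2"]
]
bkg_yields = [
    np.sum(np.asarray(channel_yields)[bkg_indices], axis=0)
    for channel_yields in prediction.model_yields
]

# ... but the matching uncertainty for that partial sum is not readily available,
# since the uncertainties in the prediction are calculated without knowledge of
# this grouping
```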
See #501 for some more discussion. This would presumably require changing `ModelPrediction.model` and replacing that by just the relevant metadata pieces we need and can edit more easily (e.g. changing the list of samples).
I am happy to be assigned this. I guess this is a change after the fit, where the correct summing and propagation of uncertainties need to be done and all metadata updated accordingly. Is that correct?
Steps to achieve the implementation via the API:
- [ ] The `model_utils.prediction` function needs to accept an object explaining the grouping (see the sketch right after this list)
- [ ] The yields and uncertainties need to be combined correctly
- [ ] The `pyhf.Model` needs to be correctly manipulated to reflect the new list of samples
- [ ] Implementation needs to be harmonised with sample grouping via a configuration
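A minimal sketch of what the grouping object could look like from the user's perspective; the `merge_samples` keyword argument and the dictionary format are assumptions and do not exist in cabinetry yet:

```python
import cabinetry

# hypothetical grouping object: map a new merged sample name to the samples it absorbs
# (sample names are placeholders)
merge_samples = {"total_background": ["bkg_1", "bkg_2", "bkg_3"]}

# "model" and "fit_results" are assumed to come from the usual workflow;
# the "merge_samples" keyword is only a possible interface, not existing API
prediction = cabinetry.model_utils.prediction(
    model, fit_results=fit_results, merge_samples=merge_samples
)
```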
Things to think about:
- Should this implementation only be applied to the model post-fit? If we want to group samples pre-fit, maybe the implementation needs to happen when actually building the `pyhf` model in `model_utils.model_and_data`.
- ?
From what I've seen it is difficult to change the `pyhf.Model` object dynamically. It is also not generally possible to build a new model that internally has samples merged but behaves correctly, e.g. in the case where a normalization factor acts on only a part of the samples that get merged. The samples would need to stay split in the model for pyhf to support this correctly. Instead, what I was thinking about is to change the model within the `ModelPrediction` to not be the `pyhf.Model` but instead a very lightweight container that behaves like `model.config` and returns all the information that gets accessed when operating on the `ModelPrediction` object subsequently. There we would then be free to change things like the list of samples to what we need.
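To make the normalization factor point concrete, here is a small standalone pyhf example (sample and parameter names are made up), where merging the two backgrounds inside the model would change what the normalization factor acts on:

```python
import pyhf

# two small backgrounds that one might want to merge for plotting;
# a normfactor only acts on the first one
spec = {
    "channels": [
        {
            "name": "SR",
            "samples": [
                {
                    "name": "small_bkg_1",
                    "data": [10.0, 5.0],
                    "modifiers": [
                        {"name": "mu_bkg1", "type": "normfactor", "data": None}
                    ],
                },
                {"name": "small_bkg_2", "data": [8.0, 4.0], "modifiers": []},
            ],
        }
    ]
}
model = pyhf.Model(spec, poi_name=None)

# merging the two samples into a single "Other" sample with yields [18.0, 9.0] and
# attaching mu_bkg1 to it would make the normfactor scale small_bkg_2 as well,
# so the merged model would no longer be equivalent to the original one
```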
I think the implementation should be done in a way that works pre-fit as well. `model_utils.prediction` could take an optional keyword argument for that and pass it through to `model_utils.yield_stdev`.
Here is something I wrote recently to merge samples for plotting, but it does not get the per-sample uncertainties correct, because to calculate those the merge needs to happen before the `model_utils.yield_stdev` calculation.
```python
import numpy as np

import cabinetry

# need to:
# - partially merge samples
# - change model to have different list of sample names
# WARNING: this does NOT update the uncertainty for the stack of the samples that get merged,
# evaluating that correctly is impossible at this stage and would require changes in the
# calculation for the ModelPrediction object, however this is not needed when only plotting
# the uncertainty for the total model prediction, such as in a normal data/MC plot

# "prediction" is assumed to be an existing cabinetry ModelPrediction object,
# e.g. from cabinetry.model_utils.prediction(model, fit_results=fit_results)

# partially merge samples
new_model_yields = []
SAMPLES_TO_SUM = ["small_bkg_1", "small_bkg_2"]
sample_indices = [
    idx
    for idx, sample in enumerate(prediction.model.config.samples)
    if sample in SAMPLES_TO_SUM
]
WHERE_TO_INSERT = 1  # this is where in the stack the new merged sample goes, 0 is at the bottom

for i_ch, channel in enumerate(prediction.model.config.channels):
    # for each channel, sum together the desired samples
    summed_sample = np.sum(
        np.asarray(prediction.model_yields[i_ch])[sample_indices], axis=0
    )
    # build set of remaining samples and remove the ones already summed
    remaining_samples = np.delete(prediction.model_yields[i_ch], sample_indices, axis=0)
    model_yields = np.insert(remaining_samples, WHERE_TO_INSERT, summed_sample, axis=0)
    new_model_yields.append(model_yields)


# change model to have different list of sample names
# we cannot easily change that list in the pyhf object, so we build a fake object
# that provides what we need
class MockConfig:
    def __init__(self, model, merged_sample):
        self.channel_nbins = model.config.channel_nbins
        self.channels = model.config.channels
        self.channel_slices = model.config.channel_slices
        # update list of samples: drop the merged ones and insert the combined sample
        samples = np.delete(model.config.samples, sample_indices)
        samples = np.insert(samples, WHERE_TO_INSERT, merged_sample, axis=0)
        self.samples = samples.tolist()


class MockModel:
    def __init__(self, model, merged_sample):
        self.config = MockConfig(model, merged_sample)


MERGED_SAMPLE_NAME = "Other"
new_model = MockModel(prediction.model, MERGED_SAMPLE_NAME)

# build new model prediction object that behaves as intended
new_prediction = cabinetry.model_utils.ModelPrediction(
    new_model,
    new_model_yields,  # with samples merged
    prediction.total_stdev_model_bins,
    prediction.total_stdev_model_channels,
    prediction.label,
)
```
The sample summing could happen in `model_utils.yield_stdev` instead. The `MockConfig` here exists to change the `ModelPrediction` so that it has the "correct" list of samples for subsequent operations, e.g. when `model.config.samples` is called to read that out for plotting. This is a fragile solution, so replacing the model completely with something that only carries the relevant information seems better to me. From what I remember, the reason I originally added the `pyhf.Model` to `ModelPrediction` was mostly that it meant not inventing a new data structure that carries all the relevant information. The use case here seems good enough to me to change that approach though.
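One possible shape for such a lightweight replacement, just to illustrate the idea (the class and field names below are made up and not existing cabinetry code):

```python
from dataclasses import dataclass
from typing import Dict, List

import pyhf


@dataclass
class _PredictionConfig:
    """Hypothetical container for the model.config metadata read downstream."""

    samples: List[str]
    channels: List[str]
    channel_nbins: Dict[str, int]
    channel_slices: Dict[str, slice]


@dataclass
class _PredictionModelInfo:
    """Hypothetical stand-in for the pyhf.Model currently stored in ModelPrediction."""

    config: _PredictionConfig

    @classmethod
    def from_pyhf_model(cls, model: pyhf.pdf.Model) -> "_PredictionModelInfo":
        # copy the relevant metadata so that e.g. the sample list
        # can be edited freely after merging
        return cls(
            config=_PredictionConfig(
                samples=list(model.config.samples),
                channels=list(model.config.channels),
                channel_nbins=dict(model.config.channel_nbins),
                channel_slices=dict(model.config.channel_slices),
            )
        )
```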