Merging samples for model prediction
Discussed in https://github.com/scikit-hep/cabinetry/discussions/501
Originally posted by meschrei February 12, 2025
Hi, I just wanted to ask if there is an easy way to get the summed yields and their uncertainties for a subset of samples in cabinetry. Suppose you have a couple of samples for your signal or background that you want to be treated individually by the fit, but in the end you only care about the total background or signal yields. I saw that you can easily get the total, but that would contain signal as well as background in this example. Summing the yields is relatively easy, but getting the correct uncertainties is probably a bit more tricky.
So, is there a way to combine a subset of samples after the fit procedure in this way?
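For context, here is a minimal sketch of what I mean, assuming the standard cabinetry workflow (the workspace `spec` and the sample names are placeholders):

```python
import numpy as np

import cabinetry

# standard workflow: build model, fit, get post-fit prediction
# ("spec" is assumed to be a pyhf workspace specification loaded elsewhere)
model, data = cabinetry.model_utils.model_and_data(spec)
fit_results = cabinetry.fit.fit(model, data)
prediction = cabinetry.model_utils.prediction(model, fit_results=fit_results)

# summing yields over a subset of samples per channel is straightforward ...
bkg_indices = [
    i for i, s in enumerate(model.config.samples) if s in ["bkg_1", "bkg_2"]
]
bkg_yields = [
    np.sum(np.asarray(channel_yields)[bkg_indices], axis=0)
    for channel_yields in prediction.model_yields
]

# ... but the matching uncertainty for that partial sum is not readily available,
# since the uncertainties in the prediction are calculated without knowledge of
# this grouping
```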
See #501 for some more discussion. This would presumably require changing `ModelPrediction.model` and replacing that by just the relevant metadata pieces we need and can edit more easily (e.g. changing the list of samples).
I am happy to be assigned this. I guess this is a change after the fit, where the correct summing and propagation of uncertainties need to be done and all metadata updated accordingly. Is that correct?
Steps to achieve the implementation via the API:
- [ ] The `model_utils.prediction` function needs to accept an object explaining the grouping (see the sketch right after this list)
- [ ] The yields and uncertainties need to be combined correctly
- [ ] The `pyhf.Model` needs to be correctly manipulated to reflect the new list of samples
- [ ] Implementation needs to be harmonised with sample grouping via a configuration
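A minimal sketch of what the grouping object could look like from the user's perspective; the `merge_samples` keyword argument and the dictionary format are assumptions and do not exist in cabinetry yet:

```python
import cabinetry

# hypothetical grouping object: map a new merged sample name to the samples it absorbs
# (sample names are placeholders)
merge_samples = {"total_background": ["bkg_1", "bkg_2", "bkg_3"]}

# "model" and "fit_results" are assumed to come from the usual workflow;
# the "merge_samples" keyword is only a possible interface, not existing API
prediction = cabinetry.model_utils.prediction(
    model, fit_results=fit_results, merge_samples=merge_samples
)
```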
Things to think about:
- Should this implementation only be applied to the model post-fit? If we want to group samples pre-fit, maybe the implementation needs to happen when actually building the `pyhf` model in `model_utils.model_and_data`.
- ?
From what I've seen it is difficult to change the `pyhf.Model` object dynamically. It is also not generally possible to build a new model that internally has samples merged but behaves correctly, e.g. in the case where a normalization factor acts on only a part of the samples that get merged. The samples would need to stay split in the model for pyhf to support this correctly. Instead, what I was thinking about is to change the model within the `ModelPrediction` to not be the `pyhf.Model` but instead a very lightweight container that behaves like `model.config` and returns all the information that gets accessed when operating on the `ModelPrediction` object subsequently. There we would then be free to change things like the list of samples to what we need.
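To make the normalization factor point concrete, here is a small standalone pyhf example (sample and parameter names are made up), where merging the two backgrounds inside the model would change what the normalization factor acts on:

```python
import pyhf

# two small backgrounds that one might want to merge for plotting;
# a normfactor only acts on the first one
spec = {
    "channels": [
        {
            "name": "SR",
            "samples": [
                {
                    "name": "small_bkg_1",
                    "data": [10.0, 5.0],
                    "modifiers": [
                        {"name": "mu_bkg1", "type": "normfactor", "data": None}
                    ],
                },
                {"name": "small_bkg_2", "data": [8.0, 4.0], "modifiers": []},
            ],
        }
    ]
}
model = pyhf.Model(spec, poi_name=None)

# merging the two samples into a single "Other" sample with yields [18.0, 9.0] and
# attaching mu_bkg1 to it would make the normfactor scale small_bkg_2 as well,
# so the merged model would no longer be equivalent to the original one
```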
I think the implementation should be done in a way that works pre-fit as well. `model_utils.prediction` could take an optional keyword argument for that and pass it through to `model_utils.yield_stdev`.
Here is something I wrote recently to merge samples for plotting, but it does not get the per-sample uncertainties correct, because to calculate those the merge needs to happen before the `model_utils.yield_stdev` calculation.
```python
import numpy as np

import cabinetry

# need to:
# - partially merge samples
# - change model to have different list of sample names
# WARNING: this does NOT update the uncertainty for the stack of the samples that get merged,
# evaluating that correctly is impossible at this stage and would require changes in the
# calculation for the ModelPrediction object, however this is not needed when only plotting
# the uncertainty for the total model prediction, such as in a normal data/MC plot

# "prediction" is assumed to be an existing cabinetry ModelPrediction object,
# e.g. from cabinetry.model_utils.prediction(model, fit_results=fit_results)

# partially merge samples
new_model_yields = []
SAMPLES_TO_SUM = ["small_bkg_1", "small_bkg_2"]
sample_indices = [
    idx
    for idx, sample in enumerate(prediction.model.config.samples)
    if sample in SAMPLES_TO_SUM
]
WHERE_TO_INSERT = 1  # this is where in the stack the new merged sample goes, 0 is at the bottom

for i_ch, channel in enumerate(prediction.model.config.channels):
    # for each channel, sum together the desired samples
    summed_sample = np.sum(
        np.asarray(prediction.model_yields[i_ch])[sample_indices], axis=0
    )
    # build set of remaining samples and remove the ones already summed
    remaining_samples = np.delete(prediction.model_yields[i_ch], sample_indices, axis=0)
    model_yields = np.insert(remaining_samples, WHERE_TO_INSERT, summed_sample, axis=0)
    new_model_yields.append(model_yields)


# change model to have different list of sample names
# we cannot easily change that list in the pyhf object, so we build a fake object
# that provides what we need
class MockConfig:
    def __init__(self, model, merged_sample):
        self.channel_nbins = model.config.channel_nbins
        self.channels = model.config.channels
        self.channel_slices = model.config.channel_slices
        # update list of samples: drop the merged ones and insert the combined sample
        samples = np.delete(model.config.samples, sample_indices)
        samples = np.insert(samples, WHERE_TO_INSERT, merged_sample, axis=0)
        self.samples = samples.tolist()


class MockModel:
    def __init__(self, model, merged_sample):
        self.config = MockConfig(model, merged_sample)


MERGED_SAMPLE_NAME = "Other"
new_model = MockModel(prediction.model, MERGED_SAMPLE_NAME)

# build new model prediction object that behaves as intended
new_prediction = cabinetry.model_utils.ModelPrediction(
    new_model,
    new_model_yields,  # with samples merged
    prediction.total_stdev_model_bins,
    prediction.total_stdev_model_channels,
    prediction.label,
)
```
The sample summing could happen in `model_utils.yield_stdev` instead. The `MockConfig` here exists to change the `ModelPrediction` so that it has the "correct" list of samples for subsequent operations, e.g. when `model.config.samples` is called to read that out for plotting. This is a fragile solution, so replacing the model completely with something that only carries the relevant information seems better to me. From what I remember, the reason I originally added the `pyhf.Model` to `ModelPrediction` was mostly that it meant not inventing a new data structure that carries all the relevant information. The use case here seems good enough to me to change that approach though.
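One possible shape for such a lightweight replacement, just to illustrate the idea (the class and field names below are made up and not existing cabinetry code):

```python
from dataclasses import dataclass
from typing import Dict, List

import pyhf


@dataclass
class _PredictionConfig:
    """Hypothetical container for the model.config metadata read downstream."""

    samples: List[str]
    channels: List[str]
    channel_nbins: Dict[str, int]
    channel_slices: Dict[str, slice]


@dataclass
class _PredictionModelInfo:
    """Hypothetical stand-in for the pyhf.Model currently stored in ModelPrediction."""

    config: _PredictionConfig

    @classmethod
    def from_pyhf_model(cls, model: pyhf.pdf.Model) -> "_PredictionModelInfo":
        # copy the relevant metadata so that e.g. the sample list
        # can be edited freely after merging
        return cls(
            config=_PredictionConfig(
                samples=list(model.config.samples),
                channels=list(model.config.channels),
                channel_nbins=dict(model.config.channel_nbins),
                channel_slices=dict(model.config.channel_slices),
            )
        )
```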