
[RF] Support combined Chi2 fits with correlated binned data

Open guitargeek opened this issue 1 month ago • 6 comments

Feature description

This was requested by @TomasDado, and is indeed a feature that is not available in RooFit yet. The motivation is that fits of the same dataset projected onto many observables are becoming increasingly common: fully multidimensional models are very hard to validate, but one still wants to use as much information from the data as possible.

In a combined fit where the same dataset is projected onto multiple 1D distributions that represent the different channels, one has to be careful that the projections are not correlated. Otherwise, one "double counts information" when summing the log-likelihood over channels. In practice, that often means information has to be thrown away, because one has to avoid the wrong results that correlated projections would cause.

One possible remedy is a chi2 approximation in which the yield in each bin is treated as Gaussian. One can then perform a multivariate Gaussian fit that takes the covariance between data bins into account.
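Concretely, the product of per-bin Poisson terms is then replaced by a single multivariate normal density for the vector of observed bin contents, so minimizing the negative log-likelihood amounts to minimizing the generalized chi-square (written here for reference):

   chi2(theta) = (n - nu(theta))^T * C^{-1} * (n - nu(theta))

where n is the vector of observed bin contents, nu(theta) the corresponding model expectations, and C the covariance matrix of the data bins.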

However, RooFit/HistFactory is not made for correlated data bins. Every bin in a dataset is considered independent. It is possible to mock up such a fit using a RooMultiVarGaussian and a dataset with a single entry and a variable for each bin, but this is highly inefficient and doesn't correspond to HistFactory models at all.

We need to find a solution for fitting HistFactory models to datasets consisting of 1D projections together with the correlations between them.

To illustrate the mathematical concept, here is a toy example in which two channels with two bins each are assumed to be 100 % correlated. In this case, the simultaneous fit should give the same result as fitting a single channel, because the second channel contains no additional information. However, one can see that the parameter uncertainties shrink as if the sample size had doubled.

To account for this, one can instead fit a RooMultiVarGaussian with the following correlation matrix. The indices 0 and 1 correspond to the expected values in the two bins of the first channel, and 2 and 3 to the bins of the second channel. The off-diagonal terms fully correlate corresponding bins across the two channels. Note that in practice the off-diagonal terms need to be smaller than one by some epsilon, to avoid a singular matrix.

The 4x4 correlation matrix is as follows:

     |   0 |   1 |   2 |   3 |
-----+-----+-----+-----+-----+
   0 |   1 |   0 |   1 |   0 |
   1 |   0 |   1 |   0 |   1 |
   2 |   1 |   0 |   1 |   0 |
   3 |   0 |   1 |   0 |   1 |

Below is some demo code that shows the concept, with the following output:

Processing script.C...
Fitting single channel:
2000 +/- 44.7209
Fitting both channels with RooSimultaneous:
2000 +/- 31.6226

Fitting uncorrelated Gaussian approximation:
2000 +/- 31.6226

Fitting 50 % correlated Gaussian approximation:
2000 +/- 38.7296

Fitting 100 % correlated Gaussian approximation:
2000 +/- 44.7209

One can see that by tweaking the correlation, the uncertainty interpolates between the case where the channels are assumed to be uncorrelated and the case of 100 % correlation.
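These numbers match the standard formula for averaging two equally precise, correlated measurements: the combined uncertainty is

   sigma_comb = sigma * sqrt((1 + rho) / 2)

which for sigma = 44.72 gives 31.62 at rho = 0, 38.73 at rho = 0.5, and 44.72 at rho = 1, reproducing the output above.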

// script.C
//
// Meant to be run as a ROOT macro (e.g. `root script.C`). The includes below
// are not strictly needed when the macro is interpreted, but they make it
// self-contained for compiled use:

#include <cmath>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

#include <TH1F.h>
#include <TMatrixDSym.h>

#include <RooCategory.h>
#include <RooDataHist.h>
#include <RooDataSet.h>
#include <RooExtendPdf.h>
#include <RooFormulaVar.h>
#include <RooGlobalFunc.h>
#include <RooHelpers.h>
#include <RooHistPdf.h>
#include <RooMultiVarGaussian.h>
#include <RooRealVar.h>
#include <RooSimultaneous.h>
#include <RooWorkspace.h>

// Scale a correlation matrix to a covariance matrix, given the variance of
// each entry: cov(i, j) = corr(i, j) * sqrt(var[i] * var[j]).
TMatrixDSym corrToCov(const TMatrixDSym &corr, const std::vector<double> &var)
{
   int n = corr.GetNrows();
   TMatrixDSym cov(n);

   for (int i = 0; i < n; ++i) {
      for (int j = 0; j <= i; ++j) {
         double value = corr(i, j) * std::sqrt(var[j] * var[i]);
         cov(i, j) = cov(j, i) = value;
      }
   }

   return cov;
}

void createTemplateModel(RooWorkspace &ws, std::vector<double> vals, int index)
{
   auto name = [&](std::string s) { return s + std::to_string(index); };

   RooRealVar x{name("x").c_str(), "", 0., 0., 2.};
   x.setBins(2);

   TH1F h(name("h").c_str(), "", 2, 0.0, 2.0);
   h.SetBinContent(1, vals[0]);
   h.SetBinContent(2, vals[1]);

   // Convert TH1 to a simple template model
   RooDataHist dh(name("dh").c_str(), "", x, &h);
   RooHistPdf histPdf(name("histPdf").c_str(), "", x, dh);
   double n = h.Integral();
   RooRealVar nHist("nHist", "", n, 0.0, 10. * n);
   RooExtendPdf model(name("model").c_str(), "", histPdf, nHist);

   // Let's go Asimov for the dataset
   RooDataHist data{name("data").c_str(), "", x, RooFit::Import(h)};

   ws.import(model, RooFit::Silence());
   ws.import(data, RooFit::Silence());
}

void createCombinedTemplateModel(RooWorkspace &ws)
{
   using namespace RooFit;
   createTemplateModel(ws, {500, 1500}, 0);
   createTemplateModel(ws, {500, 1500}, 1);

   RooCategory sample("sample", "sample", {{"cat0", 0}, {"cat1", 1}});

   RooSimultaneous simPdf("simPdf", "simultaneous pdf", sample);
   simPdf.addPdf(*ws.pdf("model0"), "cat0");
   simPdf.addPdf(*ws.pdf("model1"), "cat1");
   ws.import(simPdf, Silence());

   RooArgSet params{*ws.var("nHist")};
   RooArgSet observables{*ws.var("x0"), *ws.var("x1")};

   ws.saveSnapshot("parameters", params);
   ws.saveSnapshot("observables", observables);

   // Construct combined dataset in (x,sample)
   std::map<std::string, RooDataHist *> dataMap{{"cat0", static_cast<RooDataHist *>(ws.data("data0"))},
                                                {"cat1", static_cast<RooDataHist *>(ws.data("data1"))}};
   RooDataSet combData("combData", "", observables, Index(sample), Import(dataMap));
   ws.import(combData, Silence());
}

void createCombinedCorrelatedModel(RooWorkspace &ws, double correlation)
{

   RooRealVar x1("x1", "x1", 500, 0, 10000);
   RooRealVar x2("x2", "x2", 1500, 0, 10000);
   RooRealVar x3("x3", "x3", 500, 0, 10000);
   RooRealVar x4("x4", "x4", 1500, 0, 10000);

   RooArgSet xset(x1, x2, x3, x4);

   RooRealVar nHist("nHist", "nHist", 2000, 0, 20000);

   RooFormulaVar m1("m1", "nHist * 1./4", nHist);
   RooFormulaVar m2("m2", "nHist * 3./4", nHist);
   RooFormulaVar m3("m3", "nHist * 1./4", nHist);
   RooFormulaVar m4("m4", "nHist * 3./4", nHist);

   RooArgList meanList(m1, m2, m3, m4);

   TMatrixDSym corr(4);
   corr(0, 0) = 1.;
   corr(1, 1) = 1.;
   corr(2, 2) = 1.;
   corr(3, 3) = 1.;
   corr(0, 2) = correlation;
   corr(2, 0) = correlation;
   corr(1, 3) = correlation;
   corr(3, 1) = correlation;

   // The variances passed to corrToCov() are the Poisson variances of the
   // expected yields, i.e. the bin contents themselves.
   RooMultiVarGaussian mvGauss("mvGauss", "", xset, meanList, corrToCov(corr, {500, 1500, 500, 1500}));
   RooDataSet data{"data", "", xset};
   data.add(xset);

   ws.import(mvGauss);
   ws.import(data);
   ws.saveSnapshot("parameters", *ws.var("nHist"));
}

void chi2fit(RooWorkspace &ws, std::string const &modelName, std::string const &dataName)
{
   using namespace RooFit;
   // ws.data() returns a RooAbsData pointer; the cast selects the binned
   // chi2FitTo() overload.
   ws.pdf(modelName)->chi2FitTo(static_cast<RooDataHist&>(*ws.data(dataName)), PrintLevel(-1));
   std::cout << ws.var("nHist")->getVal() << " +/- " << ws.var("nHist")->getError() << std::endl;
   ws.loadSnapshot("parameters");
}

void fit(RooWorkspace &ws, std::string const &modelName, std::string const &dataName)
{
   using namespace RooFit;
   // No cast needed here: fitTo() takes any RooAbsData, and the dataset for
   // the multivariate Gaussian is a RooDataSet, not a RooDataHist.
   ws.pdf(modelName)->fitTo(*ws.data(dataName), PrintLevel(-1));
   std::cout << ws.var("nHist")->getVal() << " +/- " << ws.var("nHist")->getError() << std::endl;
   ws.loadSnapshot("parameters");
}

void script()
{
   auto changeMsgLvl = std::make_unique<RooHelpers::LocalChangeMsgLevel>(RooFit::WARNING);
   {

      RooWorkspace ws{"ws"};
      createCombinedTemplateModel(ws);

      std::cout << "Fitting single channel:" << std::endl;
      chi2fit(ws, "model1", "data1");
      std::cout << "Fitting both channels with RooSimultaneous:" << std::endl;
      chi2fit(ws, "simPdf", "combData");

      std::cout << std::endl;
   }
   {
      RooWorkspace ws{"ws"};
      // zero correlation, equivalent to simultaneous template fit
      createCombinedCorrelatedModel(ws, /*correlation=*/0.0);

      std::cout << "Fitting uncorrelated Gaussian approximation:" << std::endl;
      fit(ws, "mvGauss", "data");

      std::cout << std::endl;
   }
   {
      RooWorkspace ws{"ws"};
      createCombinedCorrelatedModel(ws, /*correlation=*/0.5);

      std::cout << "Fitting 50 \% correlated Gaussian approximation:" << std::endl;
      fit(ws, "mvGauss", "data");

      std::cout << std::endl;
   }
   {
      RooWorkspace ws{"ws"};
      // almost fully correlated to avoid singular matrix
      createCombinedCorrelatedModel(ws, /*correlation=*/1. - 1e-6);

      std::cout << "Fitting 100 \% correlated Gaussian approximation:" << std::endl;
      fit(ws, "mvGauss", "data");

      std::cout << std::endl;
   }
}

guitargeek avatar Nov 27 '25 21:11 guitargeek

@TomasDado, what still needs to be decided before implementing support for this in HistFactory is the interface. How would you expect to pass the dataset with the covariance matrix?

Starting from the usual interface, where simPdf is the model and combData the combined RooDataSet:

std::unique_ptr<RooAbsReal> nll{simPdf->createNLL(*combData)};

Where does the covariance matrix enter? I guess it also needs to be part of the RooWorkspace, so maybe a new data member of the RooAbsData? I'm curious to hear your thoughts!
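Just as a strawman (nothing like this exists today, and setBinCovariance is a made-up name), one could imagine attaching it to the dataset:

combData->setBinCovariance(cov);                  // hypothetical new RooAbsData member
std::unique_ptr<RooAbsReal> nll{simPdf->createNLL(*combData)}; // NLL would pick it up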

In which form do you have the covariance at hand to begin with, by the way?

guitargeek avatar Nov 27 '25 21:11 guitargeek

Thanks for opening the issue and for the example script! Indeed, this is exactly what I had in mind. Regarding the interface, the model should include the correlation/covariance matrix, so yes, it should be part of the RooWorkspace.

What I was thinking about, in terms of HistFactory, is that we already have the "Channel" object in the Measurement object. We could pass a flag telling the code that a given Channel can switch to a Gaussian approximation (BTW this could be useful even on its own for minimisation speed?) and then pass the covariance/correlation matrix. For the matrix itself, I think there are two options: either pass it pair-wise, e.g. Region A + Region B, then Region A + Region C, Region B + Region C..., or pass the full correlation/covariance for all regions with the Gaussian approximation. The latter approach is maybe easier?

For the format, I think passing a TMatrixDSym object should not be a problem. A rough sketch of what I have in mind is below.
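In today's HistFactory terms, the idea would look roughly like this (the calls marked "hypothetical" do not exist and only illustrate where the new pieces could plug in; the rest is the existing API):

RooStats::HistFactory::Measurement meas("meas", "measurement");

RooStats::HistFactory::Channel chanA("RegionA");
// chanA.SetGaussianApproximation(true); // hypothetical per-channel flag
meas.AddChannel(chanA);

RooStats::HistFactory::Channel chanB("RegionB");
// chanB.SetGaussianApproximation(true); // hypothetical per-channel flag
meas.AddChannel(chanB);

// One full covariance matrix for all Gaussian-approximated bins, e.g. built
// with a helper like corrToCov() from the script above:
TMatrixDSym cov(4);
// ... fill cov ...
// meas.SetBinCovariance(cov); // hypothetical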

TomasDado avatar Nov 28 '25 08:11 TomasDado

Hi, thanks for your input! So do I understand correctly that you want the correlation to be part of the Measurement object / model? I thought it needs to be part of the dataset (observed or Asimov), because the correlations are a function of the data?

Let me ask: how do you plan to compute the covariance matrix? That should clear things up.

guitargeek avatar Nov 28 '25 08:11 guitargeek

Hmm, that is a fair point. Indeed, we often estimate the correlation by bootstrapping data (of course our simulation should hopefully give something similar, but only data have the "real" answer). I do think it would make sense to have the correlations as part of the Measurement object, as it will define the model, but tbh we should probably ask some real statistics expert.

TomasDado avatar Nov 28 '25 08:11 TomasDado

There are a few cases that come to mind:

  1. Covariance matrix fully parameterized as a function of model parameters: I think this would in practice imply using simulation for the estimate and having to either parameterize or fully recalculate the covariance matrix at each step of a fit. This might not be very simple in practice, either having to find a good parameterization / interpolation, or having to keep all needed information around and re-evaluate the covariance matrix frequently (see also the note after this list).
  2. Constant covariance matrix: This is an approximation which should simplify the implementation a lot and is probably well motivated in some practical applications. I think by default it would be obtained from simulation, but as a second stage of simplification one could also estimate it from data. I would view this as a trick and conceptually this still feels like a model property to me, with the user handing over a pre-evaluated covariance matrix (and how they got it, simulation or data, is up to them).
  3. Parameterizing the covariance matrix and fitting it in situ: I am not sure about this one. What I have in mind would be parameterizing the covariance matrix in the model, so you either fully determine it from data or you allow some variations similarly to a constraint term with auxiliary data. I think this also means that data needs to be something like a vector of one-hot-encoded vectors to specify for each event which bins it lands in. That part for sure would belong to the dataset. The part that is less clear to me here is where the rest of the information goes. For the usual constraint terms in HistFactory, the auxiliary data conceptually belongs to the data and it captures the outcome of some (potentially fictitious) auxiliary measurement. If we phrase the nominal covariance estimate in that context, it suddenly feels like this should be auxiliary data too. That is different from how cases 1 and 2 here feel to me, where I would instead say this is a model property. Maybe the difference lies in whether the covariance matrix is constant or not and that is enough to warrant a conceptual difference in treatment, or perhaps my view is missing another aspect.
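One more technical note on case 1: if the covariance matrix depends on the model parameters, the Gaussian -2 log L is not just the quadratic form but

   (n - nu(theta))^T * C(theta)^{-1} * (n - nu(theta)) + log det C(theta)

and the log-determinant term must be kept during minimization.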

Thinking some more about case 3: perhaps the traditional equivalent is between having a nuisance parameter for some uncertainty (which can spawn associated auxiliary data) and deciding to hold that parameter fixed at a specific value (e.g. nominal), in which case the auxiliary data no longer matters and disappears. The information has shifted from "data" to the model (or rather model configuration).

alexander-held avatar Nov 28 '25 12:11 alexander-held

This looks like a very interesting idea!

I thought about something similar before for my analysis (before I gave up on using histograms altogether and went directly to classic "optimal observables").

I have not yet digested all of the discussion above and will just add what I thought about back then in case it is helpful for anyone. Apologies, it might be a bit chaotic... :)

I wanted to perform a template fit of EFT parameters (as e.g. TRExFitter can do), but not in 1D, rather on a 5D histogram (not LHC but e+e- -> WW, for which there is basically a complete 5D basis of phase space using production and decay angles). The statistics (in data) would of course not be good enough to do it in 5D, but plain 1D also does not work very well, as the correlations between the angular observables carry important information. However, I figured I would be able to simulate enough events to have a "good" 5D template to compare my 1D data histograms to.

At that point I did not have the idea to solve things with one (big) multivariate Gaussian, but to proceed with the five 1D histograms as if they were different channels and then to also add very many multivariate Gaussians as additional constraints to make the necessary connections between them. Going back to just 2D, the idea was simple: each bin in 1D histogram A is connected to each bin in 1D histogram B by a 2D multivariate Gaussian, and its off-diagonal entry in the covariance matrix is a function of the value of the bin in the "original" 2D histogram that is shared by the two bins in A and B. The 2D histogram would be determined from simulation (assuming it is possible to simulate enough events to have negligible statistical error on the 2D template). As it was supposed to be an EFT fit, the 2D histogram entries would be a quadratic parametrisation of the EFT parameters, and all the individual covariance matrices would also have to be parametrised. I was convinced that this would also be easily extendable to n dimensions (I am not so sure anymore), and I think overall it should be possible to do the same with one "big" covariance matrix as suggested here. In this case, we are in case 1 from above and the covariance matrix would be part of the model.

I mainly gave up because there was no RooMultiVarGaussian that takes a parametrised covariance matrix, and because I did not have the insight to just use one big Gaussian, so it would have been absolutely horrible to implement.
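For what it's worth, for a single pair of bins one could have worked around this by writing the -2 log L of the bivariate normal out by hand, with the correlation as a parameter (a minimal sketch assuming unit variances; all names here are illustrative):

RooRealVar y1("y1", "pull of the shared bin in histogram A", 0., -10., 10.);
RooRealVar y2("y2", "pull of the shared bin in histogram B", 0., -10., 10.);
RooRealVar rho("rho", "parametrised correlation", 0.5, -0.99, 0.99);

// -2 log L of a bivariate normal with unit variances, up to constants
RooFormulaVar nll2x2("nll2x2",
                     "(y1*y1 - 2*rho*y1*y2 + y2*y2)/(1 - rho*rho) + log(1 - rho*rho)",
                     RooArgList(y1, y2, rho));

In the EFT case, rho would itself be a function (e.g. a RooFormulaVar) of the EFT parameters, and the sum of such terms could be minimized directly with RooMinimizer.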

Zehvogel avatar Dec 04 '25 17:12 Zehvogel