shapefactor initial values missing in `json2xml` output
Summary
When trying to pass the XML/root representation of a pyhf workspace to a colleague to allow them to debug it using their preferred ROOT-based framework,
I found that pyhf json2xml does not preserve the measurement.config.parameters associated to shapefactor modifiers.
The ROOT and HF DTD lead me think this should be possible, either by somehow populating the initialShape histo, or maybe writing some ParamSettings elements with corresponding Val elements?
It would be great to fix this, such that non-default shapefactor initialisations are preserved in the XML representation.
OS / Environment
$cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="http://cern.ch/linux/"
BUG_REPORT_URL="http://cern.ch/linux/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Steps to Reproduce
$pyhf json2xml --output-dir mfe_HF mfe.json
pyhf xml2json mfe_HF/FitConfig.xml > mfe_returns.json
# hashes not equal:
pyhf digest mfe.json # sha256:d28009b26fe2ce1fd4b3e55b3aa6911a59058d64485395a1bb0bee49384cdc41
pyhf digest mfe_returns.json # sha256:be205b4f5d312b9f2887072eee2260fb05961bf9ba0fef25acbb95e454e74159
# inspect revels parameter config changed (no roundtrip):
pyhf inspect mfe.json
pyhf inspect mfe_returns.json
File Upload (optional)
Expected Results
I expect the workspaces to round-trip consistently, or at least for the initial shapefactor values to be preserved.
Actual Results
# cat mfe_returns.json
{
"channels": [
{
"name": "channel_0",
"samples": [
{
"data": [
0.0,
1.0,
2.0,
3.0,
4.0
],
"modifiers": [
{
"data": null,
"name": "shape_factor_0",
"type": "shapefactor"
}
],
"name": "sample_0"
},
{
"data": [
1.0,
1.0,
1.0,
1.0,
1.0
],
"modifiers": [
{
"data": null,
"name": "poi",
"type": "normfactor"
}
],
"name": "sample_1"
}
]
},
{
"name": "channel_1",
"samples": [
{
"data": [
10.0,
10.0,
10.0,
10.0,
10.0
],
"modifiers": [
{
"data": null,
"name": "shape_factor_1",
"type": "shapefactor"
}
],
"name": "sample_2"
}
]
}
],
"measurements": [
{
"config": {
"parameters": [
{
"auxdata": [
1.0
],
"bounds": [
[
1.0,
1.0
]
],
"inits": [
1.0
],
"name": "lumi",
"sigmas": [
0.0
]
},
{
"bounds": [
[
0.0,
10.0
]
],
"inits": [
1.0
],
"name": "poi"
}
],
"poi": "poi"
},
"name": "measurement_0"
}
],
"observations": [
{
"data": [
2.0,
6.0,
8.0,
8.0,
6.0
],
"name": "channel_0"
},
{
"data": [
50.0,
40.0,
30.0,
20.0,
10.0
],
"name": "channel_1"
}
],
"version": "1.0.0"
}
pyhf Version
$pyhf --version
pyhf, version 0.7.0
Code of Conduct
- [X] I agree to follow the Code of Conduct
### Tasks
I'm now digging into this, but I believe even the XML spec doesn't allow for this. Do you have an actual example of the XML spec? For example, ShapeFactor has this information in the spec (https://github.com/scikit-hep/pyhf/blob/main/src/pyhf/schemas/HistFactorySchema.dtd#L155-L159) so we can't store information there.
If you look at the Measurement part (https://github.com/scikit-hep/pyhf/blob/main/src/pyhf/schemas/HistFactorySchema.dtd#L22-L39), you see we can add in some nested elements: <POI/>, <ParamSetting/>, and <ConstraintTerm/> . Particularly, ParamSetting is the most likely place to put this information
https://github.com/scikit-hep/pyhf/blob/fb604e89c274aabfc3435e2e0efd504b549b51f3/src/pyhf/schemas/HistFactorySchema.dtd#L45-L52
and I think generally, we'll need to make it so that we write out the Init values if they're not defaults (or always write them out regardless). However, I don't know how this works for parameters with more than one NP -- @alexander-held do you have such an example? Otherwise, I can try to figure out a way to make one in ROOT and export it out and see what the XML looks like.
I don't have an example unfortunately, from a quick test with a higher level library I did not see the init settings get written to the xml at all for shapefactor modifiers.
So I went ahead and went down the rabbit hole here. I don't know what ROOT is doing, and I had a hard time passing in a TH1 for initial shape manually. Here's what I did.
pyhf json2xml mfe.json(for the attached file in the issue) which gives meconfig/anddata/data.root. This latter root file is important.- Create inits histogram and add it into this file
makeInits.py - Create a
Measurementpointing at thisdata/data.rootfile and callPrintXMLon that measurement.
In fact, doing this gives me an invalid HistFactory XML (very confusingly) and I think it's probably a bug in the C++ implementation on the ROOT side (try running hist2workspace meas.xml on the resulting top-level to see the error).
Code/files
makeInits.py
import ROOT
InputFile = "data/data.root"
fp = ROOT.TFile(InputFile, "UPDATE")
inits_sf_0 = ROOT.TH1F("inits_shape_factor_0", "inits_shape_factor_0", 5, 0, 5)
inits_sf_0.SetDirectory(fp)
inits_sf_0.Fill(0, 5)
inits_sf_0.Fill(1, 4)
inits_sf_0.Fill(2, 3)
inits_sf_0.Fill(3, 2)
inits_sf_0.Fill(4, 1)
inits_sf_0.Write()
fp.Close()
makeXML.py
import ROOT
from ROOT.RooStats import HistFactory as HiFa
Measurement = HiFa.Measurement
Sample = HiFa.Sample
Channel = HiFa.Channel
InputFile = "data/data.root"
# Create the measurement
meas = Measurement("meas", "meas")
meas.SetOutputFilePrefix( "./results/example_UsingPython" )
meas.SetPOI( "poi" )
meas.AddConstantParam("alpha_syst1")
meas.AddConstantParam("Lumi")
meas.SetLumi( 1.0 )
meas.SetLumiRelErr( 0.10 )
meas.SetExportOnly( False )
meas.SetBinHigh( 2 )
chan0 = Channel("channel_0")
chan0.SetData("histchannel_0_data", InputFile)
chan0.SetStatErrorConfig( 0.05, "Poisson" )
sample_0 = Sample("sample_0", "histchannel_0_sample_0", InputFile)
sample_0.AddShapeFactor("shape_factor_0")
sample_0.GetShapeFactorList()[0].SetInputFile(InputFile)
sample_0.GetShapeFactorList()[0].SetHistoName("inits_shape_factor_0")
chan0.AddSample( sample_0 )
sample_1 = Sample("sample_1", "histchannel_0_sample_1", InputFile)
sample_1.AddNormFactor( "poi", 1, 0, 10 )
chan0.AddSample( sample_1 )
chan1 = Channel("channel_1")
chan1.SetData("histchannel_1_data", InputFile)
chan1.SetStatErrorConfig( 0.05, "Poisson" )
sample_2 = Sample("sample_2", "histchannel_1_sample_2", InputFile)
sample_2.AddShapeFactor("shape_factor_1")
chan1.AddSample( sample_2 )
# Add and build XML
meas.AddChannel( chan0 )
meas.AddChannel( chan1 )
meas.CollectHistograms()
meas.PrintTree()
meas.PrintXML()
Running all of this gave me the following XML dump
meas.xml
<!--
This xml file created automatically on:
2023-7-19
-->
<!DOCTYPE Combination SYSTEM 'HistFactorySchema.dtd'>
<Combination OutputFilePrefix="./results/example_UsingPython" >
<Input>./meas_channel_0.xml</Input>
<Input>./meas_channel_1.xml</Input>
<Measurement Name="meas" Lumi="1" LumiRelErr="0.1" ExportOnly="False" >
<POI>poi</POI>
<ParamSetting Const="True">alpha_syst1 Lumi</ParamSetting>
</Measurement>
</Combination>
meas_channel_0.xml
<!--
This xml file created automatically on:
2023-7-19
-->
<!DOCTYPE Channel SYSTEM 'HistFactorySchema.dtd'>
<Channel Name="channel_0" InputFile="" >
<Data HistoName="histchannel_0_data" InputFile="data/data.root" HistoPath="" />
<StatErrorConfig RelErrorThreshold="0.05" ConstraintType="Poisson" />
<Sample Name="sample_0" HistoPath="" HistoName="histchannel_0_sample_0" InputFile="data/data.root" NormalizeByTheory="True" >
<ShapeFactor Name="shape_factor_0" InputFile="data/data.root" HistoName="inits_shape_factor_0" HistoPath="" />
</Sample>
<Sample Name="sample_1" HistoPath="" HistoName="histchannel_0_sample_1" InputFile="data/data.root" NormalizeByTheory="True" >
<NormFactor Name="poi" Val="1" High="10" Low="0" Const="False" />
</Sample>
</Channel>
meas_channel_1.xml
<!--
This xml file created automatically on:
2023-7-19
-->
<!DOCTYPE Channel SYSTEM 'HistFactorySchema.dtd'>
<Channel Name="channel_1" InputFile="" >
<Data HistoName="histchannel_1_data" InputFile="data/data.root" HistoPath="" />
<StatErrorConfig RelErrorThreshold="0.05" ConstraintType="Poisson" />
<Sample Name="sample_2" HistoPath="" HistoName="histchannel_1_sample_2" InputFile="data/data.root" NormalizeByTheory="True" >
<ShapeFactor Name="shape_factor_1" />
</Sample>
</Channel>
My thought is we can at least ameliorate part of this situation by writing the ShapeFactor inits into a histogram in data/data.root when exporting to XML to at least make it easier for you to load it in with manually written code. That seems to be the only "missing" thing I thing we can do with pyhf, but it won't solve the roundtrip issue because the XML specification doesn't support serializing this information.