pyhf Large cls value difference between v0.6.3 and master

Summary

When I try to run

$ pyhf cls --test-poi 0.16 signal.json

(file provided to developers) with pyhf, the result depends on the version. For versions before https://github.com/scikit-hep/pyhf/pull/1639, the CLS_exp is about 50% lower than for versions on or after that commit. Testing revealed that this is likely due to the handling of staterrors, as pruning these removed the discrepancy.

OS / Environment

cat /etc/os-release
NAME="Scientific Linux"
VERSION="7.9 (Nitrogen)"
ID="scientific"
ID_LIKE="rhel centos fedora"
VERSION_ID="7.9"
PRETTY_NAME="Scientific Linux 7.9 (Nitrogen)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:scientificlinux:scientificlinux:7.9:GA"
HOME_URL="http://www.scientificlinux.org//"
BUG_REPORT_URL="mailto:[email protected]"

REDHAT_BUGZILLA_PRODUCT="Scientific Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.9
REDHAT_SUPPORT_PRODUCT="Scientific Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.9"

Steps to Reproduce

pip install --ignore-installed -U "git+https://github.com/scikit-hep/pyhf.git@9fbbbf9740175fcdfe73fedc4351953be18d2664"
pyhf cls --test-poi 0.16 signal.json

differs from

pip install --ignore-installed -U "git+https://github.com/scikit-hep/pyhf.git@5ea4e0a7751aa446c286b9697e01a9439622ebb5
pyhf cls --test-poi 0.16 signal.json

File Upload (optional)

pyhf devs sent internal file.

Expected Results

I expect the behavior to only change slightly between pyhf versions.

Actual Results

The behavior changes up to 50%.

pyhf Version

pyhf, version 0.6.4.dev33
pyhf, version 0.6.4.dev34

Code of Conduct

[X] I agree to follow the Code of Conduct

Dec 03 '21 17:12 mdhank

For this we're comparing to v0.6.3 and the current state of master.

@kratsg has been able to identify the issue with with staterror as pruning modifier types of staterror from the workspace in question removes the difference:

commit a8d34462:

$ pyhf prune -t staterror ws.json | pyhf cls --test-poi 0.16 --backend torch
{
    "CLs_exp": [
        0.0011805804778050168,
        0.0074845548029529455,
        0.04146623165427494,
        0.17762055416827052,
        0.4957891760138955
    ],
    "CLs_obs": 0.5105860394966024
}

master:

$ pyhf prune -t staterror ws.json | pyhf cls --test-poi 0.16 --backend torch
{
    "CLs_exp": [
        0.0011805782059676592,
        0.007484543587818926,
        0.04146618658379227,
        0.17762042937331468,
        0.49578899186050396
    ],
    "CLs_obs": 0.5105860277468133
}

and @lukasheinrich has been able to isolate the difference to these lines

https://github.com/scikit-hep/pyhf/blob/b798ef23c77db6b1d6d606f4b360a5f47fd63dd2/src/pyhf/modifiers/staterror.py#L116-L117

added in PR #1639. We'll work on resolving this soon.

Dec 03 '21 18:12 matthewfeickert

Is it know which behavior is correct? If the old is known to be fine, I can just run with that version.

Dec 03 '21 18:12 mdhank

Is it know which behavior is correct? If the old is known to be fine, I can just run with that version.

I can't say with certainty yet until we finish diagnosing things, but if I had to guess the v0.6.3 behavior is correct.

Dec 05 '21 00:12 matthewfeickert

Hi @mdhank - the prior behavoir is correct and we know where the bug is, just requires implementing a fix. I'll update here once we have one

Dec 05 '21 13:12 lukasheinrich

Given the comments above, I guess the source is already clear, but just in case here is a simpler reproducer:

import pyhf

spec = {
    "channels": [
        {
            "name": "channel",
            "samples": [
                {"data": [5.0], "modifiers": [], "name": "signal"},
                {
                    "data": [1.0],
                    "modifiers": [
                        {
                            "data": [1.5],  # value irrelevant
                            "name": "NP",
                            "type": "staterror",
                        },
                    ],
                    "name": "bkg",
                },
            ],
        }
    ],
    "measurements": [
        {
            "config": {
                "parameters": [],
                "poi": "",
            },
            "name": "meas",
        }
    ],
    "observations": [{"data": [0], "name": "channel"}],
    "version": "1.0.0",
}


inits = [2.0]  # staterror / shapesys value

# staterror version
model = pyhf.Workspace(spec).model()
print(f"staterror: {model.expected_data(inits, include_auxdata=False)}")

# shapesys version
spec["channels"][0]["samples"][1]["modifiers"][0]["type"] = "shapesys"
model = pyhf.Workspace(spec).model()
print(f"shapesys: {model.expected_data(inits, include_auxdata=False)}")

which with 9fbbbf97 will print

staterror: [12.]
shapesys: [7.]

and with 5ea4e0a7 will print

staterror: [7.]
shapesys: [7.]

Both modifier types are multiplicative, so the results should match. After the refactor in #1639, the staterror modifier mistakenly acts on both samples.

Mar 04 '22 17:03 alexander-held

Hi @mdhank - the prior behavoir is correct and we know where the bug is, just requires implementing a fix. I'll update here once we have one

The issue is understood to be localized around this line: https://github.com/scikit-hep/pyhf/blob/b798ef23c77db6b1d6d606f4b360a5f47fd63dd2/src/pyhf/modifiers/staterror.py#L116-L117

Mar 04 '22 17:03 kratsg