Schema validation crashes when running in an environment without internet access
Summary
In master and the 0.7.0 release candidate, pyhf operations involving model validation crash in offline environments with a `RefResolutionError`. This is a common situation, e.g. on worker nodes of HTC clusters. The bug was introduced after 0.6.3, I think in #1753, where the schema pre-loading was dropped.
OS / Environment
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="http://cern.ch/linux/"
BUG_REPORT_URL="http://cern.ch/linux/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Steps to Reproduce
I don't know a good way to prepare an environment that demonstrates this directly. But the test below exposes the `RefResolver`'s attempt to resolve the schema id through its https URL: it fails against the release candidate/master, but passes on 0.6.3.
```python
from functools import partial

import pytest
import jsonschema

import pyhf


def make_asserting_handler(origin):
    def asserting_handler(*args, **kwargs):
        raise AssertionError(
            f'called URL request handler from {origin} with args={args!r}, kwargs={kwargs!r} '
            'when no call should have been needed'
        )

    return asserting_handler


@pytest.fixture
def no_http_jsonschema_ref_resolving(monkeypatch):
    # Fail loudly if jsonschema's RefResolver tries to fetch anything over http(s).
    asserting_handler = make_asserting_handler('handlers')
    handlers = {
        'https': asserting_handler,
        'http': asserting_handler,
    }
    WrappedResolver = partial(jsonschema.RefResolver, handlers=handlers)
    monkeypatch.setattr('jsonschema.RefResolver', WrappedResolver, raising=True)


def test_preloaded_cache(no_http_jsonschema_ref_resolving):
    spec = {
        'channels': [
            {
                'name': 'singlechannel',
                'samples': [
                    {
                        'name': 'signal',
                        'data': [10],
                        'modifiers': [
                            {'name': 'mu', 'type': 'normfactor', 'data': None}
                        ],
                    },
                    {
                        'name': 'background',
                        'data': [20],
                        'modifiers': [
                            {
                                'name': 'uncorr_bkguncrt',
                                'type': 'shapesys',
                                'data': [30],
                            }
                        ],
                    },
                ],
            }
        ]
    }
    try:
        pyhf.schema.validate(spec, 'model.json')
    except AttributeError:
        # pyhf.schema was introduced after 0.6.3; fall back to the old API
        pyhf.utils.validate(spec, 'model.json')
```
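Saving the above as e.g. `test_preloaded_cache.py` (the filename is arbitrary), the test can be run with:

```console
$ pytest -q test_preloaded_cache.py
```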
Expected Results
I expect schema validation to succeed without crashing even when there is no network access that allows resolving the https schema-ids.
Actual Results
```text
jsonschema.exceptions.RefResolutionError: HTTPSConnectionPool(host='scikit-hep.org', port=443): Max retries exceeded with url: /pyhf/schemas/1.0.0/defs.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2b2bb8457c40>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
```
pyhf Version
pyhf, version 0.7.0rc2
Code of Conduct
- [X] I agree to follow the Code of Conduct
Related to this is a large increase in the execution time of `pyhf.Model.__init__()` due to the network request. For example, using the "Hello World" case from the README, with v0.6.3 I get:
```console
$ python -m timeit -s 'import pyhf' 'pyhf.simplemodels.uncorrelated_background(signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0])'
50 loops, best of 5: 4.36 msec per loop
```
And for v0.7.0rc1:
```console
$ python -m timeit -s 'import pyhf' 'pyhf.simplemodels.uncorrelated_background(signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0])'
1 loop, best of 5: 218 msec per loop
```
For any code that builds many Models often (as some of mine does), this 50x slowdown can suddenly dominate the execution time.
> For any code that builds many Models often (as some of mine does), this 50x slowdown can suddenly dominate the execution time.

The network request should at least be cached for the next time the model is built. Your `timeit` is starting up a new session each time (which may or may not be what the user wants).
> I expect schema validation to succeed without crashing even when there is no network access that allows resolving the https schema-ids.

Yeah, this is a regression. I thought I had this fixed because I remember mentioning it somewhere.
> The network request should at least be cached for the next time the model is built. Your `timeit` is starting up a new session each time (which may or may not be what the user wants).
It's not starting a new session:
```console
$ python -m timeit -s 'import pyhf' -v 'pyhf.simplemodels.uncorrelated_background(signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0])'
1 loop -> 0.491 secs
raw times: 209 msec, 232 msec, 242 msec, 229 msec, 218 msec
1 loop, best of 5: 209 msec per loop
```
You can see the first `Timer.autorange()` call takes 0.491 s, which is with a cold cache. The rest of the calls take ~0.2 s, which is basically just the HTTPS round-trip time. Even with the file data cached, it's still 50x slower than before.
> You can see the first `Timer.autorange()` call takes 0.491 s, which is with a cold cache. The rest of the calls take ~0.2 s, which is basically just the HTTPS round-trip time. Even with the file data cached, it's still 50x slower than before.
That's weird, as the resolver shouldn't be making https calls once the schema is cached... so there must be something else going on here?
Every time `pyhf.schema.validate()` is called, a new `jsonschema.RefResolver` is created. The default remote cache is a per-instance attribute of `RefResolver`, so the cache will never be fully utilized by pyhf in this configuration.

Edit: Ah, but this part hasn't changed since the last release. You're talking about the other cache (in `RefResolver.store`) that isn't working as intended...

Edit 2: Okay, the answer there is similar and still in that same code (https://github.com/python-jsonschema/jsonschema/blob/v4.7.1/jsonschema/validators.py#L696). `RefResolver.__init__()` only copies the entries from the `store` passed to it, so it can't propagate any data back to `pyhf.schema.variables.SCHEMA_CACHE`. Each time a `RefResolver` is created, it therefore doesn't have the data that was cached by any previous `RefResolver`.
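A minimal standalone sketch of that copy semantics, using `jsonschema` directly (the URI here is made up for illustration):

```python
import jsonschema

shared_store = {}  # stand-in for pyhf.schema.variables.SCHEMA_CACHE

resolver = jsonschema.RefResolver(base_uri='', referrer={}, store=shared_store)

# RefResolver.__init__() copies the passed store into its own internal
# mapping, so anything this resolver caches later never flows back.
resolver.store['https://example.org/defs.json'] = {'type': 'object'}

assert 'https://example.org/defs.json' not in shared_store
```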
Interesting, then `store` seems pretty useless on its own, since based on the documentation I would have assumed it would be checked first:

> A mapping from URIs to documents to cache
So we'll have to refactor this based on the unexpected behavior in `jsonschema`.

EDIT: a naive solution would be to add the full URI for `defs.json` in the store, in addition to the local path, and have them both point to the same object in memory. This isn't the greatest, but I see the copy here.
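Roughly something like the following (a sketch only; `load_schema` and `pyhf.schema.variables.SCHEMA_CACHE` are the internals discussed above, and the exact cache keys are assumptions):

```python
from pyhf.schema import load_schema, variables

# Register the same in-memory schema object under both keys a RefResolver
# might look up: the local relative path and the full https $id.
defs = load_schema('defs.json')
variables.SCHEMA_CACHE['defs.json'] = defs
variables.SCHEMA_CACHE[defs['$id']] = defs
```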
> EDIT: a naive solution would be to add the full URI for `defs.json` in the store, in addition to the local path, and have them both point to the same object in memory. This isn't the greatest, but I see the copy here.
Sorry if I am missing your point, but is that not the behavior of v0.6.3? `load_schema` caches the schema under its `"$id"`, which for `'defs.json'` is `"https://scikit-hep.org/pyhf/schemas/1.0.0/defs.json"`.
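For reference, that v0.6.3 behavior amounts to roughly this (a paraphrase of the idea, not the actual source; the path handling and constants are simplified):

```python
import json
from pathlib import Path

SCHEMA_CACHE = {}
SCHEMA_BASE = 'https://scikit-hep.org/pyhf/schemas/'
SCHEMA_VERSION = '1.0.0'


def load_schema(schema_id):
    """Load a bundled schema from disk and cache it under its https $id."""
    full_id = f'{SCHEMA_BASE}{SCHEMA_VERSION}/{schema_id}'
    if full_id in SCHEMA_CACHE:
        return SCHEMA_CACHE[full_id]

    # Read the schema shipped with the package (location simplified here).
    path = Path('schemas') / SCHEMA_VERSION / schema_id
    with path.open(encoding='utf-8') as schema_file:
        schema = json.load(schema_file)

    # Keying the cache by the schema's own $id means a later lookup of the
    # https URL hits this cache instead of going out to the network.
    SCHEMA_CACHE[schema['$id']] = schema
    return SCHEMA_CACHE[schema['$id']]
```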