Schema validation crashes when running in an environment without internet access
Summary
In master and the 0.7.0 release candidate, pyhf operations involving model validation crash in offline environments with a `RefResolutionError`. This is a common situation, e.g. on worker nodes of HTC clusters. The bug was introduced after 0.6.3, I think in #1753, where the schema pre-loading was dropped.
OS / Environment
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="http://cern.ch/linux/"
BUG_REPORT_URL="http://cern.ch/linux/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Steps to Reproduce
I don't know a good way to prepare an environment that demonstrates this directly. But the test below exposes the `RefResolver`'s attempt to resolve the schema id through its https URL: it fails against the release candidate/master, but passes on 0.6.3.
```python
from functools import partial

import pytest
import jsonschema

import pyhf


def make_asserting_handler(origin):
    def asserting_handler(*args, **kwargs):
        raise AssertionError(
            f'called URL request handler from {origin} with args={args!r}, kwargs={kwargs!r} '
            'when no call should have been needed'
        )

    return asserting_handler


@pytest.fixture
def no_http_jsonschema_ref_resolving(monkeypatch):
    # Fail loudly if jsonschema's RefResolver tries to fetch anything over http(s).
    asserting_handler = make_asserting_handler('handlers')
    handlers = {
        'https': asserting_handler,
        'http': asserting_handler,
    }
    WrappedResolver = partial(jsonschema.RefResolver, handlers=handlers)
    monkeypatch.setattr('jsonschema.RefResolver', WrappedResolver, raising=True)


def test_preloaded_cache(no_http_jsonschema_ref_resolving):
    spec = {
        'channels': [
            {
                'name': 'singlechannel',
                'samples': [
                    {
                        'name': 'signal',
                        'data': [10],
                        'modifiers': [
                            {'name': 'mu', 'type': 'normfactor', 'data': None}
                        ],
                    },
                    {
                        'name': 'background',
                        'data': [20],
                        'modifiers': [
                            {
                                'name': 'uncorr_bkguncrt',
                                'type': 'shapesys',
                                'data': [30],
                            }
                        ],
                    },
                ],
            }
        ]
    }
    try:
        pyhf.schema.validate(spec, 'model.json')
    except AttributeError:
        # pyhf.schema was introduced after 0.6.3; fall back to the old API
        pyhf.utils.validate(spec, 'model.json')
```
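Saving the above as e.g. `test_preloaded_cache.py` (the filename is arbitrary), the test can be run with:

```console
$ pytest -q test_preloaded_cache.py
```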
Expected Results
I expect schema validation to succeed without crashing even when there is no network access that allows resolving the https schema-ids.
Actual Results
```text
jsonschema.exceptions.RefResolutionError: HTTPSConnectionPool(host='scikit-hep.org', port=443): Max retries exceeded with url: /pyhf/schemas/1.0.0/defs.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2b2bb8457c40>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
```
pyhf Version
pyhf, version 0.7.0rc2
Code of Conduct
- [X] I agree to follow the Code of Conduct
Related to this is a large increase in the execution time of `pyhf.Model.__init__()` due to the network request. For example, using the "Hello World" case from the README, with v0.6.3 I get:
```console
$ python -m timeit -s 'import pyhf' 'pyhf.simplemodels.uncorrelated_background(signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0])'
50 loops, best of 5: 4.36 msec per loop
```
And for v0.7.0rc1:
```console
$ python -m timeit -s 'import pyhf' 'pyhf.simplemodels.uncorrelated_background(signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0])'
1 loop, best of 5: 218 msec per loop
```
For any code that builds many Models often (as some of mine does), this 50x slowdown can suddenly dominate the execution time.
> For any code that builds many Models often (as some of mine does), this 50x slowdown can suddenly dominate the execution time.

The network request should at least be cached for the next time the model is built. Your `timeit` is starting up a new session each time (which may or may not be what the user wants).
> I expect schema validation to succeed without crashing even when there is no network access that allows resolving the https schema-ids.

Yeah, this is a regression. I thought I had this fixed because I remember mentioning it somewhere.
> The network request should at least be cached for the next time the model is built. Your `timeit` is starting up a new session each time (which may or may not be what the user wants).
It's not starting a new session:
```console
$ python -m timeit -s 'import pyhf' -v 'pyhf.simplemodels.uncorrelated_background(signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0])'
1 loop -> 0.491 secs
raw times: 209 msec, 232 msec, 242 msec, 229 msec, 218 msec
1 loop, best of 5: 209 msec per loop
```
You can see the first `Timer.autorange()` call takes 0.491 s, which is with a cold cache. The rest of the calls take ~0.2 s, which is basically just the HTTPS round-trip time. Even with the file data cached, it's still 50x slower than before.
> You can see the first `Timer.autorange()` call takes 0.491 s, which is with a cold cache. The rest of the calls take ~0.2 s, which is basically just the HTTPS round-trip time. Even with the file data cached, it's still 50x slower than before.
That's weird, as the resolver shouldn't be making https calls once the schema is cached... so there must be something else going on here?
Every time `pyhf.schema.validate()` is called, a new `jsonschema.RefResolver` is created. The default remote cache is a per-instance attribute of `RefResolver`, so the cache will never be fully utilized by pyhf in this configuration.

Edit: Ah, but this part hasn't changed since the last release. You're talking about the other cache (in `RefResolver.store`) that isn't working as intended...

Edit 2: Okay, the answer there is similar and still in that same code (https://github.com/python-jsonschema/jsonschema/blob/v4.7.1/jsonschema/validators.py#L696). `RefResolver.__init__()` only copies the entries from the `store` passed to it, so it can't propagate any data back to `pyhf.schema.variables.SCHEMA_CACHE`. Each time a `RefResolver` is created, it therefore doesn't have the data that was cached by any previous `RefResolver`.
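A minimal standalone sketch of that copy semantics, using `jsonschema` directly (the URI here is made up for illustration):

```python
import jsonschema

shared_store = {}  # stand-in for pyhf.schema.variables.SCHEMA_CACHE

resolver = jsonschema.RefResolver(base_uri='', referrer={}, store=shared_store)

# RefResolver.__init__() copies the passed store into its own internal
# mapping, so anything this resolver caches later never flows back.
resolver.store['https://example.org/defs.json'] = {'type': 'object'}

assert 'https://example.org/defs.json' not in shared_store
```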
Interesting, then `store` seems pretty useless on its own, since based on the documentation I would have assumed it would be checked first:

> A mapping from URIs to documents to cache
So we'll have to refactor this based on the unexpected behavior in `jsonschema`.

EDIT: a naive solution would be to add the full URI for `defs.json` in the store, in addition to the local path, and have them both point to the same object in memory. This isn't the greatest, but I see the copy here.
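Roughly something like the following (a sketch only; `load_schema` and `pyhf.schema.variables.SCHEMA_CACHE` are the internals discussed above, and the exact cache keys are assumptions):

```python
from pyhf.schema import load_schema, variables

# Register the same in-memory schema object under both keys a RefResolver
# might look up: the local relative path and the full https $id.
defs = load_schema('defs.json')
variables.SCHEMA_CACHE['defs.json'] = defs
variables.SCHEMA_CACHE[defs['$id']] = defs
```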
> EDIT: a naive solution would be to add the full URI for `defs.json` in the store, in addition to the local path, and have them both point to the same object in memory. This isn't the greatest, but I see the copy here.
Sorry if I am missing your point, but is that not the behavior of v0.6.3? `load_schema` caches the schema under its `"$id"`, which for `'defs.json'` is `"https://scikit-hep.org/pyhf/schemas/1.0.0/defs.json"`.
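For reference, that v0.6.3 behavior amounts to roughly this (a paraphrase of the idea, not the actual source; the path handling and constants are simplified):

```python
import json
from pathlib import Path

SCHEMA_CACHE = {}
SCHEMA_BASE = 'https://scikit-hep.org/pyhf/schemas/'
SCHEMA_VERSION = '1.0.0'


def load_schema(schema_id):
    """Load a bundled schema from disk and cache it under its https $id."""
    full_id = f'{SCHEMA_BASE}{SCHEMA_VERSION}/{schema_id}'
    if full_id in SCHEMA_CACHE:
        return SCHEMA_CACHE[full_id]

    # Read the schema shipped with the package (location simplified here).
    path = Path('schemas') / SCHEMA_VERSION / schema_id
    with path.open(encoding='utf-8') as schema_file:
        schema = json.load(schema_file)

    # Keying the cache by the schema's own $id means a later lookup of the
    # https URL hits this cache instead of going out to the network.
    SCHEMA_CACHE[schema['$id']] = schema
    return SCHEMA_CACHE[schema['$id']]
```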