fix: Populate the schema cache with local `defs.json` again
Description
Resolves #1916 by re-introducing the pre-population of the schema cache that had been in place before the rewrite of the schema utilities in #1753. Since `defs.json` is stored in the cache under its `$id`, `jsonschema.RefResolver` will not attempt to connect to `https://scikit-hep.org` to find it, which would otherwise crash code using pyhf in environments where that resource is not reachable.
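The idea, as a rough sketch rather than the literal diff (`SCHEMA_CACHE` and `schema_dir` here stand in for `pyhf.schema.variables.SCHEMA_CACHE` and the packaged schema directory):

```python
import json
from pathlib import Path

# Sketch only: SCHEMA_CACHE stands in for pyhf.schema.variables.SCHEMA_CACHE
# and schema_dir for the schema directory that ships defs.json with pyhf.
SCHEMA_CACHE: dict = {}


def preload_defs(schema_dir: Path) -> dict:
    """Cache the local defs.json under its "$id" URI so that
    jsonschema.RefResolver never needs to fetch it over the network."""
    defs = json.loads((schema_dir / "defs.json").read_text(encoding="utf-8"))
    SCHEMA_CACHE[defs["$id"]] = defs
    return defs
```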
I also tried to include a test to guard against future regressions, but I am not happy with it. A better/more robust test would be a dedicated run in a sandboxed environment (i.e. no internet access / relevant ports blocked), e.g. as part of the CI, but I have no clue how to set that up.
Checklist Before Requesting Reviewer
- [x] Tests are passing
- [x] "WIP" removed from the title of the pull request
- [ ] Selected an Assignee for the PR to be responsible for the log summary
Before Merging
For the PR Assignees:
- [ ] Summarize commit messages into a comprehensive review of the PR
- load `defs.json` into `schema.variables.SCHEMA_CACHE` from the local file
- in line with what was done before https://github.com/scikit-hep/pyhf/pull/1753
- add a unit test to catch this regression in the future
Codecov Report
Merging #1917 (f34bfb0) into master (e5bfdd0) will increase coverage by 0.00%. The diff coverage is 100.00%.
@@ Coverage Diff @@
## master #1917 +/- ##
=======================================
Coverage 98.26% 98.26%
=======================================
Files 69 69
Lines 4439 4443 +4
Branches 748 748
=======================================
+ Hits 4362 4366 +4
Misses 45 45
Partials 32 32
| Flag | Coverage Δ | |
|---|---|---|
| contrib | 26.04% <25.00%> (+0.22%) | :arrow_up: |
| doctest | 61.06% <100.00%> (+0.03%) | :arrow_up: |
| unittests-3.10 | 96.15% <100.00%> (+<0.01%) | :arrow_up: |
| unittests-3.7 | 96.13% <100.00%> (+<0.01%) | :arrow_up: |
| unittests-3.8 | 96.17% <100.00%> (+<0.01%) | :arrow_up: |
| unittests-3.9 | 96.19% <100.00%> (+<0.01%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/pyhf/schema/__init__.py | 100.00% <100.00%> (ø) | |
| src/pyhf/schema/loader.py | 100.00% <100.00%> (ø) | |
I'm trying to figure out or remind myself if this was removed to allow for a custom v1.0.0 to be loaded by the user and override the default defs.json that would have been shipped otherwise in this package. I guess I'm wondering if one should even allow this situation or not (probably not).
For your tests, the easiest way is to monkeypatch any part of the necessary call to `requests` or `urllib` under the hood and force it to raise an exception instead -- and you can also use `mocker.spy` to check if it's been called or not. Ideally, a local ref resolver shouldn't have to grab anything via internet connection.
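Roughly along these lines, as a hypothetical sketch (assumes pytest-mock for the `mocker` fixture; the test name and the model-building call are just illustrative):

```python
import requests

import pyhf


def test_no_remote_ref_lookup(mocker):
    # Wrap requests.get so we can count calls while leaving it functional.
    spy = mocker.spy(requests, "get")

    # Building a model validates the spec, whose schema $ref's defs.json.
    pyhf.simplemodels.uncorrelated_background(
        signal=[1.0], bkg=[5.0], bkg_uncertainty=[1.0]
    )

    # The local cache should have answered every $ref, so no HTTP request.
    assert spy.call_count == 0
```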
> I'm trying to figure out or remind myself if this was removed to allow for a custom `v1.0.0` to be loaded by the user and override the default `defs.json` that would have been shipped otherwise in this package. I guess I'm wondering if one should even allow this situation or not (probably not).
I don't remember an explicit discussion in that set of issues/PRs. I might have missed it though.
You are of course correct that this creates a bit of a trap: if one provides a custom `defs.json` with version `1.0.0` and the same schema-id, then the cache and the set schema will disagree, quite confusingly.
I guess strictly speaking this breaks some assumptions about what the version and schema-id mean; if someone changes the schema they should change at least one of those too. But the simplest way of approaching a custom schema (copy the built-ins and modify step by step) leads straight into this trap :(.
Maybe we could add some logic to the schema-path setter that checks, and warns/raises, if any of the schemas found at the new path would cause this problem? Because as long as the schema-id + version is unique to a definition, the cache should work fine.
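As a rough sketch of what I mean (the names are placeholders for the actual pyhf internals, not the real setter):

```python
import json
import warnings
from pathlib import Path

# Stand-in for pyhf.schema.variables.SCHEMA_CACHE.
SCHEMA_CACHE: dict = {}


def set_schema_path(new_path: Path) -> None:
    """Hypothetical check run when the schema path is changed."""
    for schema_file in Path(new_path).glob("**/*.json"):
        candidate = json.loads(schema_file.read_text(encoding="utf-8"))
        cached = SCHEMA_CACHE.get(candidate.get("$id"))
        if cached is not None and cached != candidate:
            warnings.warn(
                f"{schema_file} re-uses the cached $id {candidate['$id']!r} "
                "with different content; validation may silently use the cached copy"
            )
    # ... then actually switch the schema path as before ...
```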
> For your tests, the easiest way is to monkeypatch any part of the necessary call to `requests` or `urllib` under the hood and force it to raise an exception instead -- and you can also use `mocker.spy` to check if it's been called or not. Ideally, a local ref resolver shouldn't have to grab anything via internet connection.
Yeah, this is why the test this PR currently proposes monkey-patches `requests.get` (the default when no handlers are given) and `urlopen` (the fallback), following the `jsonschema` implementation.
Also deleting `requests.sessions.Session.request` and patching out `socket.socket` is just an attempt to anticipate any other workarounds. The socket patching might be overkill though.
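For reference, the shape of the monkeypatching is roughly this (a sketch, not the exact test in the PR; where `urlopen` is bound can differ between `jsonschema` versions, hence the defensive patching):

```python
import urllib.request

import jsonschema.validators
import requests

import pyhf


def test_defs_resolved_without_network(monkeypatch):
    """Sketch of the monkeypatching approach, not the exact test in this PR."""

    def _no_network(*args, **kwargs):
        raise RuntimeError("test attempted a network request")

    # requests.get is what jsonschema uses by default to fetch a remote $ref.
    monkeypatch.setattr(requests, "get", _no_network)
    # urlopen is the fallback; patch it both at its source and where
    # jsonschema binds it (the latter may not exist in every version,
    # hence raising=False).
    monkeypatch.setattr(urllib.request, "urlopen", _no_network)
    monkeypatch.setattr(
        jsonschema.validators, "urlopen", _no_network, raising=False
    )

    # Any validating call will do; model construction validates its spec,
    # whose schema $ref's defs.json.
    pyhf.simplemodels.uncorrelated_background(
        signal=[1.0], bkg=[5.0], bkg_uncertainty=[1.0]
    )
```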
@lhenkelm just to check, is this ready for review? I assume yes, but I just wanted to make sure that you didn't have anything that you wanted to come back and add before it got reviewed and went in.
@matthewfeickert having thought some more about @kratsg's concern about users overriding the default schema using the default version and id, I would like to add a bit of code to at least clear the cache if the schema path/version are changed, to preempt these confusions.
A second step might be to also merge the `jsonschema.RefResolver`'s cache into the pyhf cache at the end of validation, to allow custom-schema users to benefit from the caching implicitly. But that can be a second PR too (it goes beyond just a fix).
I am a bit busy ATM but I'll get on this this week and ping you once I am done.
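To make the second-step idea a bit more concrete, a sketch of what I have in mind (assumptions: `jsonschema.RefResolver`'s `store` attribute behaves as described, and `SCHEMA_CACHE` again stands in for `pyhf.schema.variables.SCHEMA_CACHE`):

```python
import jsonschema

# Stand-in for pyhf.schema.variables.SCHEMA_CACHE.
SCHEMA_CACHE: dict = {}


def validate_and_cache(spec: dict, schema: dict) -> None:
    """Validate spec against schema, then keep any newly resolved documents."""
    resolver = jsonschema.RefResolver(
        base_uri=schema.get("$id", ""), referrer=schema, store=SCHEMA_CACHE
    )
    jsonschema.Draft6Validator(schema, resolver=resolver).validate(spec)
    # RefResolver records the documents it resolved in .store; merging it
    # back means the next validation of a custom schema hits the cache.
    SCHEMA_CACHE.update(resolver.store)
```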
I wonder if we can simplify the PR via https://pypi.org/project/pytest-socket/ using the `socket_disabled` fixture.
@matthewfeickert from my side this is ready to review.
@kratsg `socket_disabled` works like a charm, thanks! If I drop the line pre-filling the cache, the test still catches the regression:
FAILED tests/test_schema.py::test_defs_always_cached - jsonschema.exceptions.RefResolutionError: A test tried to use socket.socket.
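For reference, the test boils down to something like this sketch (the model-building call is just one example of code that triggers schema validation; not necessarily the exact merged test):

```python
import pyhf


def test_defs_always_cached(socket_disabled):
    # pytest-socket's socket_disabled fixture patches out socket.socket, so
    # any attempt by jsonschema.RefResolver to fetch defs.json remotely
    # fails hard; passing means the local schema cache answered the $ref.
    pyhf.simplemodels.uncorrelated_background(
        signal=[1.0], bkg=[5.0], bkg_uncertainty=[1.0]
    )
```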