pyhf icon indicating copy to clipboard operation
pyhf copied to clipboard

fix: Populate the schema cache with local `defs.json` again

Open lhenkelm opened this issue 3 years ago • 7 comments

Description

Resolves #1916 by re-introducing the pre-population of the schema cache that had been done before the re-write of the schema utilities in #1753. Since defs.json is in the cache under its id, jsonschema.RefResolver will not attempt to connect to "https://scikit-hep.org" to find it, which can crash code using pyhf in environments where the resource is not reachable.

I also tried to include a test to avoid future regression, but I am not happy with it. A better/more robust test would be a dedicated run in a sandboxed (i.e. no internet access / relevant ports blocked) environment, e.g. as part of the CI. But I have no clue how to set that up.

Checklist Before Requesting Reviewer

  • [x] Tests are passing
  • [x] "WIP" removed from the title of the pull request
  • [ ] Selected an Assignee for the PR to be responsible for the log summary

Before Merging

For the PR Assignees:

  • [ ] Summarize commit messages into a comprehensive review of the PR
- load `defs.json` into `schema.variables.SCHEMA_CACHE` from the local file
  - in line with what was done before https://github.com/scikit-hep/pyhf/pull/1753
- a unit test to catch this regression in the future

lhenkelm avatar Jul 06 '22 12:07 lhenkelm

Codecov Report

Merging #1917 (f34bfb0) into master (e5bfdd0) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1917   +/-   ##
=======================================
  Coverage   98.26%   98.26%           
=======================================
  Files          69       69           
  Lines        4439     4443    +4     
  Branches      748      748           
=======================================
+ Hits         4362     4366    +4     
  Misses         45       45           
  Partials       32       32           
Flag Coverage Δ
contrib 26.04% <25.00%> (+0.22%) :arrow_up:
doctest 61.06% <100.00%> (+0.03%) :arrow_up:
unittests-3.10 96.15% <100.00%> (+<0.01%) :arrow_up:
unittests-3.7 96.13% <100.00%> (+<0.01%) :arrow_up:
unittests-3.8 96.17% <100.00%> (+<0.01%) :arrow_up:
unittests-3.9 96.19% <100.00%> (+<0.01%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/pyhf/schema/__init__.py 100.00% <100.00%> (ø)
src/pyhf/schema/loader.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov[bot] avatar Jul 06 '22 12:07 codecov[bot]

I'm trying to figure out or remind myself if this was removed to allow for a custom v1.0.0 to be loaded by the user and override the default defs.json that would have been shipped otherwise in this package. I guess I'm wondering if one should even allow this situation or not (probably not).

kratsg avatar Jul 11 '22 16:07 kratsg

For your tests, the easiest way is to monkeypatch any part of the necessary call to requests or urllib under the hood and force it to return an exception instead -- and you can also use mocker.spy to check if it's been called or not. Ideally, a local ref resolver shouldn't have to grab anything via internet connection.

kratsg avatar Jul 11 '22 16:07 kratsg

I'm trying to figure out or remind myself if this was removed to allow for a custom v1.0.0 to be loaded by the user and override the default defs.json that would have been shipped otherwise in this package. I guess I'm wondering if one should even allow this situation or not (probably not).

I don't remember an explicit discussion in that set of issues/PRs. I might have missed it though. You are ofc. correct that this creates a bit of a trap if one provides a custom defs.json, with version '1.0.0', and the same schema-id, then your cache and your set schema will disagree, quite confusingly.

I guess strictly speaking this breaks some assumptions of what the version and schema-id mean, if someine changes the schema they should change at least one of those too. But the simplest way of approaching a custom schema (copy the builtins and modify step by step) does lead straight into this trap :(.

Maybe we could add some logic to the schema-path setter that checks and warns/raises if some of the schemas found at the new path would cause this error? Because as long as the schema-id+version is unique to a definition, the cache should work fine.

lhenkelm avatar Jul 11 '22 16:07 lhenkelm

For your tests, the easiest way is to monkeypatch any part of the necessary call to requests or urllib under the hood and force it to return an exception instead -- and you can also use mocker.spy to check if it's been called or not. Ideally, a local ref resolver shouldn't have to grab anything via internet connection.

Yeah this is why the tests the PR is currently proposing to monkey-patch requests.get (the default w/o handlers) and urlopen (the fallback), according to the jsonschema implementation. Also deleting requests.sessions.Session.request and patching out socket.socket is just an attempt to anticipate any other workarounds. The sockets might be overkill though.

lhenkelm avatar Jul 11 '22 16:07 lhenkelm

@lhenkelm just to check, is this ready for review? I assume yes, but I just wanted to make sure that you didn't have anything that you wanted to come back and add before it got reviewed and went in.

matthewfeickert avatar Aug 12 '22 22:08 matthewfeickert

@matthewfeickert having thought some more about @kratsg's concern about users overriding the default schema using the default version and id, I would like to add a bit of code to at least clear the cache if the schema path/version are changed, to preempt these confusions.

A second step might be to also merge the jsonschema.RefResolver's cache into the pyhf cache at the end of validation, to allow custom schema users to benefit from the caching implicitly. But that can be second PR too (it goes beyond just a fix).

I am a bit busy ATM but I'll get on this this week and ping you once I am done.

lhenkelm avatar Aug 15 '22 08:08 lhenkelm

I wonder if we can simplify the PR via https://pypi.org/project/pytest-socket/ using the (socket_disabled) fixture.

kratsg avatar Aug 15 '22 20:08 kratsg

@matthewfeickert from my side this is ready to review. @kratsg socket_disabled works like a charm, thanks! If I drop the line pre-filling the cache, the test still catches the regression:

FAILED tests/test_schema.py::test_defs_always_cached - jsonschema.exceptions.RefResolutionError: A test tried to use socket.socket.

lhenkelm avatar Aug 16 '22 15:08 lhenkelm