RefResolver.resolve_fragment() breaks nested relative references

Open bartfeenstra opened this issue 6 years ago • 8 comments

First off, thanks for this great package! Depending on the outcome of this issue, I'm hoping to contribute back in a follow-up.

I am trying to validate the following JSON, available at http://127.0.0.1:5000/about/json/schema. Note: this JSON is itself a JSON Schema, and it contains the schema with which to validate itself (#/definitions/response/schema). This may be a little confusing.

{
   "definitions":{
      "request":{

      },
      "response":{
         "schema":{
            "oneOf":[
               {
                  "type":"object",
                  "properties":{
                     "errors":{
                        "type":"array",
                        "items":{
                           "$ref":"#/definitions/data/error"
                        }
                     }
                  },
                  "required":[
                     "errors"
                  ]
               },
               {
                  "$ref":"http://127.0.0.1:5000/about/json/external-schema/aHR0cDovL2pzb24tc2NoZW1hLm9yZy9kcmFmdC0wNC9zY2hlbWE%3D",
                  "description":"A JSON Schema."
               }
            ]
         },
         "openapi":{
            "$ref":"http://127.0.0.1:5000/about/json/external-schema/aHR0cDovL3N3YWdnZXIuaW8vdjIvc2NoZW1hLmpzb24%3D"
         }
      },
      "data":{
         "error":{
            "type":"object",
            "properties":{
               "code":{
                  "type":"string"
               },
               "title":{
                  "type":"string"
               }
            },
            "required":[
               "code",
               "title"
            ]
         }
      }
   },
   "$schema":"http://127.0.0.1:5000/about/json/schema#/definitions/response/schema"
}

I am performing this validation using the following Python code:

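from typing import Dict, Optional

import jsonschema
from jsonschema import RefResolver


# Method on the reporter's own class (the class body is not shown in the report).
def validate(self, data, schema: Optional[Dict] = None):
    reference_resolver = RefResolver('', {})
    if schema is None:
        # No schema given: fall back to the one the document names in "$schema".
        message = 'The JSON must be an object with a "$schema" key.'
        if not isinstance(data, dict):
            raise ValueError('The JSON is not an object: %s' % message)
        if '$schema' not in data:
            raise KeyError('No "$schema" key found: %s' % message)
        # resolve() returns a (resolved URL, referenced subschema) pair.
        _, schema = reference_resolver.resolve(data['$schema'])
    assert schema is not None
    # Call the library function explicitly so it is not confused with this method.
    jsonschema.validate(data, schema)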

RefResolver.resolve() returns the resolved schema shown below. The references it contains remain intact, but their targets are no longer part of the returned document, so when validating, the reference cannot be resolved and validation fails.

{
   "oneOf":[
      {
         "type":"object",
         "properties":{
            "errors":{
               "type":"array",
               "items":{
                  "$ref":"#/definitions/data/error"
               }
            }
         },
         "required":[
            "errors"
         ]
      },
      {
         "$ref":"http://127.0.0.1:5000/about/json/external-schema/aHR0cDovL2pzb24tc2NoZW1hLm9yZy9kcmFmdC0wNC9zY2hlbWE%3D",
         "description":"A JSON Schema."
      }
   ]
}
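To make the failure mode concrete, here is a minimal sketch using only the standard library. It is not jsonschema's implementation: resolve_pointer is a hypothetical helper, and the document is a trimmed-down stand-in for the schema above. It shows that extracting a subschema by pointer keeps the "$ref" string but loses the "definitions" that the string points into.

```python
# Minimal sketch (standard library only); resolve_pointer is a hypothetical
# helper, not part of jsonschema.
from urllib.parse import unquote


def resolve_pointer(document, pointer):
    """Walk a JSON Pointer such as '/definitions/data/error' through document."""
    node = document
    parts = pointer.split("/")[1:] if pointer else []
    for part in parts:
        # Undo percent-encoding and JSON Pointer escaping (~1 -> '/', ~0 -> '~').
        part = unquote(part).replace("~1", "/").replace("~0", "~")
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Trimmed-down stand-in for the reported document.
document = {
    "definitions": {
        "response": {
            "schema": {"items": {"$ref": "#/definitions/data/error"}},
        },
        "data": {"error": {"type": "object"}},
    },
}

# Extracting the subschema keeps the "$ref" string but drops its surroundings.
fragment = resolve_pointer(document, "/definitions/response/schema")
ref = fragment["items"]["$ref"].lstrip("#")

# Against the full document the reference still resolves.
target = resolve_pointer(document, ref)

# Against the extracted fragment it cannot, because "definitions" is gone.
try:
    resolve_pointer(fragment, ref)
    resolved_in_fragment = True
except KeyError:
    resolved_in_fragment = False
```

If the sketch reflects what happens here, a validator that is handed only the fragment has no way to find "#/definitions/data/error" unless it also knows which document the fragment came from.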

I am not sure if the problem lies with my use of references, or if this is simply something for which no Python support has been added. I also went through the other issues about references and don't think I found a duplicate, but I'd love to know for sure if there is another thread with more information on this feature, if I am not accidentally doing something wrong myself :)

bartfeenstra avatar Dec 07 '17 19:12 bartfeenstra

Currently, the RefResolver doesn't handle $id keywords during reference resolution, or support dereferencing through $ref keywords. In theory this should be simple to solve:

  1. Upon encountering an $id keyword in the document, push it onto the current resolution scope, and pop it again once resolution succeeds. With the current API, entering with resolver.resolving(ref): would then push the $ref's scope onto the context for the operations that follow.
  2. When indexing a JSON document by some index / key, look for $ref keywords and follow them, effectively replacing the current resolution document and splicing the contents of "$ref" into the ref path.

In practice this will probably need some cleaning up, as the existing design is quite simple (and tries to separate parts of ref resolution that perhaps belong together).
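The two steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not RefResolver's actual code: lookup and store are hypothetical names, the store is assumed to map absolute URIs to already-loaded documents, and pointer escaping, circular references and keys that merely happen to be named "$id" or "$ref" are ignored.

```python
# Rough sketch of the proposal; lookup() and store are hypothetical names.
from urllib.parse import urldefrag, urljoin


def lookup(store, base_uri, ref):
    """Resolve ref against base_uri, tracking "$id" scopes and following "$ref"s.

    Returns the referenced node and the resolution scope it was found in.
    """
    uri, fragment = urldefrag(urljoin(base_uri, ref))
    node = store[uri]
    scopes = [uri]  # step 1: stack of resolution scopes, local to this lookup
    for part in fragment.split("/")[1:]:
        node = node[int(part)] if isinstance(node, list) else node[part]
        if isinstance(node, dict) and "$id" in node:
            # Step 1: an "$id" changes the base URI for everything beneath it.
            scopes.append(urljoin(scopes[-1], node["$id"]))
        if isinstance(node, dict) and "$ref" in node:
            # Step 2: continue the walk from the target of the "$ref".
            node, scope = lookup(store, scopes[-1], node["$ref"])
            scopes.append(scope)
    return node, scopes[-1]


store = {
    "https://example.com/root.json": {
        "$id": "https://example.com/root.json",
        "definitions": {
            "alias": {"$ref": "#/definitions/real"},
            "real": {
                "$id": "real.json",
                "properties": {"name": {"type": "string"}},
            },
        },
    },
}

# The pointer passes through "alias", which is itself only a "$ref".
node, scope = lookup(
    store,
    "https://example.com/root.json",
    "#/definitions/alias/properties/name",
)
```

Because the scope stack lives inside each call, it is discarded when the lookup returns, which is one way to get the "undo upon resolution success" behaviour. The final scope is returned so the caller can resolve any further references relative to it.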

agoose77 avatar Jul 12 '18 23:07 agoose77

@Julian @agoose77 @bartfeenstra Do you all have any fresh thoughts on this? We are using this with our generated schema, which has definitions with $refs to other definitions, and that causes the same break. We can work around it by inlining those $refs, but this makes schemas unnecessarily large and difficult to read. I'm willing to take a stab at adding the functionality to resolve arbitrarily nested $refs but wanted to see if you all had ideas first.

tjb9dc avatar Oct 18 '18 03:10 tjb9dc

@Julian This seems to be fixed on master branch...when are you planning on cutting a release for 3.0.0?

tjb9dc avatar Oct 18 '18 13:10 tjb9dc

nvm I'm dumb, could just grab pre-release! Great work!! Love jsonSchema ❤️

tjb9dc avatar Oct 18 '18 13:10 tjb9dc

I could be wrong, but from glancing over the implementation, nothing has changed w.r.t ref handling & id scoping to fix this yet (as far as I can see).

agoose77 avatar Oct 18 '18 18:10 agoose77

Nothing should have changed in this area yeah, so if this is working now, would definitely like to hear it.

In general for these sorts of things I need to stress that it'd be super helpful for the reporter to work on minimizing their example.

There's a lot here that could be left out while still presenting the underlying issue. (I know it's been a long while since this was filed, so apologies for not giving that feedback sooner, but every additional unnecessary detail means more time I or someone else investigating would need to spend, which makes it less likely I'll actually have time to do so :)

Julian avatar Oct 21 '18 17:10 Julian

@Julian makes sense, I'll see if I can simplify and anonymize our schema and JSON payload into something very concise that fails on 2.6.0 and works on 3.0.0

tjb9dc avatar Oct 21 '18 18:10 tjb9dc

Here's an example of what I believe is the same issue.

instance:

{
  "multi_address": [
    {
      "address1": "123 Main",
      "city": "foo",
      "state": "AK",
      "zipcode": "12345"
    }
  ],
  "single_address": {
    "address1": "123 Main",
    "city": "foo",
    "state": "AK",
    "zipcode": "12345"
  }
}

schema:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schema.json",
  "type": "object",
  "additionalProperties": false,
  "definitions": {
    "address": {
      "$id": "/definitions/address",
      "$schema": "https://json-schema.org/draft/2020-12/schema",
      "type": "object",
      "properties": {
        "address1": { "type": "string" },
        "address2": { "type": "string" },
        "city": { "type": "string" },
        "state": { "type": "string", "$ref": "#/definitions/state" },
        "zipcode": { "type": "string" }
      },
      "required": ["address1", "city", "state", "zipcode"]
    },
    "state": {
      "$id": "/definitions/state",
      "$schema": "https://json-schema.org/draft/2020-12/schema",
      "type": "string",
      "enum": [
        "AK",
        "AL",
        "AR",
        "AS",
        "AZ",
        "WY"
      ]
    }
  },
  "properties": {
    "multi_address": {
      "type": "array",
      "items": { "$ref": "#/definitions/address" }
    },
    "single_address": { "$ref": "#/definitions/address" }
  }
}

code:

#!/usr/bin/env python

import json
import sys
import pprint
from jsonschema.validators import validator_for
from jsonschema import FormatChecker, RefResolver

schema_file = sys.argv[1]
instance_file = sys.argv[2]

with open(schema_file) as fh:
    schema = json.load(fh)

with open(instance_file) as fh:
    instance = json.load(fh)

schema_store = {
    schema["$id"]: schema,
}

resolver = RefResolver.from_schema(schema, store=schema_store)
validator = validator_for(schema)(schema, format_checker=FormatChecker(), resolver=resolver)
validator.validate(instance)
errors = []
for err in validator.iter_errors(instance=instance):
    errors.append(err)

pprint.pprint(errors)

which raises:

jsonschema.exceptions.RefResolutionError: Unresolvable JSON pointer: 'definitions/state'

The same error occurs using the jsonschema CLI:

$ jsonschema schema.json -i example.json

UPDATE:

Dereferencing the schema manually before passing it in to the validator succeeds:

#!/usr/bin/env python

import json
import sys
import pprint
import jsonref
from jsonschema.validators import validator_for
from jsonschema import FormatChecker

schema_file = sys.argv[1]
instance_file = sys.argv[2]

with open(schema_file) as fh:
    schema = json.load(fh)

with open(instance_file) as fh:
    instance = json.load(fh)

deref_schema = jsonref.loads(json.dumps(schema))

validator = validator_for(schema)(deref_schema, format_checker=FormatChecker())
validator.validate(instance)
errors = []
for err in validator.iter_errors(instance=instance):
    errors.append(err)

pprint.pprint(errors)

pkarman avatar Dec 14 '21 03:12 pkarman

I'm (slowly) trying to help minimize this and similar examples in issues. I'll get to the original example (hopefully in a few moments), but @pkarman, your example isn't a bug in jsonschema; your schema has a broken pointer. Specifically, you have (simplified):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "definitions": {
    "address": {
      "$id": "/definitions/address",
      "properties": {"state": { "$ref": "#/definitions/state" }}
    },
    "state": {"$id": "/definitions/state"}
  },
  "properties": {"address": { "$ref": "#/definitions/address" }}
}

where address has an $id (i.e. is a separate document), and within it you have #/definitions/state. But address has no definitions of its own; state lives in the surrounding document, so you cannot use a pointer with # to refer to it.
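The effect of the nested $id on the base URI can be checked with the standard library alone. The sketch below assumes references are resolved per RFC 3986, as urllib.parse.urljoin does, and borrows the root "$id" (https://example.com/schema.json) from the full schema in the report, since the simplified version above omits it.

```python
# Sketch of how the base URI changes; standard library only.
from urllib.parse import urljoin

# Root "$id", taken from the full schema in the report (an assumption here).
root_id = "https://example.com/schema.json"

# Each nested "$id" is resolved against the root and becomes a new base URI.
address_base = urljoin(root_id, "/definitions/address")
state_base = urljoin(root_id, "/definitions/state")

# "#/definitions/state" inside address is a pointer into the address document.
broken = urljoin(address_base, "#/definitions/state")

# A reference without "#" resolves to the URI that state's "$id" declares.
working = urljoin(address_base, "/definitions/state")
```

Here broken comes out as https://example.com/definitions/address#/definitions/state, a pointer into a document that has no definitions, while working equals state_base. Under these assumptions, writing the reference as "/definitions/state" (or dropping the nested $id values) is one way the schema could be repaired.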

Side note: you can confirm what other implementations besides this one do with your schemas using a new tool I wrote called bowtie, with intro post here -- in this case, running:

bowtie validate -i python-jsonschema <(echo '{"$schema": "https://json-schema.org/draft/2020-12/schema", "definitions": {"address": {"$id": "/definitions/address", "properties": {"state": { "$ref": "#/definitions/state" }}}, "state": {"$id": "/definitions/state"}}, "properties": {"address": { "$ref": "#/definitions/address" }}}') <(echo '{"address": {"state": "AK"}}')

and substituting python-jsonschema for other implementations like js-hyperjump shows mostly the same behavior as this one, though interestingly a few don't match and do something I haven't yet investigated with the broken ref.

Julian avatar Nov 23 '22 17:11 Julian

Hello there! Thanks a lot again for the kind words.

This, along with many many other $ref-related issues, is now finally being handled in #1049 with the introduction of a new referencing library which is fully compliant and has APIs which I hope are a lot easier to understand and customize.

The next release of jsonschema (v4.18.0) will contain a merged version of that PR, and should be released shortly in beta, and followed quickly by a regular release, assuming no critical issues are reported.

It looks from my testing like indeed this specific example works there! If you still care to, I'd love it if you tried out the beta once it is released, or certainly it'd be hugely helpful to immediately install the branch containing this work (https://github.com/python-jsonschema/jsonschema/tree/referencing) and confirm. You can in the interim find documentation for the change in a preview page here.

I'm going to close this given it indeed seems like it is addressed by #1049, but feel free to follow up with any comments. Sorry for the delay in getting to these, but hopefully this new release will bring lots of benefit!

Here's a quick pass at modifying what you're doing for that branch/the future release, in case it helps:

from referencing import Registry, Resource
from referencing.jsonschema import DRAFT7
import jsonschema

# Placeholder: the parsed JSON document (a dict) from the original report.
data = "<your big thing>"

resource = DRAFT7.create_resource(data)
# Register the document under its own URL; the two external schemas are stubbed with empty schemas.
registry = Registry().with_resources(
    [
        ("http://127.0.0.1:5000/about/json/schema", resource),
        ("http://127.0.0.1:5000/about/json/external-schema/aHR0cDovL2pzb24tc2NoZW1hLm9yZy9kcmFmdC0wNC9zY2hlbWE%3D", DRAFT7.create_resource({})),
        ("http://127.0.0.1:5000/about/json/external-schema/aHR0cDovL3N3YWdnZXIuaW8vdjIvc2NoZW1hLmpzb24%3D", DRAFT7.create_resource({})),
    ],
)
# Look up the subschema that the document's own "$schema" points at.
schema = registry.resolver().lookup(data['$schema']).contents
assert schema is not None
jsonschema.validate(schema=schema, instance=data, registry=registry)

Julian avatar Feb 23 '23 09:02 Julian