
`gen-rdf' cannot use local schema import

Open druimalban opened this issue 2 years ago • 3 comments

Describe the bug

I have a data schema, which is a work in progress. The idea is that folks submit data + a data schema not unlike this one, which makes use of types from our parent data model.

One example is as follows: https://github.com/wwaites/saved_fisdat/blob/main/examples/sentinel_cages/sentinel_cages_sampling.yaml

Our WIP data model is as follows: https://github.com/saved-models/data-model

To reproduce

Run:

gen-rdf sentinel_cages_sampling.yaml

This produces the following error:

urllib.error.HTTPError: HTTP Error 404: Not Found

It seems to be looking for this exact path, and not finding it. In contrast, other generators seem to append .yaml, and indeed I can use this exact file to generate HTML schema with gen-project, gen-doc, mkdocs and friends.

Appending .yaml to this seems to make it find the file, but produces the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/var/db/scratch/linkml/data-model/src/model/core.yaml.yaml'

Adding the --verbose flag only produces one more message:

INFO:root:Default_range not specified. Default set to 'string'
urllib.error.HTTPError: HTTP Error 404: Not Found

The stack trace, produced by adding the --stacktrace flag, is as follows, and seems largely unrevealing. My home directory on this workstation is /var/db/scratch, and I made a virtualenv in the directory scratch, which is a sub-directory of my CWD, ~/linkml:

Traceback (most recent call last):
  File "/var/db/scratch/linkml/scratch/bin/gen-rdf", line 33, in <module>
    sys.exit(load_entry_point('linkml', 'console_scripts', 'gen-rdf')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/linkml/generators/rdfgen.py", line 82, in cli
    print(RDFGenerator(yamlfile, **kwargs).serialize(**kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/linkml/utils/generator.py", line 307, in serialize
    self.end_schema(**kwargs)
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/linkml/generators/rdfgen.py", line 56, in end_schema
    graph.parse(
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/rdflib-7.0.0-py3.11.egg/rdflib/graph.py", line 1492, in parse
    parser.parse(source, self, **args)
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/rdflib-7.0.0-py3.11.egg/rdflib/plugins/parsers/jsonld.py", line 119, in parse
    to_rdf(data, conj_sink, base, context_data, version, generalized_rdf)
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/rdflib-7.0.0-py3.11.egg/rdflib/plugins/parsers/jsonld.py", line 138, in to_rdf
    return parser.parse(data, context, dataset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/rdflib-7.0.0-py3.11.egg/rdflib/plugins/parsers/jsonld.py", line 160, in parse
    context.load(local_context, context.base)
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/rdflib-7.0.0-py3.11.egg/rdflib/plugins/shared/jsonld/context.py", line 401, in load
    self._prep_sources(base, source, sources, referenced_contexts)
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/rdflib-7.0.0-py3.11.egg/rdflib/plugins/shared/jsonld/context.py", line 430, in _prep_sources
    new_ctx = self._fetch_context(
              ^^^^^^^^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/rdflib-7.0.0-py3.11.egg/rdflib/plugins/shared/jsonld/context.py", line 472, in _fetch_context
    source = source_to_json(source_url)  # type: ignore[assignment]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/rdflib-7.0.0-py3.11.egg/rdflib/plugins/shared/jsonld/util.py", line 44, in source_to_json
    source = create_input_source(source, format="json-ld")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/rdflib-7.0.0-py3.11.egg/rdflib/parser.py", line 401, in create_input_source
    ) = _create_input_source_from_location(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/rdflib-7.0.0-py3.11.egg/rdflib/parser.py", line 463, in _create_input_source_from_location
    input_source = URLInputSource(absolute_location, format)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/rdflib-7.0.0-py3.11.egg/rdflib/parser.py", line 270, in __init__
    response: addinfourl = _urlopen(req)
                           ^^^^^^^^^^^^^
  File "/var/db/scratch/linkml/scratch/lib/python3.11/site-packages/rdflib-7.0.0-py3.11.egg/rdflib/_networking.py", line 106, in _urlopen
    return urlopen(request)
           ^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 525, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 634, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 563, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Expected behavior

RDF should be generated or, failing that, the failure messages should be somewhat more revealing.

About your computer (if applicable, please complete the following information):

  • OS: Mac OS X Sonoma 14.4 Darwin darwin 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:51:37 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T8112 arm64
  • Python 3.11 from MacPorts

druimalban avatar Mar 25 '24 22:03 druimalban

am not maintainer, but am interested in using the RDF generator for local schema generation in a way that doesn't require absolute URIs, so taking a look -

setting a debugger, it looks like the URL that's throwing the error is 'http://localhost/data-model/src/model/core.context.jsonld'

that seems like it's being transformed from the JSON-LD @context by rdflib:

"@context": [
    "https://w3id.org/linkml/meta.context.jsonld",
    "https://w3id.org/linkml/types.context.jsonld",
    "../../data-model/src/model/core.context.jsonld",
    {
      "@base": "http://localhost/marinescot/"
    }
  ]

which is joining the @base and the relative path here: https://github.com/RDFLib/rdflib/blob/f792ad5aa92faefa7de8f8d07076eb670eba2b83/rdflib/plugins/shared/jsonld/context.py#L469

this does seem like one of those tricky cases where "RDF requires everything to be absolute URIs," but it seems like it should still be possible to make an RDF version of the schema even if one is just using placeholder localhost URIs.

If I change the id to just being a string like marinescot/sentinel_cages/sampling I get a little bit closer, except the same place urljoin()s my non-URI prefix marinescot with ../../data-model/src/model/core to get data-model/src/model/core.context.jsonld, so we lose the relative directory (and the core.context.jsonld doesn't exist anyway).
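The joining behaviour described above can be reproduced with Python's standard `urllib.parse.urljoin`, without involving rdflib at all. This is only a sketch: it assumes the base used for resolution is the `@base` value from the context shown earlier, and the scheme-less base `marinescot/` is a hypothetical stand-in for the plain-string id.

```python
from urllib.parse import urljoin

# Relative context reference taken from the "@context" list in this thread.
relative_context = "../../data-model/src/model/core.context.jsonld"

# Assumption: resolution uses the "@base" value shown in the context above.
http_base = "http://localhost/marinescot/"
resolved_http = urljoin(http_base, relative_context)
print(resolved_http)

# Assumption: a scheme-less base standing in for a plain-string schema id.
plain_base = "marinescot/"
resolved_plain = urljoin(plain_base, relative_context)
print(resolved_plain)
```

In both cases the `..` segments are consumed against the base rather than against the schema file's directory on disk, which is how the original relative location gets lost.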

I'm not quite sure what I would expect the behavior to be, but it does seem like we need to be wrapping this behavior a bit more on the linkml side. It seems like either a) there should be a way to tell rdflib not to try to resolve the URIs and just to treat them like strings, or b) we need to ensure that we have generated all the relevant contexts and that they are in the expected directories, and then pass in absolute file:// URIs whenever relative path imports are used.
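Option b) could look roughly like the following. This is a minimal sketch, not linkml's actual behaviour: `context_file_uri` is a hypothetical helper, and the schema path is assumed to be relative to the current working directory.

```python
from pathlib import Path


def context_file_uri(schema_path: str, relative_context: str) -> str:
    """Resolve a relative context reference against the schema file's
    directory and return an absolute file:// URI (hypothetical helper)."""
    schema_dir = Path(schema_path).resolve().parent
    # resolve() collapses the ".." segments even if the file does not exist yet.
    return (schema_dir / relative_context).resolve().as_uri()


# Assumed layout: the schema lives two directories below the working directory.
uri = context_file_uri(
    "examples/sentinel_cages/sentinel_cages_sampling.yaml",
    "../../data-model/src/model/core.context.jsonld",
)
print(uri)
```

Resolving against the schema file's own directory, rather than against the schema id, keeps the relative directory that the `urljoin()` step discards.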

sneakers-the-rat avatar Mar 25 '24 22:03 sneakers-the-rat

Thanks for that. However, I am not sure it is wholly pertinent to my issue with the CLI program's behaviour. E.g., I could call pdb in the gen-rdf script source in my virtualenv, but this does not seem ideal, given the issue I have is with the command line program, not the library or source.

My use-case for calling gen-rdf directly is so I don't have to write a Python wrapper for something I only intend to call very occasionally, and as part of the build/generation process for the project, which I call with a makefile.

I have since hosted all of these files on a web server, so the issues with local imports should no longer apply, but the same error occurs, and it is as difficult as before to work out what is going on.

Neither the stacktrace flag nor the log level flags appear to actually do anything to reveal more about what the gen-rdf tool is doing. It isn't clear to me why they are even accessible from the CLI at all since, as far as I understand, they speak to an existing Python logger which isn't initialised in a call to the CLI tool.

I think this is the crux of my issue, and something which is fixable without changing the behaviour of rdflib and local URIs. At the moment it is very difficult to see what, if anything, is going on with regard to fetching remote resources.
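As an illustration of what a more revealing message might contain: the `HTTPError` raised at the bottom of the traceback already carries the failing URL, so a wrapper could report it. The sketch below is an assumption about how that could be done, not existing gen-rdf behaviour; `describe_http_error` is a hypothetical helper, and the exception is built by hand so the example runs without network access.

```python
import io
from urllib.error import HTTPError


def describe_http_error(exc: HTTPError) -> str:
    """Build a message that names the URL that failed (hypothetical helper)."""
    return f"HTTP {exc.code} ({exc.reason}) while fetching {exc.url}"


# Stand-in for the exception in the traceback; the URL is the one reported
# from the debugger elsewhere in this thread.
example = HTTPError(
    "http://localhost/data-model/src/model/core.context.jsonld",
    404,
    "Not Found",
    hdrs=None,
    fp=io.BytesIO(b""),
)
message = describe_http_error(example)
print(message)
```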

druimalban avatar Apr 08 '24 15:04 druimalban

Oh for sure, better logging would go a long way here, that's a good point.

I think they would be the same locally and on a web server - the core issue (I believe) being that for relative imports, the generator expects a .context.jsonld file to have been generated for that schema. Since the model is set to use localhost it attempts to resolve it via http, so you'd need to both a) have generated the imported json-ld contexts and b) be serving them from some http server at that address ( http://localhost/data-model/src/model/core.context.jsonld ).

So a) is a bug, I think, with the generator -

  1. it shouldn't need to resolve contexts at generation time like that in the first place, that's an RDFlib thing - because that makes it impossible to generate multipart schemas without the other parts already existing. I.e. even if the ID was a URL that you were intending to host the schema at, you couldn't generate the schema unless it already existed! Confirming the imported contexts in the json-ld should be a validation thing, not a generation thing.
  2. the generator should either generate the required json-ld contexts, or be able to "roll them down" s.t. the imported terms are generated within the importing schema
  3. the generator should be more aware of relative and local imports so that it can correctly wrap them to avoid all the above - either by defaulting to "rolling contexts down" or transforming them to a local path.
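One possible reading of "rolling contexts down" is sketched below: relative context references are replaced by the term definitions they point at, so nothing has to be fetched at generation time. This is an assumption about the idea, not how linkml works; `roll_down_contexts`, the loader, and the `core` prefix are all hypothetical, and simple inlining like this ignores the finer points of JSON-LD context processing.

```python
from typing import Any, Callable, Dict, List


def roll_down_contexts(
    entries: List[Any],
    load_local: Callable[[str], Dict[str, Any]],
) -> List[Any]:
    """Inline relative context references; leave absolute URLs and
    inline dicts untouched (hypothetical helper)."""
    rolled: List[Any] = []
    for entry in entries:
        if isinstance(entry, str) and "://" not in entry:
            imported = load_local(entry)["@context"]
            # An imported context may itself be a list or a single mapping.
            if isinstance(imported, list):
                rolled.extend(imported)
            else:
                rolled.append(imported)
        else:
            rolled.append(entry)
    return rolled


# Hypothetical content for the imported context, keyed by its relative path.
fake_contexts = {
    "../../data-model/src/model/core.context.jsonld": {
        "@context": {"core": "http://localhost/core/"}
    }
}

entries = [
    "https://w3id.org/linkml/meta.context.jsonld",
    "../../data-model/src/model/core.context.jsonld",
    {"@base": "http://localhost/marinescot/"},
]
rolled = roll_down_contexts(entries, fake_contexts.__getitem__)
print(rolled)
```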

b) I think comes from the need for clarity around what the "id" field does - I'm not sure if you were planning to host these from a local http server, but there is a bit of ambiguity in the definition. id is formally required to be a URI, which requires a scheme at least, but for schemas that don't intend to be used like traditional linked data schemas that requirement is not all that relevant. In this case the way it interacts with the import and the generated schema might be a bit surprising.
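The "requires a scheme at least" point can be checked with the standard library. This is only a rough sketch, not linkml's own validation; `has_scheme` is a hypothetical helper, and the two ids are the ones mentioned in this thread.

```python
from urllib.parse import urlparse


def has_scheme(schema_id: str) -> bool:
    """Return True when the id parses with a URI scheme such as http."""
    return bool(urlparse(schema_id).scheme)


localhost_id_ok = has_scheme("http://localhost/marinescot/")
plain_id_ok = has_scheme("marinescot/sentinel_cages/sampling")
print(localhost_id_ok, plain_id_ok)
```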

So yes, we need better error messages and logging here so it's easier for you to see what the problem is, and there are some bugs or at least unintuitive behaviors in the rdf generator

sneakers-the-rat avatar Apr 08 '24 16:04 sneakers-the-rat