rdflib
ConjunctiveGraph doesn't handle parsing datasets with default graphs properly
When ConjunctiveGraph.parse is called, it wraps its underlying store in a regular Graph instance. This causes problems for parsers of datasets, e.g. NQuads, TriG and JSON-LD.
Specifically, the triples in the default graph of a dataset haphazardly end up in bnode-named contexts.
Example:
from rdflib import ConjunctiveGraph

cg = ConjunctiveGraph()
cg.parse(format="nquads", data=u"""
<http://example.org/a> <http://example.org/ns#label> "A" .
<http://example.org/b> <http://example.org/ns#label> "B" <http://example.org/b/> .
""")
assert len(cg.default_context) == 1 # fails
While I've attempted to overcome this by using the underlying graph.store in these parsers, they cannot access the default_context of ConjunctiveGraph through this store. It is there in the underlying store, but its identifier is inaccessible to the parser without further changes to the parse method of ConjunctiveGraph.
This becomes tricky because the contract for ConjunctiveGraph's parse method is:
Parse source adding the resulting triples to its own context
(sub graph of this graph).
See :meth:`rdflib.graph.Graph.parse` for documentation on arguments.
:Returns:
The graph into which the source was parsed. In the case of n3
it returns the root context.
I am not sure how we can change this behaviour, since client code may rely on this. We could either add a new method, e.g. parse_dataset, or a flag. That would not be obvious to all users though, and somehow I would like to change the behaviour to handle datasets as well. It is always possible to get/create a named graph from a conjunctive graph and parse data into that.
I have gotten further by adding publicID=cg.default_context.identifier to the parse invocation. This causes the TriG parser to behave properly (and it is easy to adapt the nquads parser to work from there on). But I am not sure if this is a wise solution to the problem.
I'll mull more on this given time, but it would be good to have more people consider a proper revision of the parsing mechanism for datasets.
This underlies the problems described in #432 and #433 (and is related to #428).
(Obviously, this in turn causes the serializers for the same formats to emit unexpected bnode-named graphs when data has been read through these parsers.)
It might make sense that one should simply parse into the default_context of a ConjunctiveGraph or Dataset, like:
import rdflib

cg = rdflib.ConjunctiveGraph()
# `data` is assumed to hold a TriG document as a string
cg.default_context.parse(data=data, format='trig')
print(cg.serialize(format='trig'))
By doing it like this (along with a bunch of fairly recent fixes on RDFLib master), this could be considered good enough. It doesn't seem intuitive though.
Leaving this open in case we want to redesign the parsing of datasets to make this more obvious.
Hmm, so maybe the 6.0.0 label was wrong? Can this go in 4.2.2 then (so no backwards incompatibility) and just be closed and re-opened if desired?
There would be no change in telling users to parse into default_context; that just seems unintuitive.
I'd say leave this open (but for 5.0.0 maybe?) since it is about changing the parsing usage/behaviour when parsing dataset syntaxes (nquads, trig, json-ld and trix). The current wiring of graphs, contexts and underlying stores could really do with such an overhaul.
This issue is still a problem in RDFLib 6.0.2. The workaround of publicID=cg.default_context.identifier does work but is indeed unintuitive.
We really do need to be able to say:
cg = Dataset()
cg.parse("some-quads-file.trig") # RDF file type worked out by guess_format()
... and then have the default_context == whatever the TriG file said the default graph was.
Fix is more or less ready, please have a look:
- https://github.com/RDFLib/rdflib/pull/2406