rdflib ConjunctiveGraph doesn't handle parsing datasets with default graphs properly

When ConjunctiveGraph.parse is called, it wraps its underlying store in a regular Graph instance. This causes problems for parsers of datasets, e.g. NQuads, TriG and JSON-LD.

Specifically, the triples in the default graph of a dataset haphazardly end up in bnode-named contexts.

Example:

from rdflib import ConjunctiveGraph

cg = ConjunctiveGraph()
cg.parse(format="nquads", data=u"""
<http://example.org/a> <http://example.org/ns#label> "A" .
<http://example.org/b> <http://example.org/ns#label> "B" <http://example.org/b/> .
""")
assert len(cg.default_context) == 1 # fails

While I've attempted to overcome this by using the underlying graph.store in these parsers, they cannot access the default_context of ConjunctiveGraph through this store. It is there in the underlying store, but its identifier is inaccessible to the parser without further changes to the parse method of ConjunctiveGraph.

This becomes tricky because the contract for ConjunctiveGraph's parse method is:

    Parse source adding the resulting triples to its own context
    (sub graph of this graph).

    See :meth:`rdflib.graph.Graph.parse` for documentation on arguments.

    :Returns:

    The graph into which the source was parsed. In the case of n3
    it returns the root context.

I am not sure how we can change this behaviour, since client code may rely on this. We could either add a new method, e.g. parse_dataset, or a flag. That would not be obvious to all users though, and somehow I would like to change the behaviour to handle datasets as well. It is always possible to get/create a named graph from a conjunctive graph and parse data into that.

I have gotten further by adding publicID=cg.default_context.identifier to the parse invocation. This causes the TriG parser to behave properly (and it is easy to adapt the nquads parser to work from there on). But I am not sure if this is a wise solution to the problem.
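To make the mechanism concrete, here is a minimal toy model of why handing the default context's identifier to the parser helps. It is only a sketch: ToyStore, ToyConjunctiveGraph, parse_quads and fresh_bnode are hypothetical names invented for illustration and do not mirror rdflib's internals. The assumption is simply that a parser which sees only the store has to invent an identifier for default-graph statements unless the caller supplies one.

```python
import itertools

# Toy model only: every name here is hypothetical, not rdflib API.
_bnode_counter = itertools.count(1)


def fresh_bnode():
    """Mint a new blank-node-like identifier."""
    return "_:b%d" % next(_bnode_counter)


class ToyStore:
    """Maps a context identifier to the set of triples in that context."""

    def __init__(self):
        self.contexts = {}

    def add(self, triple, context_id):
        self.contexts.setdefault(context_id, set()).add(triple)

    def size(self, context_id):
        # Read-only lookup: does not create the context as a side effect.
        return len(self.contexts.get(context_id, set()))


class ToyConjunctiveGraph:
    """Owns a store plus the identifier of its default context."""

    def __init__(self):
        self.store = ToyStore()
        self.default_id = fresh_bnode()


def parse_quads(store, quads, default_id=None):
    """Load (s, p, o, graph) tuples; graph is None for default-graph statements.

    The parser only sees the store, so unless the caller passes
    default_id it has to invent an identifier of its own.
    """
    if default_id is None:
        default_id = fresh_bnode()
    for s, p, o, g in quads:
        store.add((s, p, o), g if g is not None else default_id)


QUADS = [
    ("ex:a", "ex:label", "A", None),      # default graph
    ("ex:b", "ex:label", "B", "ex:b/"),   # named graph
]

# Without the identifier, the default-graph triple lands in a stray bnode context.
broken = ToyConjunctiveGraph()
parse_quads(broken.store, QUADS)
broken_default_size = broken.store.size(broken.default_id)

# With the identifier passed in (the analogue of the publicID workaround),
# the triple ends up in the graph's own default context.
fixed = ToyConjunctiveGraph()
parse_quads(fixed.store, QUADS, default_id=fixed.default_id)
fixed_default_size = fixed.store.size(fixed.default_id)

print(broken_default_size, fixed_default_size)
```

In this toy, the explicit default_id argument plays the role that publicID=cg.default_context.identifier plays in the workaround: it is the only channel through which the parser learns which context is the default one.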

I'll mull more on this given time, but it would be good to have more people consider a proper revision of the parsing mechanism for datasets.

This underlies the problems described in #432 and #433 (and is related to #428).

(Obviously, this in turn causes the serializers for the same formats to emit unexpected bnode-named graphs when data has been read through these parsers.)

Nov 22 '14 20:11 niklasl

It might make sense that one should simply parse into the default_context of a ConjunctiveGraph or Dataset, like:

import rdflib

cg = rdflib.ConjunctiveGraph()
cg.default_context.parse(data=data, format='trig')  # data: a TriG document as a string
print(cg.serialize(format='trig'))

By doing it like this (along with a bunch of fairly recent fixes on RDFLib master), this could be considered good enough. It doesn't seem intuitive though.

Leaving this open in case we want to redesign the parsing of datasets to make this more obvious.

Aug 04 '16 12:08 niklasl

hmm, so maybe the 6.0.0 label was wrong? can this go in 4.2.2 then (so no backwards incompatibility) and just be closed and re-opened if desired?

Aug 04 '16 14:08 joernhees

Telling users to parse into default_context would not change any behaviour; it just seems unintuitive.

I'd say leave this open (but for 5.0.0 maybe?) since it is about changing the parsing usage/behaviour when parsing dataset syntaxes (nquads, trig, json-ld and trix). The current wiring of graphs, contexts and underlying stores could really do with such an overhaul.

Aug 04 '16 15:08 niklasl

This issue is still a problem in RDFLib 6.0.2. The workaround of publicID=cg.default_context.identifier does work but is indeed unintuitive.

We really do need to be able to say:

cg = Dataset()
cg.parse("some-quads-file.trig")   # RDF file type worked out by guess_format()

... and then have the default_context == whatever the TriG file said the default graph was.

Dec 07 '21 00:12 nicholascar

Fix is more or less ready, please have a look:

https://github.com/RDFLib/rdflib/pull/2406

May 24 '23 21:05 aucampia
