
N-Quads serializer ignores default graph

Open edmondchuc opened this issue 3 years ago • 19 comments

The following script can be run as-is:

from rdflib import Dataset

data = """
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

{
	_:b0 <http://www.w3.org/ns/prov#generatedAtTime> "2012-04-09"^^xsd:date .
}

_:b0 {
	<http://greggkellogg.net/foaf#me> a <http://xmlns.com/foaf/0.1/Person> ;
		<http://xmlns.com/foaf/0.1/knows> "http://manu.sporny.org/about#manu" ;
		<http://xmlns.com/foaf/0.1/name> "Gregg Kellogg" .

	<http://manu.sporny.org/about#manu> a <http://xmlns.com/foaf/0.1/Person> ;
		<http://xmlns.com/foaf/0.1/knows> "http://greggkellogg.net/foaf#me" ;
		<http://xmlns.com/foaf/0.1/name> "Manu Sporny" .
}


"""

g = Dataset()
g.parse(data=data, format="trig")

g.print(format="nquads")

Output:

_:nde95dc418226482f9fb7b0242109b9a3b1 <http://www.w3.org/ns/prov#generatedAtTime> "2012-04-09"^^<http://www.w3.org/2001/XMLSchema#date> _:Neae5d6b422ed4d1d872dd9674af22f8f .
<http://greggkellogg.net/foaf#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> _:nde95dc418226482f9fb7b0242109b9a3b1 .
<http://manu.sporny.org/about#manu> <http://xmlns.com/foaf/0.1/name> "Manu Sporny" _:nde95dc418226482f9fb7b0242109b9a3b1 .
<http://manu.sporny.org/about#manu> <http://xmlns.com/foaf/0.1/knows> "http://greggkellogg.net/foaf#me" _:nde95dc418226482f9fb7b0242109b9a3b1 .
<http://greggkellogg.net/foaf#me> <http://xmlns.com/foaf/0.1/knows> "http://manu.sporny.org/about#manu" _:nde95dc418226482f9fb7b0242109b9a3b1 .
<http://manu.sporny.org/about#manu> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> _:nde95dc418226482f9fb7b0242109b9a3b1 .
<http://greggkellogg.net/foaf#me> <http://xmlns.com/foaf/0.1/name> "Gregg Kellogg" _:nde95dc418226482f9fb7b0242109b9a3b1 .

Issue

I would have expected the first statement of the output to omit the graph label as it is a statement in the default graph.

_:nde95dc418226482f9fb7b0242109b9a3b1 <http://www.w3.org/ns/prov#generatedAtTime> "2012-04-09"^^<http://www.w3.org/2001/XMLSchema#date> _:Neae5d6b422ed4d1d872dd9674af22f8f .

See https://www.w3.org/TR/n-quads/#simple-triples for reference.

edmondchuc avatar Apr 17 '22 15:04 edmondchuc

Hmm, it's somewhat related to https://github.com/RDFLib/rdflib/issues/1804.

edmondchuc avatar Apr 17 '22 15:04 edmondchuc

May be related to this also:

  • https://github.com/RDFLib/rdflib/blob/6f2c11cd2c549d6410f9a1c948ab3a8dbf77ca00/test/variants/rdf11trig_eg2.trig
  • https://github.com/RDFLib/rdflib/blob/6f2c11cd2c549d6410f9a1c948ab3a8dbf77ca00/test/variants/rdf11trig_eg2.nq
  • https://github.com/RDFLib/rdflib/blob/6f2c11cd2c549d6410f9a1c948ab3a8dbf77ca00/test/test_graph/test_variants.py#L144-L166
  • https://github.com/RDFLib/rdflib/blob/6f2c11cd2c549d6410f9a1c948ab3a8dbf77ca00/test/test_roundtrip.py#L145-L166

EDIT: Actually on second thought no, maybe not.

aucampia avatar Apr 17 '22 15:04 aucampia

I guess this is a more general issue with how rdflib serializes context-aware stores. Changing the output format to trig results in the same issue, thus breaking round-tripping.

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ns1: <urn:x-rdflib:> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

_:nd65e1d4bf7a34d92a06e4e619a245037b1 {
    <http://greggkellogg.net/foaf#me> a foaf:Person ;
        foaf:knows "http://manu.sporny.org/about#manu" ;
        foaf:name "Gregg Kellogg" .

    <http://manu.sporny.org/about#manu> a foaf:Person ;
        foaf:knows "http://greggkellogg.net/foaf#me" ;
        foaf:name "Manu Sporny" .
}

_:N427df78321f84718beec24f5f0c7e26c {
    [] prov:generatedAtTime "2012-04-09"^^xsd:date .
}

edmondchuc avatar Apr 17 '22 15:04 edmondchuc

I noticed this issue while I was working on implementing a more efficient integration of pyld as a parser into rdflib core https://github.com/RDFLib/rdflib/pull/1836.

My implementation sets the graph name to rdflib.graph.DATASET_DEFAULT_GRAPH_ID when the statement is from the default graph and correctly serializes the dataset.

I noticed that the output in the nquads and trig formats was different. Those parsers and serializers fail round-trips by ignoring the default graph and incorrectly setting it to a blank node.

An easy fix (I think) is to set the context to rdflib.graph.DATASET_DEFAULT_GRAPH_ID in the failing serializers.

edmondchuc avatar Apr 17 '22 16:04 edmondchuc

To add further to this, it may be that those other serializers are adding statements from the default graph as None which results in adding those statements to a graph labelled with a blank node. I need to confirm this.

For example:

from rdflib.graph import DATASET_DEFAULT_GRAPH_ID

# Instead of this
store.add((s, p, o), None)

# Do this
store.add((s, p, o), DATASET_DEFAULT_GRAPH_ID)

edmondchuc avatar Apr 17 '22 16:04 edmondchuc

An easy fix (I think) is to set the context to rdflib.graph.DATASET_DEFAULT_GRAPH_ID in the failing serializers.

Oops, I take that back. This only works correctly for trig.

The nquads serializer just needs to omit the graph label when it sees rdflib.graph.DATASET_DEFAULT_GRAPH_ID.

Currently it serializes something like:

_:b0 <http://www.w3.org/ns/prov#generatedAtTime> "2012-04-09"^^<http://www.w3.org/2001/XMLSchema#date> <urn:x-rdflib:default> .
<http://manu.sporny.org/about#manu> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> _:b0 .
<http://greggkellogg.net/foaf#me> <http://xmlns.com/foaf/0.1/knows> "http://manu.sporny.org/about#manu"^^<http://www.w3.org/2001/XMLSchema#string> _:b0 .
<http://manu.sporny.org/about#manu> <http://xmlns.com/foaf/0.1/knows> "http://greggkellogg.net/foaf#me"^^<http://www.w3.org/2001/XMLSchema#string> _:b0 .
<http://greggkellogg.net/foaf#me> <http://xmlns.com/foaf/0.1/name> "Gregg Kellogg"^^<http://www.w3.org/2001/XMLSchema#string> _:b0 .
<http://greggkellogg.net/foaf#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> _:b0 .
<http://manu.sporny.org/about#manu> <http://xmlns.com/foaf/0.1/name> "Manu Sporny"^^<http://www.w3.org/2001/XMLSchema#string> _:b0 .

Notice the <urn:x-rdflib:default>.
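To make the intended behaviour concrete, here is a minimal sketch of the rule, not the actual rdflib serializer code. `format_quad` and `DEFAULT_GRAPH_LABEL` are hypothetical names, and the terms are assumed to already be in N-Quads syntax:

```python
from typing import Optional

# Internal default-graph label as it appears in the output above.
DEFAULT_GRAPH_LABEL = "<urn:x-rdflib:default>"


def format_quad(s: str, p: str, o: str, graph: Optional[str] = None) -> str:
    """Format one N-Quads line (hypothetical helper, not rdflib API).

    All terms are expected to be already written in N-Quads syntax.
    """
    if graph is None or graph == DEFAULT_GRAPH_LABEL:
        # Default graph: write a plain triple with no graph label.
        return f"{s} {p} {o} ."
    # Named graph: keep the graph label as the fourth term.
    return f"{s} {p} {o} {graph} ."


print(format_quad(
    "_:b0",
    "<http://www.w3.org/ns/prov#generatedAtTime>",
    '"2012-04-09"^^<http://www.w3.org/2001/XMLSchema#date>',
    DEFAULT_GRAPH_LABEL,
))
```

With this rule the first statement would come out as a simple triple, while statements in the `_:b0` graph keep their label.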

edmondchuc avatar Apr 17 '22 16:04 edmondchuc

The nquads serializer just needs to omit the graph label when it sees rdflib.graph.DATASET_DEFAULT_GRAPH_ID.

You're not wrong, I've been addressing this in the dataset re-work, changes to nquads serializer here

ghost avatar Apr 17 '22 17:04 ghost

The nquads serializer just needs to omit the graph label when it sees rdflib.graph.DATASET_DEFAULT_GRAPH_ID.

You're not wrong, I've been addressing this in the dataset re-work, changes to nquads serializer here

Can't we get this in without breaking changes?

aucampia avatar Apr 17 '22 17:04 aucampia

Can't we get this in without breaking changes?

Yes. In this instance, correcting the serialization doesn't cause any breaking changes.

ghost avatar Apr 17 '22 17:04 ghost

Thanks for your work @gjhiggins. I've copied your code from the nquads serializer out into a separate PR. I hope you don't mind. I need this patch to get the JSON-LD 1.1 tests to pass.

Can I ask why + [DATASET_DEFAULT_GRAPH_ID] is required?

https://github.com/RDFLib/rdflib/blob/4fba0ffc54cb22f1a4ce94a0ba0a93b7a8923c03/rdflib/plugins/serializers/nquads.py#L38-L41

I had to remove it because the serialize method was outputting double the statements.

edmondchuc avatar Apr 20 '22 11:04 edmondchuc

Thanks for your work @gjhiggins. I've copied your code from the nquads serializer out into a separate PR. I hope you don't mind. I need this patch to get the JSON-LD 1.1 tests to pass.

That's cool, I don't mind at all, whatever works for you.

Can I ask why + [DATASET_DEFAULT_GRAPH_ID] is required?

It's a consequence of switching over to Dataset. ConjunctiveGraph.contexts() returns all graphs, including the default graph and Dataset.graphs() doesn't include the (nameless) default graph.
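That difference would plausibly explain the doubled output: if the listing already includes the default graph, appending it again visits it twice. A rough illustration with plain lists standing in for the two listings (the names and the `graph_ids_to_serialize` helper are hypothetical, this is not the serializer's code):

```python
DEFAULT_GRAPH_ID = "urn:x-rdflib:default"  # internal identifier mentioned in this thread

# Stand-ins for the two listings described above (assumed shapes):
listing_with_default = [DEFAULT_GRAPH_ID, "http://example.com/graphs/1"]
listing_without_default = ["http://example.com/graphs/1"]


def graph_ids_to_serialize(listing):
    """Append the default graph only when the listing lacks it."""
    ids = list(listing)
    if DEFAULT_GRAPH_ID not in ids:
        ids.append(DEFAULT_GRAPH_ID)
    return ids


# Unconditional append: the default graph is visited twice.
naive = listing_with_default + [DEFAULT_GRAPH_ID]
print(naive.count(DEFAULT_GRAPH_ID))  # 2

# Guarded append: exactly once, whichever listing is used.
print(graph_ids_to_serialize(listing_with_default))
print(graph_ids_to_serialize(listing_without_default))
```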

ghost avatar Apr 20 '22 15:04 ghost

Hi, I just noticed that when I take the multigraph example from the JSON-LD standard and convert it to N-Quads, the main graph is suddenly referenced by a blank node label instead of having no label. My code is:

from rdflib.graph import Dataset

data = """{
"@context": [
"http://schema.org/",
{"@base": "http://example.com/"}
],
"@graph": [{
"@id": "people/alice",
"gender": [
{"@value": "weiblich", "@language": "de"},
{"@value": "female",   "@language": "en"}
],
"knows": {"@id": "people/bob"},
"name": "Alice"
}, {
"@id": "graphs/1",
"@graph": {
"@id": "people/alice",
"parent": {
"@id": "people/bob",
"name": "Bob"
}
}
}, {
"@id": "graphs/2",
"@graph": {
"@id": "people/bob",
"sibling": {
"name": "Mary",
"sibling": {"@id": "people/bob"}
}
}
}]
}"""

ds = Dataset()
ds.parse(data=data, format="json-ld")
print(ds.serialize(format="nquads").strip())

The result looks like this for me:

<http://example.com/people/bob> <http://schema.org/name> "Bob" <http://example.com/graphs/1> .
<http://example.com/people/alice> <http://schema.org/parent> <http://example.com/people/bob> <http://example.com/graphs/1> .
<http://example.com/people/alice> <http://schema.org/gender> "female"@en _:N6535627397b54eb2b076091aaccf8a98 .
<http://example.com/people/alice> <http://schema.org/name> "Alice" _:N6535627397b54eb2b076091aaccf8a98 .
<http://example.com/people/alice> <http://schema.org/gender> "weiblich"@de _:N6535627397b54eb2b076091aaccf8a98 .
<http://example.com/people/alice> <http://schema.org/knows> <http://example.com/people/bob> _:N6535627397b54eb2b076091aaccf8a98 .
<http://example.com/people/bob> <http://schema.org/sibling> _:Na4b162b6579f4d0a9aa68d2d0f65572c <http://example.com/graphs/2> .
_:Na4b162b6579f4d0a9aa68d2d0f65572c <http://schema.org/name> "Mary" <http://example.com/graphs/2> .
_:Na4b162b6579f4d0a9aa68d2d0f65572c <http://schema.org/sibling> <http://example.com/people/bob> <http://example.com/graphs/2> .

However on the JSON-LD playground, the output for N-Quads conversion looks like this instead:

<http://example.com/people/alice> <http://schema.org/gender> "female"@en .
<http://example.com/people/alice> <http://schema.org/gender> "weiblich"@de .
<http://example.com/people/alice> <http://schema.org/knows> <http://example.com/people/bob> .
<http://example.com/people/alice> <http://schema.org/name> "Alice" .
<http://example.com/people/alice> <http://schema.org/parent> <http://example.com/people/bob> <http://example.com/graphs/1> .
<http://example.com/people/bob> <http://schema.org/name> "Bob" <http://example.com/graphs/1> .
<http://example.com/people/bob> <http://schema.org/sibling> _:b0 <http://example.com/graphs/2> .
_:b0 <http://schema.org/name> "Mary" <http://example.com/graphs/2> .
_:b0 <http://schema.org/sibling> <http://example.com/people/bob> <http://example.com/graphs/2> .

Is this issue likely to be solved soon?

sdasda7777 avatar Apr 12 '23 12:04 sdasda7777

So is there a way to avoid <urn:x-rdflib:default> when serializing Dataset? I'm using N-Quads.

namedgraph avatar Jun 08 '23 12:06 namedgraph

@namedgraph In what sense is the <urn:x-rdflib:default> an issue for you? Any N-Quads parser should parse that as data in the default graph, right?

I think there was some trick to it, where the default graph will or won't be in there depending on how you insert it into the Dataset, but I would avoid depending on that as that could change at any time without so much as a notice.

sdasda7777 avatar Jun 08 '23 13:06 sdasda7777

@namedgraph In what sense is the <urn:x-rdflib:default> an issue for you? Any N-Quads parser should parse that as data in the default graph, right?

It is an issue because the default graph should not have a name; as soon as it does, it is no longer the default graph.

aucampia avatar Jun 08 '23 13:06 aucampia

@namedgraph In what sense is the <urn:x-rdflib:default> an issue for you? Any N-Quads parser should parse that as data in the default graph, right?

Uhh, no? This is not standard in any way. The 4th element of a quad should be omitted for triples in the default graph:

The graph label IRI can be omitted, in which case the triples are considered part of the default graph of the RDF dataset.

https://www.w3.org/TR/n-quads/#simple-triples

namedgraph avatar Jun 08 '23 13:06 namedgraph

Uhh, no? This is not standard in any way.

My bad, you're right. Actually seems like some kind of internal rdflib thing that's leaking out by accident.

The 4th element of a quad should be omitted for triples in the default graph:

The graph label IRI can be omitted, in which case the triples are considered part of the default graph of the RDF dataset.

Just for completeness, I don't think this is exactly true. While it does say that a statement with no graphLabel belongs to the default graph, I don't think it specifies that the default graph may not be referred to using an IRI, in case that ever got standardised.

sdasda7777 avatar Jun 08 '23 13:06 sdasda7777

So is there a way to avoid <urn:x-rdflib:default> when serializing Dataset? I'm using N-Quads.

Not that I know of, I will be working on fixing the Dataset issue in the coming months but it is all a bit tangled.
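A text-level stopgap is possible in the meantime: strip the label from the serialized output afterwards. This is only a rough sketch, assuming the label is written as the literal text `<urn:x-rdflib:default>` at the end of a line; `strip_default_graph_label` is a hypothetical helper, not part of rdflib:

```python
DEFAULT_GRAPH_LABEL = "<urn:x-rdflib:default>"
SUFFIX = f" {DEFAULT_GRAPH_LABEL} ."


def strip_default_graph_label(nquads: str) -> str:
    """Remove the internal default-graph label from the end of N-Quads lines."""
    out = []
    for line in nquads.splitlines():
        stripped = line.rstrip()
        # Require at least subject, predicate, object, label and the final dot,
        # so a plain triple whose *object* is this IRI is left alone.
        if stripped.endswith(SUFFIX) and len(stripped.split()) >= 5:
            stripped = stripped[: -len(SUFFIX)] + " ."
        out.append(stripped)
    # Keep a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if nquads.endswith("\n") else "")


sample = (
    '_:b0 <http://www.w3.org/ns/prov#generatedAtTime> '
    '"2012-04-09"^^<http://www.w3.org/2001/XMLSchema#date> '
    '<urn:x-rdflib:default> .\n'
)
print(strip_default_graph_label(sample), end="")
```

It works on the output text only, so it does nothing about the identifier inside the Dataset itself.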

aucampia avatar Jun 08 '23 16:06 aucampia

While it does say that a statement with no graphLabel belongs to the default graph, I don't think it specifies that the default graph may not be referred to using an IRI, in case that ever got standardised.

Kinda explicit in the wording: “The default graph does not have a name”.

My understanding is that this is inherited from SPARQL: a query that does not specify a graph name is posed against the default graph, which in consequence cannot have a name.

However, RDFLib binds an identifier to every graph (probably inherited from the extant implementations of Store) and if an identifier isn't provided, a BNode is used.

In consequence, in the RDFLib implementation, a Dataset's default graph, being an RDFLib Graph, is (for the time being, unavoidably) assigned the (internal) identifier DATASET_DEFAULT_GRAPH_ID (bound to urn:x-rdflib:default) but this is not intended for external consumption - use of the Dataset().default_graph reference is recommended.

So is there a way to avoid <urn:x-rdflib:default> when serializing Dataset? I'm using N-Quads.

Because the default graph doesn't have a name, that's a must - but there are some slightly-inobvious consequences.

I've spent some time looking into the issues here and I do have a mostly-complete solution that I'm using to tease out some of the options. If you'll forgive me some elaboration, I'm including some example code that uses as input a slightly-changed test/data/sportquads.trig, with a couple of triples added: a student_30 with foaf:name "Dudley Moore":

diff --git a/test/data/sportquads.trig b/test/data/sportquads.trig
+
+<http://example.com/resource/student_30> a ont:Student ;
+        foaf:name "Dudley Moore" .

And some annotated test code ...

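from test.data import TEST_DATA_DIR  # rdflib test-suite helper (assumed import path)

from rdflib import Dataset, URIRef

# Assumed definition, matching the context asserted for d3 below
context0 = URIRef("urn:example:context-0")


def test_dataset_serialize():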
    d1 = Dataset()
    d1.parse(
        TEST_DATA_DIR / "sportquads.trig",  # Augmented with the two triples mentioned
        format="trig",
        publicID=""  #  Uncontextualised statements -> default_graph
    )
    assert len(d1) == 2  # uncontextualised statements (“triples”) in the default graph

    # And the contexts created ...
    assert sorted(list(d1.contexts())) == [
        URIRef('http://example.org/graph/practise'),
        URIRef('http://example.org/graph/sports'), 
        URIRef('http://example.org/graph/students'),
    ]  # Note: no mention of `<urn:x-rdflib:default>` aka “the graph with no name”

    # it serializes as expected ...
    assert sorted(d1.serialize(format="nquads").splitlines()) == [
        "",
        "<http://example.com/resource/sport_100> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ontology/Sport> <http://example.org/graph/sports> .",
        '<http://example.com/resource/sport_100> <http://www.w3.org/2000/01/rdf-schema#label> "Tennis" <http://example.org/graph/sports> .',
        "<http://example.com/resource/student_10> <http://example.com/ontology/practises> <http://example.com/resource/sport_100> <http://example.org/graph/practise> .",
        "<http://example.com/resource/student_10> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ontology/Student> <http://example.org/graph/students> .",
        '<http://example.com/resource/student_10> <http://xmlns.com/foaf/0.1/name> "Venus Williams" <http://example.org/graph/students> .',
        "<http://example.com/resource/student_20> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ontology/Student> <http://example.org/graph/students> .",
        '<http://example.com/resource/student_20> <http://xmlns.com/foaf/0.1/name> "Demi Moore" <http://example.org/graph/students> .',
        "<http://example.com/resource/student_30> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ontology/Student>  .",
        '<http://example.com/resource/student_30> <http://xmlns.com/foaf/0.1/name> "Dudley Moore"  .',
    ]  # uncontextualized statements preserved as such, just as in the trig source

    # Quads are no issue so let's work with uncontextualized statements
    sportstriples = d1.serialize(format='nt')  # Decontextualize the statements

    # Use nquads parser to read triples into the default graph
    d2 = Dataset()
    d2.parse(
        data=sportstriples,
        format="nquads")  # Read uncontextualized statements as nquads
    assert len(d2) == 9  # All parsed into the default graph
    assert len(list(d2.contexts())) == 0  # only named graphs are contexts

    # Use nquads parser to read triples into a named graph (aka “context”)
    d3 = Dataset()
    d3.parse(
        data=sportstriples,
        format="nquads",
        publicID=context0  # Assert a context for the uncontextualised statements
    )
    assert len(d3) == 0  # No triples in default graph
    assert len(d3.graph(context0)) == 9  # All statements now contextualized
    assert list(d3.contexts()) == [
        URIRef('urn:example:context-0')
    ]  # Only one context, as specified

    # Now back to `d1` and some fun stuff ...
    assert len(d1) == 2  # the two added triples
    d1.default_union = True
    assert len(d1) == 9  # decontextualise all statements
    d1.default_union = False
    assert len(d1) == 2  # back to base

Why is it “fun stuff”? Because of SPARQL_DEFAULT_GRAPH_UNION: “If True - the default graph in the RDF Dataset is the union of all named graphs”
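For anyone unfamiliar with the union behaviour, here is a toy model of the effect on `len()`. `ToyDataset` is a hypothetical stand-in rather than rdflib's `Dataset`, and it assumes the union view also counts the default graph's own triples, as the counts in the test above suggest:

```python
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class ToyDataset:
    """Toy stand-in for a dataset; the key None holds the default graph."""

    def __init__(self) -> None:
        self.graphs: Dict[Optional[str], Set[Triple]] = {None: set()}
        self.default_union = False

    def add(self, triple: Triple, graph: Optional[str] = None) -> None:
        self.graphs.setdefault(graph, set()).add(triple)

    def __len__(self) -> int:
        if self.default_union:
            # Union view: distinct triples across the default and all named graphs.
            return len(set().union(*self.graphs.values()))
        # Otherwise only the default graph's own triples are counted.
        return len(self.graphs[None])


ds = ToyDataset()
ds.add(("student_30", "type", "Student"))
ds.add(("student_30", "name", "Dudley Moore"))
ds.add(("student_10", "name", "Venus Williams"), "students")
ds.add(("student_10", "practises", "sport_100"), "practise")

print(len(ds))  # 2
ds.default_union = True
print(len(ds))  # 4
```

Flipping the flag changes only what the default graph appears to contain; the named graphs themselves are untouched.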

It is indeed tangled, the reason why this isn't a draft PR is that I'm playing whack-a-mole with the tests :smile:

ghost avatar Jun 08 '23 18:06 ghost