
Bug: Unexpected namespace creation during turtle file serialization

Open • mickremedi opened this issue 9 months ago • 4 comments

Hello! I've noticed unexpected behavior when serializing a graph to Turtle: adding a triple to an empty graph and then serializing it adds a prefix binding, both to the Turtle output and to the graph itself:

import rdflib

TRIPLE = (
    rdflib.URIRef("http://example1.com/s"),
    rdflib.URIRef("http://example2.com/p"),
    rdflib.Literal("some literal"),
)

g = rdflib.Graph(bind_namespaces="none")
g.add(TRIPLE)

print("Namespaces Before:", list(g.namespaces()))

x = g.serialize(format="turtle")

print(x)
print("Namespaces After:", list(g.namespaces()))

Results in:

Namespaces Before: []
@prefix ns1: <http://example2.com/> .

<http://example1.com/s> ns1:p "some literal" .


Namespaces After: [('ns1', rdflib.term.URIRef('http://example2.com/'))]

Whereas I would expect:

Namespaces Before: []
<http://example1.com/s> <http://example2.com/p> "some literal" .


Namespaces After: []

I've boiled it down to the following line: https://github.com/RDFLib/rdflib/blob/fb43b7afe80175aedd87506899dff2ccdb312c66/rdflib/plugins/serializers/turtle.py#L270

Here a new prefix is created whenever the node being processed during serialization is the predicate of a triple. I couldn't trace the origin of this change through git blame, and I couldn't find any docs explaining that serialize modifies the graph. Does anyone know why this was put there, and whether the call can be changed to self.getQName(node, gen_prefix=False)? This seems to have already been done for trig files in #2467.
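To make the mechanism concrete, here is a toy model of the behavior described above. It is not rdflib's code: ToyNamespaceRegistry and its compute_qname method are hypothetical stand-ins, and the namespace split is a naive "everything up to the last slash". It only illustrates how a lookup that is allowed to generate a prefix ends up mutating the bindings as a side effect.

```python
from typing import Dict, Optional, Tuple


class ToyNamespaceRegistry:
    """Hypothetical stand-in for a graph's prefix bindings (not rdflib code)."""

    def __init__(self) -> None:
        self.bindings: Dict[str, str] = {}  # prefix -> namespace

    def compute_qname(self, uri: str, generate: bool) -> Optional[Tuple[str, str]]:
        # Naive split: namespace is everything up to and including the last "/".
        namespace, _, local = uri.rpartition("/")
        namespace += "/"
        for prefix, bound in self.bindings.items():
            if bound == namespace:
                return prefix, local
        if not generate:
            return None  # caller falls back to the full <uri> form
        prefix = f"ns{len(self.bindings) + 1}"
        self.bindings[prefix] = namespace  # side effect: the registry is mutated
        return prefix, local


registry = ToyNamespaceRegistry()
subject, predicate = "http://example1.com/s", "http://example2.com/p"

# Mirrors the reported behavior: only the predicate lookup may mint a prefix.
subject_qname = registry.compute_qname(subject, generate=False)
predicate_qname = registry.compute_qname(predicate, generate=True)

print(subject_qname)      # None
print(predicate_qname)    # ('ns1', 'p')
print(registry.bindings)  # {'ns1': 'http://example2.com/'}
```

In this sketch, passing generate=False for the predicate as well would leave the bindings empty, which corresponds to the gen_prefix=False change asked about above.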

mickremedi • May 03 '24 19:05

Running into the same issue

sardormajano • May 06 '24 15:05

Please solve this!

seo-chang • May 07 '24 21:05

Quick Note: I've been able to patch this bug for now by overriding the getQName() method:

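from rdflib.plugins.serializers.turtle import TurtleSerializer

class FixedTurtleSerializer(TurtleSerializer):
    # Ignore the caller's gen_prefix so no lookup ever creates a new prefix.
    def getQName(self, uri, gen_prefix=True):
        return super().getQName(uri, gen_prefix=False)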

Overriding the method covers every place in the serializer that calls it, rather than just the one line linked above. I'm considering opening a PR to change serialize so that it does not generate namespaces by default, since:

  • There are no docs explaining this behavior
  • Serializers generally don't modify the data of what they are serializing
  • If generating prefixes is done to help optimize the resulting ttl, it would make more sense to apply this generation to any identifier found in a triple statement rather than just the predicate (this can also be added in the PR)

A possible interface could look like this (the generate_prefixes parameter is only proposed here and does not exist yet):

g = rdflib.Graph(bind_namespaces="none")
serialized_without_prefixes = g.serialize(format="turtle", generate_prefixes=False)
serialized_with_prefixes = g.serialize(format="turtle", generate_prefixes=True)

Any thoughts? I could also go with the reverse approach where the default behavior remains the same and an optional param is added to disable the predicate prefix generation. This would be non-breaking, but could be less intuitive for new users.

mickremedi • May 07 '24 22:05

Having the two options - to generate and to not generate prefixes - with a documented default sounds great, please do make a PR!

nicholascar • May 18 '24 22:05