rdflib
Bug: Unexpected namespace creation during turtle file serialization
Hello! I've noticed unexpected behavior when serializing a graph to Turtle: adding a triple to an empty graph and then serializing it adds a prefix to the Turtle output and binds that prefix on the graph, even though no namespaces were bound beforehand:
import rdflib
TRIPLE = (
rdflib.URIRef("http://example1.com/s"),
rdflib.URIRef("http://example2.com/p"),
rdflib.Literal("some literal"),
)
g = rdflib.Graph(bind_namespaces="none")
g.add(TRIPLE)
print("Namespaces Before:", list(g.namespaces()))
x = g.serialize(format="turtle")
print(x)
print("Namespaces After:", list(g.namespaces()))
Results in:
Namespaces Before: []
@prefix ns1: <http://example2.com/> .
<http://example1.com/s> ns1:p "some literal" .
Namespaces After: [('ns1', rdflib.term.URIRef('http://example2.com/'))]
Whereas one would expect:
Namespaces Before: []
<http://example1.com/s> <http://example2.com/p> "some literal" .
Namespaces After: []
I've boiled it down to the following line: https://github.com/RDFLib/rdflib/blob/fb43b7afe80175aedd87506899dff2ccdb312c66/rdflib/plugins/serializers/turtle.py#L270
Here we create a new prefix if we're looking at the predicate of a triple during serialization. I can't trace this change through the blame history, and I can't find any docs explaining that serialize modifies the graph. Does anyone know why this was put there, and whether it can be changed to self.getQName(node, gen_prefix=False)? This seems to have already been done for trig files in #2467.
Running into the same issue
Please solve this!
Quick note: I've been able to patch this bug for now by overriding the getQName() method:
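from rdflib.plugins.serializers.turtle import TurtleSerializer

class FixedTurtleSerializer(TurtleSerializer):
    def getQName(self, uri, gen_prefix=True):
        # Always pass gen_prefix=False so serialization never generates a prefix.
        return super().getQName(uri, gen_prefix=False)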
This handles the fact that several places in this serializer call the method. I'm considering opening a PR that changes serialize so it does not generate namespaces by default, since:
- There are no docs explaining this behavior
- Serializers generally don't modify the data of what they are serializing
- If generating prefixes is done to help optimize the resulting ttl, it would make more sense to apply this generation to any identifier found in a triple statement rather than just the predicate (this can also be added in the PR)
A possible API could look like:
g = rdflib.Graph(bind_namespaces="none")
serialized_without_prefixes = g.serialize(format="turtle", generate_prefixes=False)
serialized_with_prefixes = g.serialize(format="turtle", generate_prefixes=True)
Any thoughts? I could also go with the reverse approach where the default behavior remains the same and an optional param is added to disable the predicate prefix generation. This would be non-breaking, but could be less intuitive for new users.
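To make the intended semantics concrete, here is a toy sketch in plain Python (no rdflib) of what the two modes could produce if prefix generation applied to every IRI in a triple rather than only the predicate. Everything in it is hypothetical: `to_turtle`, `split_iri`, and `is_iri` are illustration-only helpers, terms are plain strings, and the literal quoting and local-name handling are far simpler than what real Turtle requires. The point is only that the prefix table lives inside the call, so the input is never modified.

```python
from typing import Dict, Iterable, Tuple

# Toy model: a triple is three strings; anything that is not an http(s) IRI
# is treated as a plain literal.
Triple = Tuple[str, str, str]


def is_iri(term: str) -> bool:
    return term.startswith(("http://", "https://"))


def split_iri(iri: str) -> Tuple[str, str]:
    """Split an IRI after its last '/' or '#' into (namespace, local name)."""
    cut = max(iri.rfind("/"), iri.rfind("#")) + 1
    return iri[:cut], iri[cut:]


def to_turtle(triples: Iterable[Triple], generate_prefixes: bool = False) -> str:
    # namespace -> generated prefix; local to this call, so nothing the
    # caller owns is mutated.
    prefixes: Dict[str, str] = {}

    def render(term: str) -> str:
        if not is_iri(term):
            return '"%s"' % term  # naive quoting, no escaping
        namespace, local = split_iri(term)
        if generate_prefixes and local:
            prefix = prefixes.setdefault(namespace, "ns%d" % (len(prefixes) + 1))
            return "%s:%s" % (prefix, local)
        return "<%s>" % term

    body = [" ".join(render(term) for term in triple) + " ." for triple in triples]
    header = ["@prefix %s: <%s> ." % (p, ns) for ns, p in prefixes.items()]
    return "\n".join(header + body)


TRIPLE = ("http://example1.com/s", "http://example2.com/p", "some literal")

print(to_turtle([TRIPLE], generate_prefixes=False))
print(to_turtle([TRIPLE], generate_prefixes=True))
```

With generate_prefixes=False every IRI is written in full; with generate_prefixes=True both the subject and the predicate get a generated prefix, not just the predicate.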
Having the two options - to generate and to not generate prefixes - with a documented default sounds great, please do make a PR!