rdflib
RDFlib makes invalid Turtle cURIs
Some cURIs cannot be parsed with RDFlib, even those produced by RDFlib.
For example:
If we have the following RDF/TURTLE:
@prefix ex: <https://example.org/term#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://example.org/entity> ex:anchorOf "This press"^^xsd:string ;
a <http://dbpedia.org/resource/This_(journal)> .
We load the TTL using g.parse(data=ttl) and serialize it with g.serialize(). We get:
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix ex: <https://example.org/term#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/entity> a dbr:This_(journal) ;
ex:anchorOf "This press"^^xsd:string .
Then, if we load the output again with g.parse(data=ttl, format="turtle"), we get the following error:
rdflib.plugins.parsers.notation3.BadSyntax: at line 5 of <>: Bad syntax (expected '.' or '}' or ']' at end of statement) at ^ in: "...b'/2001/XMLSchema#> .\n\nhttp://example.org/entity a dbr:This_'^b'(journal) ;\n ex:anchorOf "This press"^^xsd:string .'"
The error is caused by dbr:This_(journal), more precisely by the '(' and ')'. It seems RDFlib does not like its own output.
Any ideas on what may be happening here?
@Ga11u yes it is a bug and I think we've seen this pop up here and there but thanks to you for making the problem explicit now in an Issue. I've flagged it as a bug.
Surely if a URI shortening with prefix makes an invalid cURI, then the shortening should not be implemented for that URI, even if the prefix is set!
I've tried this data in Jena like so:
// test data, with an additional test triple
String d1 = "@prefix ex: <https://example.org/term#> ."+
"@prefix xsd: <http://www.w3.org/2001/XMLSchema#>."+
"@prefix dbr: <http://dbpedia.org/resource/> ."+
"@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> ."+
""+
"<http://example.org/entity> ex:anchorOf \"This press\"^^xsd:string ;"+
" a <http://dbpedia.org/resource/This_(journal)> ;"+
" ex:something <http://dbpedia.org/resource/OtherFakeResource> .";
// read the Turtle data into a Jena Model
Model m1 = ModelFactory.createDefaultModel();
m1.read(IOUtils.toInputStream(d1, "UTF-8"), null, "TTL");
// print the Jena Model in Turtle to check
StringWriter d2 = new StringWriter();
m1.write(System.out, "TTL" );
// we get:
/*
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix ex: <https://example.org/term#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/entity>
rdf:type <http://dbpedia.org/resource/This_(journal)> ;
ex:anchorOf "This press" ;
ex:something dbr:OtherFakeResource .
*/
So Jena compresses http://dbpedia.org/resource/OtherFakeResource since it makes a valid cURI, but it doesn't compress http://dbpedia.org/resource/This_(journal) and thus can repeat parse/write cycles just fine.
Renaming issue to better capture the problem
Need to check if dbr:This_(journal) really is an invalid cURI or rather an invalid Turtle prefixed URI (I forget the exact name).
@nicholascar I want to work on this issue. Can you suggest how can I start with it? I have cloned the repo and reproduced the error. Can you please suggest some possible fixes? Thank you!
One possible workaround is to check whether the reference part of the CURIE (i.e., the part after the :) contains (, ), [, ] or any other character that may cause the error. If it contains such a character, RDFlib writes the full URI instead of a CURIE.
This solution should emulate the same behaviour as Jena. However, I am not sure if this is the optimal or best way to go, as we are not solving the problem, just adding a patch.
(I am using the word CURIE instead of cURI. They are the same, but I think the W3C uses CURIE)
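The workaround described above can be sketched in plain Python, without touching RDFLib. This is only an illustration: to_turtle_term and SAFE_LOCAL are hypothetical names, not RDFLib APIs, and the pattern is a deliberately conservative ASCII-only subset of what Turtle allows in the local part of a prefixed name. Anything outside that subset falls back to the full IRI in angle brackets, which is the behaviour Jena shows above.

```python
import re

# Conservative, ASCII-only subset of Turtle's PN_LOCAL production (an assumption
# made for this sketch): starts with a letter, digit or '_', may contain '.' and
# '-' in the middle, and must not end with '.'.
SAFE_LOCAL = re.compile(r"[A-Za-z0-9_](?:[A-Za-z0-9_.\-]*[A-Za-z0-9_\-])?")


def to_turtle_term(iri, prefixes):
    """Return 'prefix:local' when the local part is safe, else '<iri>'.

    `prefixes` maps prefix labels to namespace IRIs, e.g. {"dbr": "http://..."}.
    """
    # Try the longest namespace first so the most specific prefix wins.
    for prefix, ns in sorted(prefixes.items(), key=lambda kv: len(kv[1]), reverse=True):
        if iri.startswith(ns):
            local = iri[len(ns):]
            if SAFE_LOCAL.fullmatch(local):
                return f"{prefix}:{local}"
    # No prefix matched, or the local part needs escaping: emit the full IRI.
    return f"<{iri}>"


PREFIXES = {"dbr": "http://dbpedia.org/resource/"}
print(to_turtle_term("http://dbpedia.org/resource/OtherFakeResource", PREFIXES))
print(to_turtle_term("http://dbpedia.org/resource/This_(journal)", PREFIXES))
```

The trade-off is that some IRIs that could legally be abbreviated (with escapes, or with non-ASCII characters) are written in full, in exchange for output that any Turtle parser should accept.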
@kone807 I do think the starting point here is to review the formal definitions of CURIE and the Turtle specification, as @Ga11u says above and then, when they are known to you, create tests that should pass but currently fail. If you can do that, you might then be able to either design a solution or ask other RDFLib maintainers for assistance.
We really do need to confirm what is allowed and then we can see if the Jena approach is valid. If it is, I think the coding to find certain characters in the potential CURIES won't be that hard so preventing RDFLib from shortening problematic CURIES will be possible.
The CURIE specification states the following:
The concatenation of the prefix value associated with a CURIE and its reference MUST be an IRI (as defined by the IRI production in [IRI]).
According to URI rules, '(' and ')' are valid characters. Using '[', ']', '{', '}' or '^' gives an error.
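For comparison, here is a small sketch of which characters the Turtle grammar itself forbids inside a full IRI written between angle brackets (the IRIREF production). The function name forbidden_iriref_chars is made up for this example, and the check ignores \u escapes. Note that this production does not exclude '[' and ']', so the bracket errors mentioned above would have to come from a different rule; that is an observation about the grammar only, not about what RDFLib's parser does.

```python
import re

# Turtle's IRIREF production excludes U+0000-U+0020 and the characters
# < > " { } | ^ ` \  (numeric \u escapes are not considered in this sketch).
FORBIDDEN_IN_IRIREF = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def forbidden_iriref_chars(iri):
    """Return the sorted list of distinct characters in `iri` that IRIREF forbids."""
    return sorted(set(FORBIDDEN_IN_IRIREF.findall(iri)))


# Parentheses are fine in a full IRI; braces and '^' are not.
print(forbidden_iriref_chars("http://dbpedia.org/resource/This_(journal)"))
print(forbidden_iriref_chars("http://example.org/a{b}^c"))
```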
Also, I tried to reproduce the error on my system and I don't seem to get one. Here's the code -
from rdflib import Graph
import pprint
"""
test.ttl
@prefix ex: <https://example.org/term#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://example.org/entity> ex:anchorOf "This press"^^xsd:string ;
a <http://dbpedia.org/resource/This_(journal)> .
"""
g = Graph()
g.parse("test.ttl",format="ttl")
for stmt in g:
    pprint.pprint(stmt)
print("\n",25*"==","\n")
g.serialize(destination="test_parsed.ttl")
"""
test_parsed.ttl
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix ex: <https://example.org/term#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/entity> a dbr:This_\(journal\) ;
ex:anchorOf "This press"^^xsd:string .
"""
g1 = Graph()
g1.parse("test_parsed.ttl",format="ttl")
for stmt in g1:
    pprint.pprint(stmt)
Observation: In contrast to the original snippet, the parentheses are now preceded by a backslash, which results in correct parsing.
@kone807 Have you tried to serialize without saving into a file, just g.serialize()? You should get a string, which I assume will not contain the backslash, and then you can parse the string again. This should give you the error.
There are a couple of different scenarios in which Python -> RDF -> Python -> RDF cycles break due to string encoding, for example, regular expressions in SHACL sh:pattern properties. The problem there is that there are just too many different encoding systems at play that require different kinds of character escaping, which get in each other's way! It may not always be feasible to solve absolutely all of these issues.
If the cycle above passes, let's write that up as a pytest test. Then yes, another one can be made that serializes/deserializes to a file and we can see if that one fails.
@Ga11u I tried and still works fine -
from rdflib import Graph
import pprint
"""
test.ttl
@prefix ex: <https://example.org/term#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://example.org/entity> ex:anchorOf "This press"^^xsd:string ;
a <http://dbpedia.org/resource/This_(journal)> .
"""
g = Graph()
g.parse("test.ttl",format="ttl")
for stmt in g:
    pprint.pprint(stmt)
print("\n",25*"==","\n")
s = g.serialize()
print(s)
"""
test_parsed.ttl
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix ex: <https://example.org/term#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/entity> a dbr:This_\(journal\) ;
ex:anchorOf "This press"^^xsd:string .
"""
g1 = Graph()
g1.parse(data=s,format="ttl")
for stmt in g1:
    pprint.pprint(stmt)
@nicholascar so I should try to find a cycle which leads to the intended error and write similar test cases for it (as currently I can't seem to replicate it using RDF -> Python -> RDF -> Python or RDF -> Python -> String -> Python). Did I understand it correctly?
It is important here (in parsing and serializing various forms) to carefully notice and comply with the appropriate specifications:
- QNames (for XML elements and attributes)
- PNames (for Turtle, TriG and SPARQL)
- CURIEs (for RDFa, mainly)
- Compact IRIs (for JSON-LD; mostly(?) identical to CURIEs)
This list is informally ordered from most to least restrictive in terms of what characters are allowed. In PNames, more are allowed than in QNames, but some have to be escaped; in CURIEs, no valid IRI character has to be escaped in the local part. The rules for how prefixes are defined and their forms also differ. (In all but QNames, the _ prefix is used for blank node identifiers. In RDFa and JSON-LD, CURIEs mainly share the same lexical space as regular IRIs, for better or worse.)
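To make the PName point concrete (some characters are allowed in the local part only when backslash-escaped), here is a minimal sketch. escape_local is a hypothetical helper, not an RDFLib function. It only handles ASCII input and uses the character set from the Turtle grammar's PN_LOCAL_ESC production; characters that cannot be escaped this way (spaces, brackets and so on) are passed through unchanged, so a real serializer would need to fall back to the full IRI for those.

```python
# Characters from Turtle's PN_LOCAL_ESC production that this sketch always
# escapes. '_' , '-' and '.' are also escapable, but are usually legal as-is.
ALWAYS_ESCAPE = "~!$&'()*+,;=/?#@%"


def escape_local(local):
    """Backslash-escape the local part of a Turtle prefixed name (ASCII sketch)."""
    out = []
    last = len(local) - 1
    for i, ch in enumerate(local):
        if ch in ALWAYS_ESCAPE:
            out.append("\\" + ch)
        elif ch == "-" and i == 0:
            # A local name may not start with an unescaped '-'.
            out.append("\\-")
        elif ch == "." and i in (0, last):
            # A local name may not start or end with an unescaped '.'.
            out.append("\\.")
        else:
            out.append(ch)
    return "".join(out)


print("dbr:" + escape_local("This_(journal)"))
```

This matches the form that later RDFLib versions are reported to produce in this thread (dbr:This_\(journal\)).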
@kone807 Have you tried with format="turtle" instead of ttl?
Some cURIs cannot be parsed with RDFlib, even those produced by RDFlib.
That may be true but I can't seem to reproduce the issue you encountered with parsing the serialized example whether it's serialized to a string and parsed, serialized and directly fed to the parser or serialized to and read back from a file.
from rdflib import Graph, URIRef


def test_turtle_serialize():
    test_ttl = """@prefix ex: <https://example.org/term#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://example.org/entity> ex:anchorOf "This press"^^xsd:string ;
a <http://dbpedia.org/resource/This_(journal)> .
"""
    expected_statement = (
        URIRef('http://example.org/entity'),
        URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
        URIRef('http://dbpedia.org/resource/This_(journal)')
    )
    # Parse the original Turtle and check the rdf:type triple.
    g = Graph()
    g.parse(data=test_ttl, format="turtle")
    # Fixed: this line referenced an undefined name, `expected_error`.
    assert sorted(list(g))[0] == expected_statement
    data_str = g.serialize(format="turtle")
    assert data_str == """\
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix ex: <https://example.org/term#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/entity> a dbr:This_\(journal\) ;
ex:anchorOf "This press"^^xsd:string .
"""
    # Round trip 1: parse the serialized string.
    g1 = Graph()
    g1.parse(data=data_str, format="turtle")
    assert sorted(list(g1))[0] == expected_statement
    assert g1.serialize(format='turtle') == data_str
    # Round trip 2: serialize and feed directly to the parser.
    g2 = Graph()
    g2.parse(data=g1.serialize(format='turtle'), format="turtle")
    assert sorted(list(g2))[0] == expected_statement
    assert g2.serialize(format='turtle') == data_str
    # Round trip 3: serialize to a file and read it back.
    g.serialize("/tmp/test.ttl", format='turtle')
    """
$ cat /tmp/test.ttl
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix ex: <https://example.org/term#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/entity> a dbr:This_\(journal\) ;
ex:anchorOf "This press"^^xsd:string .
"""
    data = open("/tmp/test.ttl", 'r').read()
    g3 = Graph()
    g3.parse(data=data, format="turtle")
    assert sorted(list(g3))[0] == expected_statement
    assert g3.serialize(format='turtle') == data_str
    # Write the original (unescaped) Turtle to the file; the with-block closes it.
    with open("/tmp/test.ttl", "w") as fp:
        fp.write(test_ttl)
    """
$ cat /tmp/test.ttl
@prefix ex: <https://example.org/term#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://example.org/entity> ex:anchorOf "This press"^^xsd:string ;
a <http://dbpedia.org/resource/This_(journal)> .
"""
    # Parse from an open file object.
    g4 = Graph()
    datafile = open("/tmp/test.ttl", 'r')
    g4.parse(file=datafile, format="turtle")
    assert sorted(list(g4))[0] == expected_statement
    assert g4.serialize(format='turtle') == data_str
    # Parse from a file path.
    g5 = Graph()
    g5.parse("/tmp/test.ttl", format="turtle")
    assert sorted(list(g5))[0] == expected_statement
    assert g5.serialize(format='turtle') == data_str
The bug was identified last year using RDFlib v5 and we are currently at RDFlib v6.1. I reviewed the releases and it seems there have been several updates to the parsers and serializers. Is it possible that the bug is already fixed? At least, the outputs you show contain a backslash, which was not produced by previous versions.
However, I am wondering if adding a backslash is the best approach and how compatible it is with other libraries or applications. One possible drawback of this approach is that string parsing using regex or other libraries may resolve dbr:This_\(journal\) to <http://dbpedia.org/resource/This_\(journal\)> instead of its original form. Using regex, it would not be obvious that This_(journal) is the same as This_\(journal\). Also, with entity matching, dbr:This_\(journal\) may not be identified as the same as dbr:This_(journal).
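To illustrate the concern about naive string handling: a tool that expands prefixed names by plain concatenation would need to remove the Turtle escapes first. The sketch below shows one way to do that with a regex. expand_pname is a hypothetical helper written for this comment, not part of RDFLib, and it assumes the input is a single well-formed prefixed name whose prefix is present in the mapping.

```python
import re

# Matches a backslash followed by one of the characters that Turtle's
# PN_LOCAL_ESC production allows to be escaped; group 1 is the bare character.
_PN_LOCAL_ESC = re.compile(r"\\([_~.\-!$&'()*+,;=/?#@%])")


def expand_pname(pname, prefixes):
    """Expand 'prefix:local' to a full IRI string, undoing local-part escapes."""
    prefix, _, local = pname.partition(":")
    return prefixes[prefix] + _PN_LOCAL_ESC.sub(r"\1", local)


PREFIXES = {"dbr": "http://dbpedia.org/resource/"}
print(expand_pname("dbr:This_\\(journal\\)", PREFIXES))
```

With the unescaping step, dbr:This_\(journal\) resolves to the same IRI as the original <http://dbpedia.org/resource/This_(journal)>; without it, the backslashes would leak into the IRI as described above.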
The bug was identified last year using RDFlib v5 and we are currently at RDFlib v6.1. I reviewed the releases and it seems there have been several updates to the parsers and serializers. Is it possible that the bug is already fixed? At least, the outputs you show contain a backslash, which was not produced by previous versions.
Yes, that's entirely possible. I'm just trying to narrow down the scope of the extant issue is all.
However, I am wondering if adding a backslash is the best approach and how compatible it is with other libraries or applications.
Fair comment, it's not compatible, at least as far as RDFConvert (elderly but still serviceable, just like yours truly :smile:) is concerned:
$ ~/Apps/rdfconvert-0.4/bin/rdfconvert.sh -i Turtle -o N3 /tmp/test.ttl
11:51:39.534 [main] ERROR nz.co.rivuli.rio.convert.RDFConvert - Syntax error in input file:
11:51:39.535 [main] ERROR nz.co.rivuli.rio.convert.RDFConvert - IRI includes string escapes: '\40' [line 7]
So that leaves us back at @niklasl's cautioning observation: "It is important here (in parsing and serializing various forms) to carefully notice and comply with the appropriate specifications"
It seems to work now with rdflib 7.0.0 (tested via ttlser)
$ echo '
@prefix foo: <http://foo#> .
<http://foo#bar(ciao)> foo:a "ciao" .
' | ttlfmt
@prefix foo: <http://foo#> .
### Annotations
foo:bar\(ciao\) foo:a "ciao" .
### Serialized using the ttlser deterministic serializer v1.2.1
@Ga11u @aucampia can we say that this is now fixed?