Turtle parser mishandles tabs in URIs
Version
4.6.1
What happened?
When the Turtle parser encounters a tab ('\t') in a URI/IRI, it does not handle the character the same way as other illegal characters.
When using a Turtle parser configured for "lax" handling of invalid URIs, like this:
RDFParser.create()
    .source(in)
    .lang(Lang.TURTLE)
    .resolveURIs(false)
    .errorHandler(ErrorHandlerFactory.errorHandlerWarning(null))
    .parse(model);
and the parser encounters a tab in a URI, the result is a NullPointerException in later parser processing:
java.lang.NullPointerException: Cannot invoke "String.startsWith(String)" because "iri" is null
at org.apache.jena.riot.system.RiotLib.isBNodeIRI(RiotLib.java:107)
at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:185)
at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
at org.apache.jena.riot.lang.LangTurtleBase.tokenAsNode(LangTurtleBase.java:577)
at org.apache.jena.riot.lang.LangTurtleBase.node(LangTurtleBase.java:410)
at org.apache.jena.riot.lang.LangTurtleBase.triplesNode(LangTurtleBase.java:445)
at org.apache.jena.riot.lang.LangTurtleBase.objectList(LangTurtleBase.java:419)
at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:352)
at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:333)
at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:314)
at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:178)
at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:79)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
Other problematic characters (e.g. '{', '}', '"') are handled more gracefully: they generate a call to ErrorHandler.warning(...) or .error(...) and, if the error handler does not throw an exception (as in the "lax" case above), the parser leaves the character in the URI and continues processing.
It seems tabs should be handled the same way.
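To make the expected behaviour concrete, here is a minimal, self-contained sketch of "warn and keep the character" handling. It is not Jena code and does not reflect how TokenizerText is implemented: the class LenientIriSketch, the WarningCollector type, and the list of characters in isIllegalInIri are all hypothetical names and choices made up for illustration. It only shows the outcome I would expect for a tab: one warning, with the character left in the IRI string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Jena.
class LenientIriSketch {

    /** Records warnings instead of throwing, like a "lax" error handler. */
    static final class WarningCollector {
        final List<String> warnings = new ArrayList<>();

        void warning(String message) {
            warnings.add(message);
        }
    }

    /** Characters this sketch reports as illegal (an assumed list, not Jena's). */
    static boolean isIllegalInIri(char ch) {
        switch (ch) {
            case '{':
            case '}':
            case '"':
            case '\t': // the tab is treated exactly like the other cases
                return true;
            default:
                return false;
        }
    }

    /**
     * Reads an IRI token of the form <...> from the start of the input.
     * Illegal characters are reported to the handler and then kept.
     */
    static String readIri(String input, WarningCollector handler) {
        if (input.isEmpty() || input.charAt(0) != '<') {
            throw new IllegalArgumentException("Expected '<' at start of IRI token");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch == '>') {
                return sb.toString(); // end of token: never returns null
            }
            if (isIllegalInIri(ch)) {
                handler.warning(String.format(
                        "Illegal character in IRI (codepoint 0x%02X)", (int) ch));
            }
            sb.append(ch); // keep the character and continue
        }
        throw new IllegalArgumentException("Unterminated IRI token");
    }

    public static void main(String[] args) {
        WarningCollector handler = new WarningCollector();
        String iri = readIri("<http://example/invalid/iri/with_\t_tab>", handler);
        System.out.println(iri.contains("\t"));      // prints "true"
        System.out.println(handler.warnings.size()); // prints "1"
    }
}
```

The point of the sketch is that the scanner always returns a non-null string once the closing '>' is found, so later processing never sees a null IRI, whichever illegal character was encountered.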
Relevant output and stacktrace
No response
Are you interested in making a pull request?
Yes
There's no input shown.
What effect is the .resolveURIs(false) having? I can't reproduce needing it.
There's no input shown.
Ahhh, sorry about that. An example URI would be "<http://example/invalid/iri/with_\t_tab>".
What effect is the .resolveURIs(false) having? I can't reproduce needing it.
Agreed, I don't think that is needed to reproduce the problem. It's just how I had the parser configured when I encountered the issue. In the PR, I have zeroed in on the real problem, which is in TokenizerText.