Jena Turtle parser mishandles tabs in URIs

Version

4.6.1

What happened?

When the Turtle parser encounters a tab ('\t') in a URI/IRI, it handles the character differently from other "bad"/"illegal" characters.

When using a Turtle parser configured for "lax" handling of invalid URIs, like this:

RDFParser.create()
	.source(in)
	.lang(Lang.TURTLE)
	.resolveURIs(false)
	.errorHandler(ErrorHandlerFactory.errorHandlerWarning(null))
	.parse(model);

and the parser encounters a tab in a URI, the result is a NullPointerException in later parser processing:

java.lang.NullPointerException: Cannot invoke "String.startsWith(String)" because "iri" is null
	at org.apache.jena.riot.system.RiotLib.isBNodeIRI(RiotLib.java:107)
	at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:185)
	at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
	at org.apache.jena.riot.lang.LangTurtleBase.tokenAsNode(LangTurtleBase.java:577)
	at org.apache.jena.riot.lang.LangTurtleBase.node(LangTurtleBase.java:410)
	at org.apache.jena.riot.lang.LangTurtleBase.triplesNode(LangTurtleBase.java:445)
	at org.apache.jena.riot.lang.LangTurtleBase.objectList(LangTurtleBase.java:419)
	at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:352)
	at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:333)
	at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:314)
	at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:178)
	at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
	at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:79)
	at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)

Other problematic characters (e.g. '{', '}', '"') are handled more gracefully: they generate a call to ErrorHandler.warning(...) or .error(...) and, if the error handler does not throw an exception (as in the "lax" case above), the parser leaves the character in the URI and continues processing.

It seems tabs should be handled the same way.
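
To make the expected behaviour concrete, here is a minimal standalone sketch of the "warn, keep the character, continue" handling described above, with tab treated like any other illegal character. It is not Jena's TokenizerText code: the class name LaxIriScanner, the readIri method and the Consumer-based warning callback are hypothetical stand-ins, and the set of illegal characters is loosely modelled on the Turtle IRIREF production (backslash and \u escapes are ignored here).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Illustrative sketch only; hypothetical names, not Jena's tokenizer. */
class LaxIriScanner {

    /** Assumption for this sketch: control chars, space, and <>"{}|^` are illegal in an IRI. */
    private static boolean isIllegal(char ch) {
        // Tab (0x09) falls in the <= 0x20 range, so it takes the same path as '{' or '"'.
        return ch <= 0x20 || "<>\"{}|^`".indexOf(ch) >= 0;
    }

    /**
     * Returns the text between '<' and '>' in the token. Every illegal character is
     * reported to onWarning; if the callback returns normally, the character is kept.
     */
    static String readIri(String token, Consumer<String> onWarning) {
        int last = token.length() - 1;
        if (last < 1 || token.charAt(0) != '<' || token.charAt(last) != '>') {
            throw new IllegalArgumentException("Expected <...> but got: " + token);
        }
        StringBuilder iri = new StringBuilder();
        for (int i = 1; i < last; i++) {
            char ch = token.charAt(i);
            if (isIllegal(ch)) {
                // A strict handler could throw here; a lax one just records the warning.
                onWarning.accept(String.format("Illegal character in IRI (codepoint 0x%02X)", (int) ch));
            }
            iri.append(ch); // lax behaviour: leave the character in place and carry on
        }
        return iri.toString();
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        String iri = readIri("<http://example/invalid/iri/with_\t_tab>", warnings::add);
        System.out.println("tab kept: " + iri.contains("\t"));
        System.out.println("warnings: " + warnings);
    }
}
```

The point of the sketch is that the scanner never returns null for a tab: it either lets the handler throw or hands back a complete IRI string, so later stages have something to work with.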

Relevant output and stacktrace

No response

Are you interested in making a pull request?

Yes

Sep 22 '22 04:09 bvosburgh-tq

There's no input shown.

What effect is the .resolveURIs(false) having? I can't reproduce needing it.

Sep 24 '22 11:09 afs

> There's no input shown.

Ahhh, sorry about that. An example URI would be "<http://example/invalid/iri/with_\t_tab>".
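
For anyone trying to reproduce this, a small sketch of how the input stream passed to .source(in) might be built, assuming the \t in the example stands for a literal tab character. The class name TabIriInput and the subject and predicate IRIs are made up for illustration; only the object IRI comes from this thread, and it is placed in object position because the stack trace goes through objectList.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds a one-triple Turtle document with a tab in the object IRI. */
class TabIriInput {

    // Subject and predicate are placeholders; the Java escape \t becomes a real tab (0x09).
    static final String TURTLE =
            "<http://example/s> <http://example/p> <http://example/invalid/iri/with_\t_tab> .\n";

    /** Opens the document as a UTF-8 byte stream, suitable for a parser source. */
    static InputStream open() {
        return new ByteArrayInputStream(TURTLE.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = open()) {
            byte[] bytes = in.readAllBytes();
            int tabs = 0;
            for (byte b : bytes) {
                if (b == '\t') {
                    tabs++;
                }
            }
            System.out.println("tab bytes in input: " + tabs);
        }
    }
}
```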

> What effect is the .resolveURIs(false) having? I can't reproduce needing it.

Agreed, I don't think that is needed to reproduce the problem. It's just how I had the parser configured when I encountered the issue. In the PR, I have zeroed in on the real problem, which is in TokenizerText.

Sep 24 '22 12:09 bvosburgh-tq