Turtle parser mishandles tabs in URIs

Open bvosburgh-tq opened this issue 3 years ago • 2 comments

Version

4.6.1

What happened?

When the Turtle parser encounters a tab ('\t') in a URI/IRI, the character is treated slightly differently than other similar "bad"/"illegal" characters.

When using a Turtle parser configured for "lax" handling of invalid URIs, like this:

RDFParser.create()
	.source(in)
	.lang(Lang.TURTLE)
	.resolveURIs(false)
	.errorHandler(ErrorHandlerFactory.errorHandlerWarning(null))
	.parse(model);

and the parser encounters a tab in a URI, the result is a NullPointerException in later parser processing:

java.lang.NullPointerException: Cannot invoke "String.startsWith(String)" because "iri" is null
	at org.apache.jena.riot.system.RiotLib.isBNodeIRI(RiotLib.java:107)
	at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:185)
	at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
	at org.apache.jena.riot.lang.LangTurtleBase.tokenAsNode(LangTurtleBase.java:577)
	at org.apache.jena.riot.lang.LangTurtleBase.node(LangTurtleBase.java:410)
	at org.apache.jena.riot.lang.LangTurtleBase.triplesNode(LangTurtleBase.java:445)
	at org.apache.jena.riot.lang.LangTurtleBase.objectList(LangTurtleBase.java:419)
	at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:352)
	at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:333)
	at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:314)
	at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:178)
	at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
	at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:79)
	at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)

Other problematic characters (e.g. '{', '}', '"') are handled more gracefully: they generate a call to ErrorHandler.warning(...) or ErrorHandler.error(...). If the error handler does not throw an exception (as in the "lax" case above), the parser leaves the character in the URI and continues processing.

It seems tabs should be handled the same way.
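
To make the requested behaviour concrete, here is a minimal sketch of treating a tab like any other illegal IRI character: report it through a warning callback, keep it in the IRI, and carry on. This is not Jena's tokenizer code. The class LaxIriScan, its methods, and the simplified character set are all hypothetical, and escape sequences are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch only; this is NOT Jena's TokenizerText.
class LaxIriScan {
    // Assumption: a simplified set of characters that are illegal inside <...>.
    // Control characters, space and tab (<= 0x20) are treated like '{', '}', '"', etc.
    static boolean isBad(char ch) {
        return ch <= 0x20 || "<\"{}|^`".indexOf(ch) >= 0;
    }

    // Reads IRI text starting just after '<' and stopping at the closing '>'.
    // Every bad character is reported to 'warn' and then kept in the result.
    static String readIri(String input, int start, Consumer<String> warn) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch == '>') {
                return sb.toString();
            }
            if (isBad(ch)) {
                warn.accept(String.format("Bad character in IRI: 0x%02X", (int) ch));
            }
            sb.append(ch); // lax handling: the character stays in the IRI
        }
        throw new IllegalArgumentException("Unterminated IRI");
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        String iri = readIri("<http://example/invalid/iri/with_\t_tab>", 1, warnings::add);
        System.out.println("IRI length: " + iri.length());
        System.out.println("Warnings: " + warnings);
    }
}
```

The point of the sketch is that the scan never yields a null IRI: a handler that only logs lets parsing continue, while a handler that throws stops it, which is the same choice the parser already offers for '{' and '}'.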

Relevant output and stacktrace

No response

Are you interested in making a pull request?

Yes

bvosburgh-tq, Sep 22 '22 04:09

There's no input shown.

What effect is the .resolveURIs(false) having? I can't reproduce needing it.

afs, Sep 24 '22 11:09

> There's no input shown.

Ahhh, sorry about that. An example URI would be "<http://example/invalid/iri/with_\t_tab>".
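
For anyone trying to reproduce this, a complete input can be built along these lines. This is a sketch under assumptions: the "\t" in the example is taken to be a literal tab character (Java's string escape), the subject and predicate IRIs are made up, and the bad IRI is placed in object position because the stack trace passes through objectList. The class TabIriInput is a hypothetical helper, not part of Jena.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for building a reproduction input; not part of Jena.
class TabIriInput {
    // One triple whose object IRI contains a literal tab character.
    // The subject and predicate IRIs are invented for this example.
    static String turtle() {
        return "<http://example/s> <http://example/p> "
             + "<http://example/invalid/iri/with_\t_tab> .\n";
    }

    // The same document as a stream, e.g. for the .source(in) call shown in the report.
    static InputStream asStream() {
        return new ByteArrayInputStream(turtle().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String doc = turtle();
        System.out.println("Tab found at index " + doc.indexOf('\t'));
    }
}
```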

> What effect is the .resolveURIs(false) having? I can't reproduce needing it.

Agreed, I don't think that is needed to reproduce the problem. It's just how I had the parser configured when I encountered the issue. In the PR, I have zeroed in on the real problem, which is in TokenizerText.

bvosburgh-tq, Sep 24 '22 12:09