Poor performance when parsing huge literal in query (e.g. 100MB)

Open SimonBin opened this issue 2 years ago • 31 comments

The cause seems to be https://github.com/javacc/javacc/issues/72

We encountered this issue when a SPARQL SERVICE clause was sending a large-ish Geometry literal of USA to Fuseki. It stalls forever trying to parse the query.

Ideally, the buffer would grow exponentially; alternatively, there is a PR linked in the javacc issue. Currently, the parsing buffer apparently grows in steps of 2KiB.

jstack:
"qtp771666241-136" #136 prio=5 os_prio=0 cpu=13385,35ms elapsed=5538,68s tid=0x00007fa188007800 nid=0x15730d runnable  [0x00007fa1341f9000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.jena.sparql.lang.arq.SimpleCharStream.ExpandBuff(SimpleCharStream.java:42)
	at org.apache.jena.sparql.lang.arq.SimpleCharStream.FillBuff(SimpleCharStream.java:103)
	at org.apache.jena.sparql.lang.arq.SimpleCharStream.readChar(SimpleCharStream.java:197)
	at org.apache.jena.sparql.lang.arq.ARQParserTokenManager.jjMoveNfa_0(ARQParserTokenManager.java:4369)
	at org.apache.jena.sparql.lang.arq.ARQParserTokenManager.jjMoveStringLiteralDfa0_0(ARQParserTokenManager.java:211)
	at org.apache.jena.sparql.lang.arq.ARQParserTokenManager.getNextToken(ARQParserTokenManager.java:4793)
	at org.apache.jena.sparql.lang.arq.ARQParser.jj_ntk_f(ARQParser.java:8162)
	at org.apache.jena.sparql.lang.arq.ARQParser.PathElt(ARQParser.java:3603)
	at org.apache.jena.sparql.lang.arq.ARQParser.PathEltOrInverse(ARQParser.java:3635)
	at org.apache.jena.sparql.lang.arq.ARQParser.PathSequence(ARQParser.java:3565)
	at org.apache.jena.sparql.lang.arq.ARQParser.PathAlternative(ARQParser.java:3544)
	at org.apache.jena.sparql.lang.arq.ARQParser.Path(ARQParser.java:3538)
	at org.apache.jena.sparql.lang.arq.ARQParser.VerbPath(ARQParser.java:3493)
	at org.apache.jena.sparql.lang.arq.ARQParser.PropertyListPathNotEmpty(ARQParser.java:3418)
	at org.apache.jena.sparql.lang.arq.ARQParser.TriplesSameSubjectPath(ARQParser.java:3365)
	at org.apache.jena.sparql.lang.arq.ARQParser.TriplesBlock(ARQParser.java:2512)
	at org.apache.jena.sparql.lang.arq.ARQParser.GroupGraphPatternSub(ARQParser.java:2425)
	at org.apache.jena.sparql.lang.arq.ARQParser.GroupGraphPattern(ARQParser.java:2387)
	at org.apache.jena.sparql.lang.arq.ARQParser.WhereClause(ARQParser.java:858)
	at org.apache.jena.sparql.lang.arq.ARQParser.SelectQuery(ARQParser.java:137)
	at org.apache.jena.sparql.lang.arq.ARQParser.Query(ARQParser.java:31)
	at org.apache.jena.sparql.lang.arq.ARQParser.QueryUnit(ARQParser.java:22)
	at org.apache.jena.sparql.lang.ParserARQ$1.exec(ParserARQ.java:48)
	at org.apache.jena.sparql.lang.ParserARQ.perform(ParserARQ.java:95)
	at org.apache.jena.sparql.lang.ParserARQ.parse$(ParserARQ.java:52)
	at org.apache.jena.sparql.lang.SPARQLParser.parse(SPARQLParser.java:33)
	at org.apache.jena.query.QueryFactory.parse(QueryFactory.java:144)
	at org.apache.jena.query.QueryFactory.create(QueryFactory.java:83)
	at org.apache.jena.fuseki.servlets.SPARQLQueryProcessor.execute(SPARQLQueryProcessor.java:251)
	at org.apache.jena.fuseki.servlets.SPARQLQueryProcessor.executeBody(SPARQLQueryProcessor.java:234)
	at org.apache.jena.fuseki.servlets.SPARQLQueryProcessor.execute(SPARQLQueryProcessor.java:213)
	at org.apache.jena.fuseki.servlets.ActionService.executeLifecycle(ActionService.java:58)
	at org.apache.jena.fuseki.servlets.SPARQLQueryProcessor.execPost(SPARQLQueryProcessor.java:83)
	at org.apache.jena.fuseki.servlets.ActionProcessor.process(ActionProcessor.java:34)
	at org.apache.jena.fuseki.servlets.ActionBase.process(ActionBase.java:55)
	at org.apache.jena.fuseki.servlets.ActionExecLib.execActionSub(ActionExecLib.java:125)
	at org.apache.jena.fuseki.servlets.ActionExecLib.execAction(ActionExecLib.java:99)
	at org.apache.jena.fuseki.server.Dispatcher.dispatchAction(Dispatcher.java:164)
	at org.apache.jena.fuseki.server.Dispatcher.process(Dispatcher.java:156)
	at org.apache.jena.fuseki.server.Dispatcher.dispatch(Dispatcher.java:83)
	at org.apache.jena.fuseki.servlets.FusekiFilter.doFilter(FusekiFilter.java:48)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1600)
	at org.apache.shiro.web.servlet.ProxiedFilterChain.doFilter(ProxiedFilterChain.java:61)
	at org.apache.shiro.web.servlet.AdviceFilter.executeChain(AdviceFilter.java:108)
	at org.apache.shiro.web.servlet.AdviceFilter.doFilterInternal(AdviceFilter.java:137)
	at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
	at org.apache.shiro.web.servlet.ProxiedFilterChain.doFilter(ProxiedFilterChain.java:66)
	at org.apache.shiro.web.servlet.AdviceFilter.executeChain(AdviceFilter.java:108)
	at org.apache.shiro.web.servlet.AdviceFilter.doFilterInternal(AdviceFilter.java:137)
	at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
	at org.apache.shiro.web.servlet.ProxiedFilterChain.doFilter(ProxiedFilterChain.java:66)
	at org.apache.shiro.web.servlet.AbstractShiroFilter.executeChain(AbstractShiroFilter.java:450)
	at org.apache.shiro.web.servlet.AbstractShiroFilter$1.call(AbstractShiroFilter.java:365)
	at org.apache.shiro.subject.support.SubjectCallable.doCall(SubjectCallable.java:90)
	at org.apache.shiro.subject.support.SubjectCallable.call(SubjectCallable.java:83)
	at org.apache.shiro.subject.support.DelegatingSubject.execute(DelegatingSubject.java:387)
	at org.apache.shiro.web.servlet.AbstractShiroFilter.doFilterInternal(AbstractShiroFilter.java:362)
	at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1600)
	at org.apache.jena.fuseki.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:284)
	at org.apache.jena.fuseki.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:247)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:210)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1600)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:506)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:131)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:578)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1571)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1378)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:463)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1544)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1300)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
	at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:717)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
	at org.eclipse.jetty.server.Server.handle(Server.java:562)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505)
	at org.eclipse.jetty.server.HttpChannel$$Lambda$636/0x000000084084d040.dispatch(Unknown Source)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:497)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:319)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
	at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:412)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:381)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:268)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.lambda$new$0(AdaptiveExecutionStrategy.java:138)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy$$Lambda$624/0x0000000840830c40.run(Unknown Source)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:407)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:894)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1038)
	at java.lang.Thread.run([email protected]/Thread.java:829)

The query is something as simple as

{ ?c ^<http://www.opengis.net/ont/geosparql#sfContains> "<?xml version=\"1.0\" encoding=\"UTF-8\"?><gml:MultiSurface xmlns:gml=\"http://www.opengis.net/ont/gml\" gml:id=\"g2015_2014_0.104.wkb_geometry\" srsDimension=\"2\" srsName=\"urn:ogc:def:crs:EPSG::3857\"><gml:surfaceMember><gml:Polygon gml:id=\"g2015_2014_0.104.wkb_geometry.1\"><gml:exterior><gml:LinearRing><gml:posList>HUGE POS LIST</gml:posList></gml:LinearRing></gml:exterior></gml:Polygon></gml:surfaceMember></gml:MultiSurface>"^^<http://www.opengis.net/ont/geosparql#gmlLiteral> }

automatically injected from a service clause

SimonBin avatar May 19 '22 10:05 SimonBin

Reading the JavaCC issue, it seems the query size is not the problem.

It is the presence of large tokens over 2048 bytes.

@SimonBin Is that correct?

SimpleCharStream and JavaCharStream are not Jena code. They are generated by JavaCC. They are committed to the codebase so people building Jena do not need to install JavaCC themselves.

afs avatar May 19 '22 11:05 afs

Yes I believe this is correct

SimonBin avatar May 19 '22 11:05 SimonBin

I don't see a PR on the javacc issue that is suitable.

There is an interesting suggestion about lexical states. ARQ only parses from strings, not streams, and only from data already converted from UTF-8. Access to the input would enable slicing literals directly out of the string.

Rather than disrupt the existing processing, it could be done with a new token e.g. X"....".

USER_CHAR_STREAM is also an option.
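
To make the string-slicing idea concrete, here is a minimal sketch of a stream that works on an in-memory String and returns token images as slices of it. The class `StringBackedStream` and its method names are hypothetical; they only mirror a few operations of a JavaCC-style char stream and do not implement the real JavaCC CharStream interface that USER_CHAR_STREAM expects.

```java
/**
 * Hypothetical sketch: a character stream over an in-memory String.
 * It mirrors a few operations of a JavaCC-style char stream (begin token,
 * read char, back up, get image) but is NOT the real JavaCC interface.
 */
final class StringBackedStream {
    private final String input;
    private int tokenStart = 0; // index where the current token began
    private int pos = 0;        // index of the next char to read

    StringBackedStream(String input) {
        this.input = input;
    }

    /** Marks the start of a new token and returns its first char. */
    char beginToken() {
        tokenStart = pos;
        return readChar();
    }

    /** Returns the next char; throws when the input is exhausted. */
    char readChar() {
        if (pos >= input.length()) {
            throw new IllegalStateException("end of input");
        }
        return input.charAt(pos++);
    }

    /** Un-reads the last n chars of the current token. */
    void backup(int n) {
        if (pos - n < tokenStart) {
            throw new IllegalArgumentException("cannot back up past token start");
        }
        pos -= n;
    }

    /** The token image is a slice of the original input; nothing is re-buffered. */
    String getImage() {
        return input.substring(tokenStart, pos);
    }

    public static void main(String[] args) {
        StringBackedStream s = new StringBackedStream("\"abc\" ?x");
        char c = s.beginToken();               // consumes the opening quote
        do { c = s.readChar(); } while (c != '"');
        System.out.println(s.getImage());      // the quoted literal, quotes included
    }
}
```

Because the whole query is already held as a String, there is no buffer to expand here; whether this fits the generated token manager's expectations is something that would need checking.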

There is some investigation to do such as updating for Javacc 7.0 (the Jena codebase files were produced from JavaCC 6.0). #1328.

FYI: The different parsers use different techniques to handle Unicode, and this shows up in some tests about surrogate pairs.

afs avatar May 19 '22 19:05 afs

I tried to regenerate the grammars with javacc/javacc#85 applied, as that sounds promising. (you can check the code here: https://github.com/SANSA-Stack/jena/commit/95c41d29c73167fa19da5ad0f501131f54ae3d58 ) but there seems to be a bug in javacc/javacc#85 which makes the parsing fail with ContentIllegalInProlog

SimonBin avatar May 19 '22 21:05 SimonBin

javacc/javacc#85 has not been integrated into javacc releases.

Jena running its own fork of javacc is highly undesirable - it's technical debt. Users must be able to build Jena and while we ship the generated code in a Jena release so users don't have to run javacc themselves (e.g. only old versions are available in Ubuntu app repos) they should be able to.

I'm not even sure javacc/javacc#85 is the right solution - it has access overhead. Maybe that's why ArrayList grows by 1.5 each step.

The case of 100Mb literals in a query is not mainstream :smile: . (Maybe search by SHA512?)

#1328 (JavaCC 7.0 upgrade) is in Jena/main.

(Background: the main RIOT parsers do not use JavaCC.)

ContentIllegalInProlog is an XML error. Maybe it should not have the ?xml part. RDF XML Literals do not have the <?xml>. If you change your example to an rdf:XMLLiteral, there is a warning.

afs avatar May 20 '22 09:05 afs

nb JavaCC 7 does not influence this performance

SimonBin avatar May 20 '22 09:05 SimonBin

The case of 100Mb literals in a query is not mainstream :smile: . (Maybe search by SHA512?)

True. But to give some background (why SHA512 is not a workaround here): we're currently trying to make use of GeoSPARQL in our projects. We gathered boundaries/borders of administrative regions from an external dataset provided as GML (which is in fact XML); for the USA the polygon is ~100MB. We tried to deal with this initially, spotted the limitation in Jena, and in the meantime simplified the polygon directly via a SPARQL query function call (with a custom Jena function, unfortunately: the GeoSPARQL standard lacks quite a lot of things that people, in my opinion, would be happy to have. The good thing is that we can extend Jena for our work).

LorenzBuehmann avatar May 20 '22 14:05 LorenzBuehmann

Doesn't quite explain why there is a 100Mb literal in a query, but no matter.

SPARQL parsing is central so any changes need to be done carefully, and be proven and mature. Like javacc, I'm thinking about unforeseen consequences.

The parser to use is (surprise!) controlled by a registry SPARQLParserRegistry so extension code can change the parser for an experimental one.

Also --

The javacc issue suggests a different approach which is also more efficient - lexical states and lexical actions. The string for the token image can be created directly without going through the javacc buffering.

afs avatar May 20 '22 19:05 afs

Fair. But even if we cannot fix this - also given that it's more a corner case with literals in MB scale and beyond - we wanted to at least report this behavior - could be a good FAQ entry or the like. Luckily, in our case we own both Fuseki instances A (with the polygons) and B (with the points), thus, we indeed switched the direction, gathering the polygon via a SERVICE request from A and doing the point in polygon check in B compared to passing the polygon from A to B to make the point in polygon check via the SERVICE request.

LorenzBuehmann avatar May 21 '22 06:05 LorenzBuehmann

Yes, good to report.

It might be more of an issue in INSERT DATA although then there is usually the option of POSTing RDF.

Fuseki has "Fuseki modules" so adding a modified parser does not require a complete rebuild.

Or in this case the plain Jena initialization that can modify the server.

The lexical actions approach looks interesting.

afs avatar May 21 '22 09:05 afs

Do you have an actual test case and the grammar somewhere I can look? I'm curious.

new-javacc avatar Jul 06 '22 14:07 new-javacc

javacc/javacc#85 has not been integrated into javacc releases.

Jena running its own fork of javacc is highly undesirable - it's technical debt. Users must be able to build Jena and while we ship the generated code in a Jena release so users don't have to run javacc themselves (e.g. only old versions are available in Ubuntu app repos) they should be able to.

Not sure why you needed to fork. Let me know if there is something that I can do to help with that. If the issue is with CharStream, just set USER_CHAR_STREAM=true and write your own stream class that does better buffer management. But like I said, most often it's grammar inefficiencies that manifest this way. And use ideas like lexical states to handle huge tokens more elegantly.

I'm not even sure javacc/javacc#85 is the right solution - it has access overhead. Maybe that's why ArrayList grows by 1.5 each step.

Not sure what this is. I don't think we merged into main I will check.

The case of 100Mb literals in a query is not mainstream :smile: . (Maybe search by SHA512?)

It's not unheard of. But if you expect this to happen even occasionally (as opposed to rarely), you might want to redo your rule for literals. The negation operator does generate a lot of overhead in terms of state etc.

new-javacc avatar Jul 06 '22 14:07 new-javacc

The issue we experience is buffer management. Linear growth of a few Kbytes for 100Mb is a lot of recopying. If it grew at say x1.5 (like a Java ArrayList) the effect would be much less. (The same issues can arise with ArrayList but are much less pronounced.)

The grammar is the grammar in the SPARQL specification - it really "is the grammar" because the HTML in the spec was produced from this JavaCC grammar!

The negation is only 3 chars ahead maximum.

@SimonBin - is this triple-quoted literals or single-quoted?

In both cases, if the XML uses " for attributes, then a ' quoting may make a difference but I suspect the buffer expansion is going to dominate.

afs avatar Jul 07 '22 11:07 afs

The main issue is this is a corner case and doesn't make sense to penalize the normal cases. So you have two options

  1. Edit the generated file and make it the way you want - javacc doesn't overwrite existing files. It's a "feature" if you will but probably looks like a hack to the modern developers :)
  2. Use the USER_CHAR_STREAM option and implement your own charstream - based on the default one.

But like I originally said, it's best and more performant to use lexical states. For example if it's triple quoted and you don't allow triple quotes in the literal, you can use lexical states and do:

<DEFAULT> MORE : { "'''" : QUOTED_CONTENT }

<QUOTED_CONTENT> TOKEN : { <TRIPLE_QUOTED_LITERAL: "'''"> : DEFAULT }
<QUOTED_CONTENT> MORE : { <~[]> }

That just keeps building your literal without actually affecting the charstream. The performance difference could be huge here, even 10x, as it will eat the chars one by one until it sees a ''' so no buffering in the stream is required as there is no backtracking! And the image of the literal is built using the standard StringBuilder, which is hopefully a smarter implementation. So I suggest that option.
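
As a plain-Java illustration of the same idea (not generated JavaCC code), the sketch below reads a delimited literal one character at a time into a StringBuilder and stops at the closing delimiter, with no backtracking. The class `LiteralScanner`, its `scan` method, and the way backslash escapes are kept verbatim are all assumptions made for the example.

```java
/**
 * Hypothetical sketch of "consume until the closing delimiter":
 * characters are appended to a StringBuilder one by one, so no
 * lookahead buffer has to grow with the size of the literal.
 */
final class LiteralScanner {

    /**
     * Reads the literal that starts at input[start] with the given
     * delimiter (for example "'''" or "\"") and returns its raw body.
     * A backslash and the character after it are kept together, so an
     * escaped quote does not end the literal (assumed escape handling).
     */
    static String scan(String input, int start, String delimiter) {
        if (!input.startsWith(delimiter, start)) {
            throw new IllegalArgumentException("literal must start with the delimiter");
        }
        StringBuilder body = new StringBuilder();
        int i = start + delimiter.length();
        while (i < input.length()) {
            if (input.startsWith(delimiter, i)) {
                return body.toString();          // closing delimiter reached
            }
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                body.append(c).append(input.charAt(i + 1));
                i += 2;                          // skip the escaped character
            } else {
                body.append(c);
                i++;
            }
        }
        throw new IllegalArgumentException("unterminated literal");
    }

    public static void main(String[] args) {
        System.out.println(scan("'''it's here''' .", 0, "'''"));
    }
}
```

StringBuilder grows its backing array geometrically, so the amount of copying stays roughly proportional to the literal's length.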

new-javacc avatar Jul 07 '22 14:07 new-javacc

in this case it's a single doublequoted wktLiteral or gmlLiteral automatically injected by Jena into a SERVICE clause as outlined in my initial report

and the bug is JavaCC not defaulting to exponential buffer expansion

this deficiency is basically able to Denial-of-Service a Fuseki server just with a simple SPARQL query

SimonBin avatar Jul 07 '22 15:07 SimonBin

Still the same idea works. Just change it to single quote! And if you have escape, add that also as a MORE rule. See examples in the repo

new-javacc avatar Jul 07 '22 15:07 new-javacc

@SimonBin The contents of the literal (XML use of " or ') is up to the application.

afs avatar Jul 07 '22 21:07 afs

@new-javacc In https://github.com/javacc/javacc/pull/85 the approach is a level of indirection but with the advantage that there is no contents copy.

The current reallocation strategy in javacc is char[] newbuffer = new char[bufsize + 2048];

Following the example of ArrayList.newCapacity:

   int newSize = bufsize == 2048 ? bufsize+2048 : (bufsize+(bufsize>>1));
   char[] newbuffer = new char[newSize];

This does not have the indirection but does have a single contiguous buffer (the backup concern on javacc/javacc#85) and the (current) cost of a copy on reallocation. It preserves the current usage experience for small buffers, grows faster beyond 6144 bytes (4096+2048), and does less copying at large scale.
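
A small sketch to compare the two growth rules side by side. The class and method names are made up for illustration; only the two size formulas come from the snippets above.

```java
/** Hypothetical helper comparing the two buffer growth rules discussed above. */
final class BufferGrowth {
    static final int STEP = 2048;

    /** Current rule: always grow by a fixed 2048. */
    static int linear(int bufsize) {
        return bufsize + STEP;
    }

    /** Proposed rule: +2048 from the initial size, then x1.5 (ArrayList-style). */
    static int geometric(int bufsize) {
        return bufsize == STEP ? bufsize + STEP : bufsize + (bufsize >> 1);
    }

    public static void main(String[] args) {
        int a = STEP;
        int b = STEP;
        // Print the first few capacities under each rule.
        for (int i = 0; i < 6; i++) {
            System.out.println(a + "\t" + b);
            a = linear(a);
            b = geometric(b);
        }
    }
}
```

Both rules give 2048, 4096, 6144 for the first three capacities; they diverge at the next step (8192 versus 9216).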

afs avatar Jul 07 '22 21:07 afs

Sure you can use that as a user charstream impl.

new-javacc avatar Jul 07 '22 21:07 new-javacc

@SimonBin The contents of the literal (XML use of " or ') is up to the application.

Still fine. You need a rule to figure out when to stop the literal. The idea is the same for any token that's delimited by markers

new-javacc avatar Jul 07 '22 21:07 new-javacc

@new-javacc why don't you fix this bug in javacc?

SimonBin avatar Jul 08 '22 05:07 SimonBin

Because it's not a bug :) this is a corner case as most tokens are on average 3-4 chars long so the buffer actually should never need expansion. And this elegant way of doing delimited literals is more efficient.

new-javacc avatar Jul 08 '22 05:07 new-javacc

@new-javacc thank you for the suggestion for rewriting the grammar but it does not address the original report which is the buffer reallocation which is https://github.com/javacc/javacc/pull/85.

For a 1Mbyte buffer, the current JavaCC strategy does 267,911,168 bytes of copying (copy the buffer on every 2K growth).

Changing to a growth of 1.5, there are 2,385,340 bytes of copying. It behaves the same as current JavaCC up to 4096.
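
For anyone who wants to check these figures, here is a short sketch of the calculation. It assumes the buffer starts at 2048, that every expansion copies the whole old buffer, and that growth stops once the capacity reaches 1 MiB (1,048,576); `CopyCost` and `copied` are names invented for the example.

```java
/** Hypothetical calculator for the total amount copied during buffer growth. */
final class CopyCost {

    /**
     * Sums the size of the old buffer at every expansion, starting from
     * 'initial' and growing with 'grow' until the capacity reaches 'target'.
     */
    static long copied(int initial, int target, java.util.function.IntUnaryOperator grow) {
        long total = 0;
        int size = initial;
        while (size < target) {
            total += size;                 // the whole old buffer is copied
            size = grow.applyAsInt(size);
        }
        return total;
    }

    public static void main(String[] args) {
        int target = 1 << 20;              // 1 MiB
        long linear = copied(2048, target, s -> s + 2048);
        long geometric = copied(2048, target, s -> s == 2048 ? s + 2048 : s + (s >> 1));
        System.out.println(linear);        // 267911168
        System.out.println(geometric);     // 2385340
    }
}
```

The linear rule needs 511 expansions to reach 1 MiB, while the x1.5 rule needs 15, which is where the difference of roughly two orders of magnitude comes from.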

As this would benefit more than just this project, I've added the possibility to https://github.com/javacc/javacc/pull/85.

@SimonBin - could you please get some VisualVM/YourKit performance hotspot figures to show the time that the code is in buffer allocation?

afs avatar Jul 08 '22 08:07 afs

Here is an excerpt from the VisualVM where you can see that basically all the time is spent in ExpandBuff, 1 minute to just parse a mere 10MB

image

estonia.jfr

If I try to use this with Australia's border, which is only 3.5 times as big, the time requirement just to parse increases to 14 minutes running at 120% CPU all the time

We could never finish parsing the 100MB literal:

image

SimonBin avatar Jul 08 '22 12:07 SimonBin

Thank you - a grammar change may help but ExpandBuff reading more characters is dominant.

afs avatar Jul 08 '22 12:07 afs

Just to clarify - the grammar change eliminates the ExpandBuff call!! I will revisit the charstream fix if I can see tests showing that it doesn't impact more common situations.

new-javacc avatar Jul 08 '22 13:07 new-javacc

A variant which copes with the escapes may fix the problem - for this project only.

We can add it (the main.jj grammar is in fact two grammars controlled by cpp) - as long as the SPARQL 1.1 grammar isn't touched.

afs avatar Jul 08 '22 16:07 afs

Can you point the rule and the grammar file so I can test/validate it myself.

new-javacc avatar Jul 08 '22 16:07 new-javacc

https://github.com/apache/jena/blob/3210f8b6096b5e13bf4e1b71803c262dea1703c8/jena-arq/Grammar/main.jj#L2713

There is a lot of history here! It also has to align with the RDF data syntax Turtle.

SPARQL has 4 string forms: 2 single quoted (using either " or '), 2 triple-quoted, multiline (using either """ or ''').

The grammar is the ifdef's for ARQ.

| < STRING_LITERAL_LONG1:
     <QUOTE_3S> 
      ( ("'" | "''")? (~["'","\\"] | <ECHAR>  | <UCHAR> ))*
     <QUOTE_3S> >

where <ECHAR: "\\" ( "t"|"b"|"n"|"r"|"f"|"\\"|"\""|"'") > and UCHAR is \u and \U hex escapes (done by JavaCC only in SPARQL 1.1 form, not the ARQ).

| < #UCHAR:      <UCHAR4> | <UCHAR8> >
| < #UCHAR4:     "\\" "u" <HEX> <HEX> <HEX> <HEX> >
| < #UCHAR8:     "\\" "U" <HEX> <HEX> <HEX> <HEX> <HEX> <HEX> <HEX> <HEX> >

afs avatar Jul 08 '22 17:07 afs

So yeah the MORE pattern still keeps the whole thing in memory :( so I normally just use SKIP and collect the image myself into a buffer, like the attached grammar, which works well with the default and is independent of the charstream logic/buffering. Some of these idioms were developed in 1996 for Java 1.0 with 32MB RAM machines and mostly desktop apps lol so yeah time for updating them.

Anyway, the chunking charstream is not well tested for correctness or performance so until that happens maybe you can simply get that and use it as a USER_CHAR_STREAM (unless you use my SKIP version).

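TOKEN_MGR_DECLS:
{
  // Collects the literal body outside the char stream's buffer.
  static StringBuilder sb = new StringBuilder();
}

SKIP:
{
   // Opening delimiter: reset the buffer, then switch to the literal body state.
   < STRING_LITERAL_BEGIN: "'''"> { sb.setLength(0); } : LIT_BODY
}

<LIT_BODY> TOKEN:
{
     // Closing delimiter: the token image becomes the collected body.
     <STRING_LITERAL_LONG1: "'''"> { matchedToken.image = sb.toString(); } : DEFAULT
}

<LIT_BODY> SKIP:
{
   // Any other character is appended to the body.
   < ~[]> { sb.append(image); }
}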

new-javacc avatar Jul 09 '22 17:07 new-javacc