Return of OOM
We've dealt with OOM errors before, most prominently in #157 (which was closed, re-opened, and then closed again). The last time it was closed, we had resolved the issue in #172 by removing @fork='true' from the java calls. But it looks like @fork was added back because of the security manager warnings in #247.
I've just hit this again in a project. Trying to increase memory (as we suggest in the docs)
export ANT_OPTS="-Xmx6g"; ant
still causes the build to fail for this project with OOM errors on the json step after ~15 minutes or so. However, removing @fork from the java call in json doesn't raise any SecurityManager warnings for me, and the step completes quickly.
I'm wondering whether the security manager issues are still relevant. If so, maybe we could add a fork property to the build so that users who need to fork the process to get a larger memory allocation can do so.
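Something along these lines would probably do it; this is only a sketch, and the property name, default, Saxon classname, and arguments are illustrative rather than what's actually in the build file:

```xml
<!-- Sketch only: fork the JVM only when the user asks for it, e.g. ant -DjavaFork=true -->
<property name="javaFork" value="false"/>

<java classname="net.sf.saxon.Transform" fork="${javaFork}" failonerror="true">
  <!-- placeholder arguments; the real call would pass the project's stylesheet and source -->
  <arg value="-xsl:json.xsl"/>
  <arg value="-s:config.xml"/>
</java>
```

With a default of false, only projects that actually need a separately sized heap would opt in to forking.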
It may be useful to note that I'm running Ant 1.10.14, which is the version cited in this comment about these warnings:
https://bz.apache.org/bugzilla/show_bug.cgi?id=65381#c10
Just tried the same on Jenkins and saw those security manager warnings. If I'm understanding things correctly, the warnings derive from versions of Java < 18: Ant 1.10.14 and above requires Java 18+, and Jenkins uses Java 17, which is why there's an issue there.
Fix committed -- do we need to document this?
Jenkins is currently using Java 21, incidentally.
I'm bringing this issue back, because I'm now seeing OOM errors in the json target for the BBTI project (which ends up needing to create 177,000+ JSON files for 135,000+ tokenized documents). I notice that @fork="true" is still there on the json target java call, so I'm currently testing removing that.
Interesting... it looks like the javaFork param was added in https://github.com/projectEndings/staticSearch/commit/9cc071c8c414751915353b873d42830a25692eca, but it was never set for the actual target.
I've now tested removing fork="true" and also adding up to 4096m of memory to the java call, and I'm still getting the OOM error. This is a project on which staticSearch 1.4 works perfectly, so we should perhaps investigate further and see if there's anything about the rewritten json.xsl that is using extra memory.
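One wrinkle worth keeping in mind while testing: Ant only honours per-call memory settings on <java> when the call is forked, so (roughly, with illustrative values) the 4096m setting only takes effect in something like this:

```xml
<!-- maxmemory (or an equivalent <jvmarg>) applies only to a forked JVM;
     with fork disabled it is ignored and the step runs in Ant's own JVM,
     whose heap is whatever ANT_OPTS provided. -->
<java classname="net.sf.saxon.Transform" fork="true" maxmemory="4096m" failonerror="true">
  <arg value="-xsl:json.xsl"/>
</java>
```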
Our first theory is that switching from indented JSON output to non-indented might be causing the problem because of the way Saxon serializes output.
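If that theory holds, the relevant switch would be the serialization settings in json.xsl, along these lines (a sketch; the real stylesheet may set these on xsl:result-document or via a named output definition):

```xml
<!-- Compact output, as the rewritten stylesheet is thought to produce;
     changing indent to "yes" would restore the older pretty-printed form. -->
<xsl:output method="json" indent="no"/>
```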
@joeytakeda Per Liam Quin on XML Slack:
> If you're using an accumulator over a collection, beware of storing anything from individual documents as that will force the entire document in each case to stay in memory
Could this be the problem? Could we store values or copies instead of items?
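For context, the difference Liam is describing would look roughly like this in accumulator terms; the element and attribute names below are placeholders, not what json.xsl actually matches:

```xml
<!-- Keeping the matched node itself pins its whole source document in memory: -->
<xsl:accumulator name="stems" initial-value="()">
  <xsl:accumulator-rule match="span[@data-stem]" select="($value, .)"/>
</xsl:accumulator>

<!-- Keeping only the atomized string lets the document be released: -->
<xsl:accumulator name="stemValues" initial-value="()">
  <xsl:accumulator-rule match="span[@data-stem]"
                        select="($value, string(@data-stem))"/>
</xsl:accumulator>
```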
Doesn't look like any pointers to nodes are being kept in the accumulator, just copies of strings, so that's not it. The only real differences between the 1.4 release branch and the dev branch are the change from document-uri() to base-uri() and the change from Saxon 10 to Saxon 12. Action on MDH: first, try upgrading Saxon to the latest release in dev, to see if perhaps a memory leak has been fixed in the last year; failing that, test in a branch of release-1.4 to see if switching to Saxon 12 and base-uri() reproduces the problem there, in which case we might migrate back to Saxon 10.
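For reference, the first of those two changes is just a swap of one URI function for another, something like this (the variable name and context are illustrative):

```xml
<!-- release-1.4 branch: -->
<xsl:variable name="docUri" select="document-uri(.)"/>
<!-- dev branch: -->
<xsl:variable name="docUri" select="base-uri(.)"/>
```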
@joeytakeda Well, here's a thing: I switched in Saxon 10, and with no other changes, the build works fine.
So this is a problem with Saxon 12. Next I'll try switching in the latest version of Saxon 12, in case they've fixed a memory leak in the last year.
Well, it seems to work fine in Saxon 12.8, the latest version. So I'm going to move the dev branch forward to 12.8.
Working on this in branch iss-323-update-saxon so I can test it properly -- a quick test caused an exception in the xml resolver, so I want to be a bit cautious.
OK, tests are working in the branch. My next test will be to re-pull this branch's content to replace the BBTI staticSearch folder, and to run the BBTI build again. If that works OK, we're done and I'll create a PR.
Turns out it fails in 12.8 too. The success I thought I had may have been a fluke or an illusion. Testing again with Saxon 10 now.
The build works reliably with Saxon 10, but fails reliably (with one exception, which might have been a fluke) with Saxon 12.5 or 12.8.
@joeytakeda I vote we revert to Saxon 10. There's nothing wrong with it, and 12 is obviously less memory-efficient, and has been so for at least a year and four point releases.
Really interesting! Thanks for testing this, Martin. I'm curious whether 11 would have the same issues, since it looks like Saxon 12 did introduce a number of optimizations, which could possibly be the source of the problem: https://www.saxonica.com/html/documentation12/changes/v12/optimization.html
In any case, I feel mildly uneasy about reverting two major versions, if only because it means we'll have to check every subsequent major version to make sure these memory issues (which are arguably a bit of an edge case) don't recur (especially since we may want to upgrade if, e.g., #300 is resolved).
An alternative would be for projects that hit this memory issue to pass their own Saxon to staticSearch (e.g. `ant -DssSaxon=path/to/saxon10.jar`). That property is already there; we could document it a bit better and confirm that it's usable.
What do you think?
@joeytakeda I agree on testing Saxon 11, and the results will be interesting to see. If there were any history of vulnerabilities in Saxon 10 I would also be nervous about reverting, but there have never been any to my knowledge, and lots of other projects still use it. What I would like to do is to figure out whether what we're seeing is an actual bug in Saxon, or whether recent versions of Saxon have just decided to sacrifice memory efficiency for some other gains deemed more useful.
The other question, of course, is how many other large projects will hit the same problem. I haven't yet tested projects like LEMDO, MoEML, and ColDesp (mostly because they're on svn so it's harder to play around in a branch). But I think that's the next thing to try.
@joeytakeda Tested with Saxon 11.7, and it works fine. So whatever the problem is, it was introduced with Saxon 12. Since 11.7 is still supported, we could move back to that, and raise an issue on the Saxon repo to see if there actually is some kind of bug that could be diagnosed and fixed.
I've now run into this myself: a project with 276 documents (40,002 token files) raised OOM errors once I went to v2. But switching the Saxon used in the antcall from staticSearch's (12) to the local one (10) works perfectly well, so I think we're right that it's a 12 issue. I'm in favour of switching to 11.
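For anyone who wants to replicate that workaround, it amounts to pointing the existing ssSaxon property at a local Saxon jar when the project build calls staticSearch; roughly like this, where everything except ssSaxon (the antfile location, target name, and jar path) is illustrative:

```xml
<!-- Sketch: calling the staticSearch build with a locally supplied Saxon jar. -->
<ant antfile="${staticSearchDir}/build.xml" target="all">
  <property name="ssSaxon" value="path/to/saxon10.jar"/>
</ant>
```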
Let's do this tomorrow, then.
276 documents is a very small number, relatively speaking.
Opening a new temporary ticket (#361) for this rollback to Saxon 11; then when we're comfortable that's working OK in dev, we can return to this ticket and start figuring out what's going wrong in Saxon 12.