staticSearch icon indicating copy to clipboard operation
staticSearch copied to clipboard

Return of OOM

Open joeytakeda opened this issue 1 year ago • 19 comments

We've dealt with OOM errors before, most prominently documented here in #157 (which was closed and re-opened and then closed again). The last time it was closed we had resolved the issue in #172 by removing @fork='true' from the java calls. But it looks like @fork was added back because of security manager warnings in #247 .

I've just hit this again in a project. Trying to increase memory (as we suggest in the docs)

export ANT_OPTS="-Xmx6g"; ant 

still causes the build to fail for this project with OOM issues on the json step after ~ 15 mins or so. However, removing @fork from the java call in json does not throw any SecurityManager issues for me and completes really quickly.

I'm wondering if the security manager issues are still relevant? If so, maybe we could add a fork property to the build so that users needing to fork the processes to access higher memory allocations can do so?

joeytakeda avatar Oct 25 '24 23:10 joeytakeda

May be useful to note that I am running ant 1.10.14, which is the version cited in this comment about these warnings:

https://bz.apache.org/bugzilla/show_bug.cgi?id=65381#c10

joeytakeda avatar Oct 25 '24 23:10 joeytakeda

Just tried the same on Jenkins and saw those security manager warnings. I think if I'm understanding things correctly, those security manager warnings derive from versions of Java < 18; ant 1.10.14 and above requires Java 18+; Jenkins uses Java 17, which is why there's an issue there.

joeytakeda avatar Nov 09 '24 18:11 joeytakeda

Fix committed -- do we need to document this?

martindholmes avatar Nov 15 '24 18:11 martindholmes

Jenkins is currently using Java 21, incidentally.

martindholmes avatar Jan 06 '25 16:01 martindholmes

I'm bringing this issue back, because I'm now seeing OOM errors in the json target for the BBTI project (which ends up needing to create 177,000+ JSON files for 135,000+ tokenized documents). I notice that @fork="true" is still there on the json target java call, so I'm currently testing removing that.

martindholmes avatar Feb 27 '25 18:02 martindholmes

Interesting....it looks like the javaFork param was added https://github.com/projectEndings/staticSearch/commit/9cc071c8c414751915353b873d42830a25692eca but it was never set for the actual target.

joeytakeda avatar Feb 27 '25 19:02 joeytakeda

I've now tested removing fork="true" and also adding up to 4096m of memory to the java call, and I'm still getting the OOM error. This is a project on which staticSearch 1.4 works perfectly, so we should perhaps investigate further and see if there's anything about the rewritten json.xsl that is using extra memory.

martindholmes avatar Feb 27 '25 19:02 martindholmes

Our first theory is that switching from indented JSON output to non-indented might be causing the problem because of the way Saxon serializes output.

martindholmes avatar Mar 06 '25 18:03 martindholmes

@joeytakeda Per Liam Quin on XML Slack:

If you're using an accumulator over a collection, beware of storing anything from individual documents as that will force the entire document in each case to stay in memory

Could this be the problem? Could we store values or copies instead of items?

martindholmes avatar Jul 12 '25 04:07 martindholmes

Doesn't look like any pointers to nodes are being kept in the accumulator, just copies of strings, so that's not it. The only real differences between the 1.4 release branch and the dev branch are the change from document-uri() to base-uri() and the change from Saxon 10 to Saxon 12. Action on MDH to first try upgrading Saxon to the latest release in dev to see if perhaps a memory leak has been fixed in the lat year; failing that, test in a branch of release-1.4 to see if switching to Saxon 12 and base-uri() reproduces the problem there, in which case we might migrate back to Saxon 10.

martindholmes avatar Aug 15 '25 17:08 martindholmes

@joeytakeda Well, here's a thing: I switched in Saxon 10, and with no other changes, the build works fine.

So this is a problem with Saxon 12. Next I'll try switching in the latest version of Saxon 12, in case they've fixed a memory leak in the last year.

martindholmes avatar Aug 15 '25 22:08 martindholmes

Well, it seems to work fine in Saxon 12.8, the latest version. So I'm going to move the dev branch forward to 12.8.

martindholmes avatar Aug 15 '25 23:08 martindholmes

Working on this in branch iss-323-update-saxon so I can test it properly -- a quick test caused an exception in the xml resolver, so I want to be a bit cautious.

martindholmes avatar Aug 15 '25 23:08 martindholmes

OK, tests are working in the branch. My next test will be to re-pull this branch's content to replace the BBTI staticSearch folder, and to run the BBTI build again. If that works OK, we're done and I'll create a PR.

martindholmes avatar Aug 15 '25 23:08 martindholmes

Turns out it fails in 12.8 too. The success I thought I had may have been a fluke or an illusion. Testing again with Saxon 10 now.

martindholmes avatar Aug 16 '25 23:08 martindholmes

The build works reliably with Saxon 10, but fails reliably (with one exception, which might have been a fluke) with Saxon 12.5 or 12.8.

@joeytakeda I vote we revert to Saxon 10. There's nothing wrong with it, and 12 obviously is less memory-efficient, and has been so for at least a year and four point releases.

martindholmes avatar Aug 17 '25 05:08 martindholmes

Really interesting! Thanks for testing this, Martin. I'm curious if 11 would have the same issues, since it looks like Saxon 12 did introduce a number of optimizations, which could possible be the source of these issues? https://www.saxonica.com/html/documentation12/changes/v12/optimization.html

In any case, I feel mildly uneasy reverting back two major versions, if only because it means we'll have to ensure that these memory issues (which are arguably a bit of an edge case) don't happen for every other major version (especially since we may want to upgrade if, e.g. #300 was resolved).

An alternative solution could be that projects encountering this memory issue may want to pass their own Saxon to staticSearch (e.g. ant -DssSaxon=path/to/saxon10.jar). That property is already there, and we could document it a bit better and confirm that it's usable.

What do you think?

joeytakeda avatar Aug 18 '25 05:08 joeytakeda

@joeytakeda I agree on testing Saxon 11, and the results will be interesting to see. If there were any history of vulnerabilities in Saxon 10 I would also be nervous about reverting, but there have never been any to my knowledge, and lots of other projects still use it. What I would like to do is to figure out whether what we're seeing is an actual bug in Saxon, or whether recent versions of Saxon have just decided to sacrifice memory efficiency for some other gains deemed more useful.

The other question, of course, is how many other large projects will hit the same problem. I haven't yet tested projects like LEMDO, MoEML, and ColDesp (mostly because they're on svn so it's harder to play around in a branch). But I think that's the next thing to try.

martindholmes avatar Aug 25 '25 22:08 martindholmes

@joeytakeda Tested with Saxon 11.7, and it works fine. So whatever the problem is, it was introduced with Saxon 12. Since 11.7 is still supported, we could move back to that, and raise an issue on the Saxon repo to see if there actually is some kind of bug that could be diagnosed and fixed.

martindholmes avatar Sep 01 '25 04:09 martindholmes

I've now run into this myself: a project with 276 documents (40002 token files) raised OOM errors once I went to v2. But switching the Saxon from staticSearch's (12) to the local one (10) in the antcall works perfectly well, so I think we're right that it's a 12 issue. I'm in favour of switching to 11.

joeytakeda avatar Nov 21 '25 21:11 joeytakeda

Let's do this tomorrow, then.

276 documents is a very small number, relatively speaking.

martindholmes avatar Nov 25 '25 00:11 martindholmes

Opening a new temporary ticket (#361) for this rollback to Saxon 11; then when we're comfortable that's working OK in dev, we can return to this ticket and start figuring out what's going wrong in Saxon 12.

martindholmes avatar Nov 25 '25 18:11 martindholmes