regal icon indicating copy to clipboard operation
regal copied to clipboard

Dig into memory issues

Open alysbrooks opened this issue 3 years ago • 9 comments

Running the test suite on Java 9 used to fail until I manually specified the Serial GC, which can have higher throughput at the cost of more latency, (see #31). While this appears to fix the issue in the short run, our test suites probably shouldn't be creating so much garbage that they trigger the OOM killer. Our CircleCI test runners have 4GB of memory.

It's probably worth confirming that the memory usage is indeed the culprit. It definitely appears to be, but we haven't actually measured memory usage much yet.

alysbrooks avatar Jul 07 '21 22:07 alysbrooks

I decided to do some initial usage measurement. On my machine with Java 9 and the process limited to two cores, it doesn't seem to use much memory.

Screenshot showing a graph of heap memory ranging from about 500 MB to 100 MB, with a total committed of just over 1 GB and a max of about 4.2 GB

We are creating quite a bit of garbage, so it's plausible that a test runner less powerful than my desktop could fall behind. This is what it looks like when I use the Epsilon GC, which does no garbage collection: Screenshot showing a sharply increasing graph of heap memory usage, from 5 to 10 GB

alysbrooks avatar Jul 07 '21 22:07 alysbrooks

I have tried some tricks and done some experiments:

  • I run on macOS with 16G memory/java8/clojure 1.10. It always works.
  • I run on ubuntu 16 with 4G memory/java8/clojure 1.10. It sometimes fails.

I tried to get GC statistics and use gceasy to find some clues. The command I used is

clojure -J-XX:+UseSerialGC -J-XX:+PrintGCDetails -J-XX:+PrintGCDateStamps -J-Xloggc:./gc-log.txt  -A:dev:test:test-check:cljs:instaparse:malli   -Sdeps '{:deps {org.clojure/clojure {:mvn/version "1.10.2"}
                  lambdaisland/kaocha-junit-xml {:mvn/version "RELEASE"}
                  lambdaisland/kaocha-cloverage {:mvn/version "RELEASE"}}}'   -m kaocha.runner   --plugin profiling   --plugin capture-output   --plugin junit-xml   --plugin kaocha.plugin.alpha/info   --print-env   --junit-xml-file test-results/kaocha/$(date +"%s").xml   --reporter documentation --plugin cloverage --codecov

The following part can generate GC logs.

-J-XX:+PrintGCDetails -J-XX:+PrintGCDateStamps -J-Xloggc:./gc-log.txt

Part of the GC report of my macOS experiment is here. (The graphs of successful experiment)

螢幕快照 2021-09-19 上午1 59 59 螢幕快照 2021-09-19 上午1 40 41

humorless avatar Sep 18 '21 17:09 humorless

Part of the GC report of my ubuntu experiment is here. (The graphs of failed experiment)

螢幕快照 2021-09-19 上午1 55 02 螢幕快照 2021-09-19 上午1 55 29

humorless avatar Sep 18 '21 18:09 humorless

From the graph, I think that the problem is caused by the default heap size: only 1G.

I changed it as 2G with -J-Xmx2g, and I tried with the command

clojure -J-XX:+UseSerialGC -J-Xmx2g -J-XX:+PrintGCDetails -J-XX:+PrintGCDateStamps -J-Xloggc:./gc-log.txt  -A:dev:test:test-check:cljs:instaparse:malli   -Sdeps '{:deps {org.clojure/clojure {:mvn/version "1.10.2"}
                  lambdaisland/kaocha-junit-xml {:mvn/version "RELEASE"}
                  lambdaisland/kaocha-cloverage {:mvn/version "RELEASE"}}}'   -m kaocha.runner   --plugin profiling   --plugin capture-output   --plugin junit-xml   --plugin kaocha.plugin.alpha/info   --print-env   --junit-xml-file test-results/kaocha/$(date +"%s").xml   --reporter documentation --plugin cloverage --codecov

Then, this time the GC report of ubuntu case looks like

螢幕快照 2021-09-19 上午2 23 17 螢幕快照 2021-09-19 上午2 23 34

humorless avatar Sep 18 '21 18:09 humorless

Thanks, @humorless! for looking into it. I suspect our runners might have had a similar issue as you did on the 4G system—I think the JVM caps the heap size to 25 percent of available memory.

alysbrooks avatar Sep 20 '21 13:09 alysbrooks

I open a PR to change default heap size, but there is still one test not working.

humorless avatar Sep 20 '21 14:09 humorless

I still want to keep this issue open, since I'm not sure our tests should be creating so much garbage. However, I think the urgency is lessened with @humorless' change, which should further reduce the odds of running out of memory.

alysbrooks avatar Sep 20 '21 15:09 alysbrooks

I fell down a rabbit hole and wrote a little memory profiler plugin for Kaocha. Generated tests are creating quite a lot of garbage, as expected.

:lambdaisland.regal.spec-gen-test                                                             4575.99MB (null)
:lambdaisland.regal.spec-gen-test/round-trip-test                                               79.69MB (lambdaisland/regal/spec_gen_test.clj)
:lambdaisland.regal.spec-gen-test/generated-forms-can-be-converted                               1.71GB (lambdaisland/regal/spec_gen_test.clj)
:lambdaisland.regal.spec-gen-test/round-trip-property                                            2.79GB (lambdaisland/regal/spec_gen_test.clj)

alysbrooks avatar Sep 21 '21 03:09 alysbrooks

This doesn't seem to be causing CI issues after I pushed a fix (and we dropped Java 8/9 anyway), so this is a pretty low priority (and not a good Clojurists Together candidate).

alysbrooks avatar Oct 28 '22 21:10 alysbrooks

I think it would be good to reduce Regal's memory usage in general or, if the problem is specific to how we're testing it, fix the tests so they don't use so much memory. However, I'm not sure the current level is problematic for most actual usage. I suppose it's possible some applications construct regular expressions on the fly (perhaps to do filtering requested by an end user) and Regal would probably be an appealing way to construct those. That's the main circumstance I would expect a library user to be creating and then destroying a large number of regular expressions.

alysbrooks avatar Jul 14 '23 21:07 alysbrooks

Probably worth noting that Regal doesn't really have bespoke types that would use a lot of memory. It deals with forms (nested vectors), strings, and compiled regular expressions. So the problem is probably the translation rather than any of the primary data types Regal uses being inefficient. That's important, IMO, because if the format Regal forms are normally stored in take up a lot of memory, that can't be freed, whereas inefficiencies in the intermediate forms is a garbage collector problem (one the JVM seems to handle fairly well after version 9 if you ensure the heap is a bit larger than the default).

My other thought is that I don't know how much overhead generators produce. That might also be exacerbating the problem somewhat. Generators are also a bit tricky to reason about. My sense is that we're generating perhaps 100x more regular expressions than most programs and they're about 10x as long and complicated, but I could see a certain subset of real-world programs actually using that many expressions. However, it's possible that we're closer to 1000x as many expressions and 100x as complicated, which might mean normal usage of regular expressions is negligible and intensive usage isn't usually problematic.

Note that since I'm talking about memory, I'm not talking about usage, just the complexity and quantity of compiled expressions. Generating 1000 moderately complex expressions you never use is (in this context) worse than having 10 pre-compiled expressions you apply 100 times/second.

alysbrooks avatar Jul 14 '23 21:07 alysbrooks

Ultimately, I plan to close this since the main issue (failing CI runs) has been addressed.

alysbrooks avatar Jul 14 '23 21:07 alysbrooks