ocaml icon indicating copy to clipboard operation
ocaml copied to clipboard

Performance Regression: GC.stat went from < 1s to ~50s in 4.14 -> 5.0

Open stroxler opened this issue 2 years ago • 7 comments
trafficstars

When migrating to Ocaml 5.0.0, the Pyre team encountered a major slowdown in Gc.stat performance when we were type checking a large Python codebases. In that specific run it went from taking well under 1 second to about 50.

We moved the call - which was used for telemetry - behind a --debug flag as a mitigation: https://github.com/facebook/pyre-check/commit/91eba63147d202fd272c15d01da0d39c3c1a71fc

For our use case it turns out we could use Gc.quick_stat which is better, but it was still suggested in https://discuss.ocaml.org/t/ocaml-5-0-0-in-pyre-mostly-smooth-sailing-gc-stat-regression/11205 that we file a ticket in case the magnitude of the slowdown comes as a surprise.

It appeared to me that when type-checking smaller codebases there was no visible regression, so I think that the ocaml process has to be using quite a bit of ram (likely several GB) before the problem appears. Unfortunately there is no open-source codebase I know of large enough to reproduce the slowdown inside of Pyre.

stroxler avatar Jan 18 '23 22:01 stroxler

Thanks for the report. It was good to know that the migration to 5.0.0 went well except for this one regression.

Going from 1s to 50s is worrying. It is a pity that there isn't open-source code that exhibits the issue.

IINM there is no parallelism yet, and hence, this is not related to #11589 and #11548. Gc.stat does 3 full major cycles and then does a Gc.quick_stat. Given the 50x slowdown, is it possible that the GC speed is so slow that the program ends up using most of the RAM and swap memory is being used?

CCing @sadiqj @damiendoligez who may be interested in this issue.

kayceesrk avatar Jan 19 '23 03:01 kayceesrk

There is indeed no (thread-based) parallelism.

We make heavy use of multiprocessing with a relatively primitive shared memory implementation (based on serializing all Ocaml values before passing to a lock-free C hash table) but the Gc.stat call is only in the main process at program exit, by that point all workers have already exited.

I don't think that it's using swap. Aside from Gc.stat there was only a minimal performance change (a slowdown of something like 5%, which is very likely just due to slight changes in multiprocessing overhead).

I could do a more explicit check if that seems like the most likely diagnosis

stroxler avatar Jan 19 '23 12:01 stroxler

What would be great is getting a trace from the GC when it hits Gc.stat. You can do this by using olly from https://github.com/sadiqj/runtime_events_tools

Alternatively if you've got some steps I can run to reproduce it, I can grab one.

sadiqj avatar Jan 19 '23 13:01 sadiqj

Hey, sorry for the delay on this, I had some configuration issues getting these tools set on my internal server.

Here's the output of an olly trace of the relevant type check (this is running against current Pyre, which doesn't include the Gc.stat call; if you need a trace where Gc.stat is enabled I can get that for you as well): https://gist.github.com/stroxler/58a6a76d651728ecdc76a4f634dff089

It's not valid json; as far as I can tell the olly output has a trailing comma and no closing ], is that expected?

stroxler avatar Jan 23 '23 21:01 stroxler

Thanks Steven. Yes, the trace isn't valid json but that's fine for chrome tracing viewer or perfetto, they'll load it fine.

It seems there are major slices hitting 10+ seconds. Do you know how big the major heap gets for this?

On 5.0 Gc.stat requires cycling the major heap enough that all garbage is actually swept. For large heaps this could be quite expensive. Still, if you find an open source codebase or know how to synthesize one that would trigger similar behaviour from Pyre - it would be interesting to have a look at where the major collector is spending all of it's time.

Since you're spending a significant chunk of your time marking, you will probably benefit from https://github.com/ocaml/ocaml/pull/11827 , which should be merged soon.

One other thought. 4.14 does less work for Gc.stat than 5.0 - should be at most an order of magnitude though. Are you aware of significant memory usage differences between 4.14 and 5.0? Do you make much use of ephemerons/weak?

sadiqj avatar Jan 24 '23 09:01 sadiqj

This issue has been open one year with no activity and without being identified as a bug. Consequently, it is being marked with the "stale" label to see if anyone has comments that provide new information on the issue. Is the issue still reproducible? Did it appear in other contexts? How critical is it? Did you miss this feature in a new setting? etc. Note that the issue will not be closed in the absence of new activity: the "stale" label is only used for triaging reason.

github-actions[bot] avatar May 30 '25 04:05 github-actions[bot]

@stroxler could you provide an update on whether https://github.com/ocaml/ocaml/pull/11827 improved things. If possible re-run the original test on a newer version of OCaml, or provide reproduction steps.

tmcgilchrist avatar Jun 09 '25 23:06 tmcgilchrist