
Idle Java threads fill the entire timeline

Open mcomella opened this issue 3 years ago • 6 comments

See also https://bugzilla.mozilla.org/show_bug.cgi?id=1766532 for required platform changes.

Java threads other than the main thread unexpectedly fill the timeline while they are idle:

[Screenshot: idle Java threads filling the timeline]

See this profile: https://share.firefox.dev/3vOqsBy

We should do what we already do for the Java main thread, where the idle call doesn't appear in the timeline. The two methods that seem to trigger this are:

  • java.lang.Object.wait
  • org.mozilla.gecko.mozglue.GeckoLoader.nativeRun

One concern is that there may be legitimate uses of Object.wait that we want to see in the timeline. However, Object.wait is discouraged in favor of higher-level concurrency APIs, so this is probably okay.
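A minimal sketch of the proposed rule, assuming the sampler sees each Java stack as a list of fully qualified method names ordered from root to leaf. The class `IdleCategorizer`, the category strings, and the choice to inspect only the leaf frame are hypothetical illustrations, not the profiler's actual code.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper: assigns a category to one sampled Java stack.
final class IdleCategorizer {
    // Methods that, when they are the innermost (leaf) frame, are treated as "waiting".
    static final Set<String> IDLE_LEAF_METHODS = Set.of(
            "java.lang.Object.wait",
            "org.mozilla.gecko.mozglue.GeckoLoader.nativeRun");

    // stack is ordered from the root frame (index 0) to the leaf frame (last index).
    static String categorize(List<String> stack) {
        if (stack.isEmpty()) {
            return "Other";
        }
        String leaf = stack.get(stack.size() - 1);
        return IDLE_LEAF_METHODS.contains(leaf) ? "Idle" : "Other";
    }
}
```

Because the rule only looks at the leaf frame, every Object.wait call is categorized as Idle, including the legitimate uses mentioned above.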


mcomella, Apr 26 '22 21:04

For C++ threads, we fixed this problem by measuring the elapsed CPU time between samples and making the graph CPU-usage aware. If our Java sampler could also record elapsed CPU time, this might be a good way to fix it.
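A rough sketch of the CPU-usage-aware idea, assuming each sample records a wall-clock timestamp and the thread's cumulative CPU time in nanoseconds. The class and method names are hypothetical. `ThreadMXBean` is one real source of per-thread CPU time on a desktop JVM, but `java.lang.management` is not available on Android, so the actual Java sampler would need a different source of per-thread CPU time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derive per-interval CPU usage from cumulative per-thread CPU time.
final class CpuUsage {
    // One sample: wall-clock timestamp and the thread's cumulative CPU time, both in nanoseconds.
    static final class Sample {
        final long wallNanos;
        final long cpuNanos;

        Sample(long wallNanos, long cpuNanos) {
            this.wallNanos = wallNanos;
            this.cpuNanos = cpuNanos;
        }
    }

    // For each pair of consecutive samples, returns the fraction of wall time spent on CPU,
    // clamped to [0, 1]. A thread blocked in Object.wait accrues (almost) no CPU time, so it reads ~0.
    static List<Double> perInterval(List<Sample> samples) {
        List<Double> usage = new ArrayList<>();
        for (int i = 1; i < samples.size(); i++) {
            Sample prev = samples.get(i - 1);
            Sample cur = samples.get(i);
            long wallDelta = cur.wallNanos - prev.wallNanos;
            long cpuDelta = cur.cpuNanos - prev.cpuNanos;
            double ratio = wallDelta > 0 ? (double) cpuDelta / wallDelta : 0.0;
            usage.add(Math.max(0.0, Math.min(1.0, ratio)));
        }
        return usage;
    }

    // Desktop-JVM only: cumulative CPU time of a thread, or -1 if unsupported,
    // disabled, or the thread is not alive.
    static long cpuTimeOf(long threadId) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        return bean.isThreadCpuTimeSupported() ? bean.getThreadCpuTime(threadId) : -1L;
    }
}
```

A graph that scales each interval's height by this ratio would draw an idle or blocked thread as flat instead of filling the timeline.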

Another, more far-reaching way to address this would be to integrate Java thread sampling more closely with C++ sampling. If all the Java threads were registered with the C++ sampler, we could sample both the Java and the C++ stack of each Java thread in the same sample and use the same CPU delta for both.
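A hypothetical sketch of what "the same CPU delta for both" means, assuming a combined sample that holds both stacks. The suppliers stand in for the real stack walkers and the thread's CPU clock. The class and field names are illustrative, not the Gecko sampler's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch: one sample captures both stacks and a single CPU delta.
final class CombinedSampler {
    static final class Sample {
        final List<String> javaStack;
        final List<String> nativeStack;
        final long cpuDeltaNanos; // shared by the Java and the C++ stack

        Sample(List<String> javaStack, List<String> nativeStack, long cpuDeltaNanos) {
            this.javaStack = javaStack;
            this.nativeStack = nativeStack;
            this.cpuDeltaNanos = cpuDeltaNanos;
        }
    }

    private final Supplier<List<String>> javaStacks;   // stand-in for walking the Java stack
    private final Supplier<List<String>> nativeStacks; // stand-in for walking the C++ stack
    private final LongSupplier cpuClock;               // cumulative CPU time of the sampled thread
    private long lastCpuNanos;
    final List<Sample> samples = new ArrayList<>();

    CombinedSampler(Supplier<List<String>> javaStacks,
                    Supplier<List<String>> nativeStacks,
                    LongSupplier cpuClock) {
        this.javaStacks = javaStacks;
        this.nativeStacks = nativeStacks;
        this.cpuClock = cpuClock;
        this.lastCpuNanos = cpuClock.getAsLong();
    }

    // Reads the CPU clock once per sample and attaches that delta to both stacks.
    void sample() {
        long now = cpuClock.getAsLong();
        samples.add(new Sample(javaStacks.get(), nativeStacks.get(), now - lastCpuNanos));
        lastCpuNanos = now;
    }
}
```

Reading the clock once per sample means the Java and C++ stacks from that sample always report the same CPU usage.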

mstange, Apr 26 '22 21:04

I discovered a naive way to fix it. We could change the categories of the methods to idle:

I think we can land this naive change (I'm currently working on it) and file a follow-up for a more robust solution like the one mstange suggested.


Since the sampler in the platform code also needs to be updated for the naive solution, I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1766532 as a companion bug.

mcomella, Apr 26 '22 21:04

With my naive suggestion implemented, I found an example of a non-idle use case (https://share.firefox.dev/3xVyoUb):

[Screenshot: profile with the naive change applied]

The code between these two peaks is actually calling GeckoResult.poll -> Object.wait, i.e. it's blocked, not just idle. That makes the naive change very misleading: it looks like this code is doing nothing when it could actually be the source of a slowdown. I think we should do what mstange suggested.

mcomella, Apr 26 '22 22:04

Since the non-naive solution is more complex, I won't be working on it right now.

mcomella, Apr 26 '22 22:04

What I suggested would actually have the same result: the blocking wait doesn't use any CPU time, so the graph will show 0% CPU. The same happens for C++ threads. I was initially very concerned that this would cause confusion, but so far it's been fine, because C++ threads show "jank" bars when their event loop is blocked for more than 50ms.
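A simplified sketch of the 50ms rule, assuming we have start and end timestamps for each event-loop task. It treats any task that keeps the loop busy for more than 50ms as jank. The profiler's actual jank computation may work differently (for example, from event-delay data recorded in samples), and the names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: turn event-loop task timings into "jank" intervals.
final class JankDetector {
    static final long JANK_THRESHOLD_MS = 50;

    // One event-loop task: start and end timestamps in milliseconds.
    static final class Task {
        final long startMs;
        final long endMs;

        Task(long startMs, long endMs) {
            this.startMs = startMs;
            this.endMs = endMs;
        }
    }

    // Returns the tasks that blocked the event loop for more than the threshold.
    static List<Task> jankyTasks(List<Task> tasks) {
        List<Task> janky = new ArrayList<>();
        for (Task t : tasks) {
            if (t.endMs - t.startMs > JANK_THRESHOLD_MS) {
                janky.add(t);
            }
        }
        return janky;
    }
}
```

With this rule, a thread blocked in Object.wait inside a task still shows 0% CPU, but the task itself would be flagged as jank.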

mstange, Apr 26 '22 22:04

So I wonder whether every solution here requires adding jank markers to Java threads. Is it even possible to get event loop information for these threads?
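This thread leaves the question open. As a hypothetical illustration only: for a Java thread whose work is dispatched through an executor, per-task durations can be recorded with the `beforeExecute`/`afterExecute` hooks of `ThreadPoolExecutor`. Many Android threads use `Looper`/`Handler` instead, which this sketch does not cover. `TimedEventLoop` and its fields are invented names.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a single-threaded "event loop" that records how long each task runs,
// so long tasks could later be emitted as jank markers.
final class TimedEventLoop extends ThreadPoolExecutor {
    static final long JANK_THRESHOLD_NANOS = TimeUnit.MILLISECONDS.toNanos(50);

    private final ThreadLocal<Long> startNanos = new ThreadLocal<>();
    final List<Long> taskDurationsNanos = new CopyOnWriteArrayList<>();

    TimedEventLoop() {
        // One worker thread and an unbounded queue: tasks run one at a time, in order.
        super(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    protected void beforeExecute(Thread t, Runnable r) {
        super.beforeExecute(t, r);
        startNanos.set(System.nanoTime()); // runs on the worker thread just before the task
    }

    @Override
    protected void afterExecute(Runnable r, Throwable t) {
        taskDurationsNanos.add(System.nanoTime() - startNanos.get());
        super.afterExecute(r, t);
    }

    // Number of recorded tasks that ran longer than the 50ms threshold.
    long jankCount() {
        return taskDurationsNanos.stream().filter(d -> d > JANK_THRESHOLD_NANOS).count();
    }
}
```

This measures how long each task runs, not how long a queued task waited to start, so it is only one possible input for jank markers.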

mstange, Apr 26 '22 22:04