opencensus-java icon indicating copy to clipboard operation
opencensus-java copied to clipboard

100% cpu usage by OpenCensus.Disruptor-0

Open danielnorberg opened this issue 6 years ago • 11 comments

Please answer these questions before submitting a bug report.

What version of OpenCensus are you using?

0.17.0

What JVM are you using (java -version)?

java version "11.0.1" 2018-10-16 LTS Java(TM) SE Runtime Environment 18.9 (build 11.0.1+13-LTS) Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.1+13-LTS, mixed mode)

What did you do?

Normal opencensus use - low volume workload.

What did you expect to see?

Low cpu usage.

What did you see instead?

The OpenCensus.Disruptor-0 thread spinning and using a whole core even when the application is otherwise idle.

This manifested after we upgraded from Java 8 to Java 11.

danielnorberg avatar Nov 20 '18 07:11 danielnorberg

@danielnorberg @steveniemitz I've made a new SNAPSHOT release which included https://github.com/census-instrumentation/opencensus-java/pull/1605. Can you try it and see if it resolved this issue?

The SNAPSHOT release is under https://oss.sonatype.org/content/repositories/snapshots/io/opencensus/. For example https://oss.sonatype.org/content/repositories/snapshots/io/opencensus/opencensus-api/0.18.0-SNAPSHOT/.

To access it, please add the following lines to ~/.m2/settings.xml:

<profiles>
  <profile>
     <id>allow-snapshots</id>
        <activation><activeByDefault>true</activeByDefault></activation>
     <repositories>
       <repository>
         <id>snapshots-repo</id>
         <url>https://oss.sonatype.org/content/repositories/snapshots</url>
         <releases><enabled>false</enabled></releases>
         <snapshots><enabled>true</enabled></snapshots>
       </repository>
     </repositories>
   </profile>
</profiles>

Then you should be able to use 0.18.0-SNAPSHOT as the OpenCensus version in your gradle/pom file.

songy23 avatar Nov 21 '18 22:11 songy23

With opencensus-impl-0.18.0-20181121.224929-2.jar the OpenCensus.Disruptor-0 thread cpu usage has dropped down to around %10 for me on java 11.

danielnorberg avatar Nov 22 '18 05:11 danielnorberg

@danielnorberg good to hear that. I think we should increase from 1ms to 4 or 8ms the sleeping time. Need to run some benchmarks to prove does not affect the high qps jobs. If you can try to increase that to 4 and 8 and see the impact would be great.

bogdandrutu avatar Nov 22 '18 19:11 bogdandrutu

@bogdandrutu I tried increasing the sleep from 1ms to 4ms and 8ms and got ~4% and ~2% cpu usage, respectively, when idle.

danielnorberg avatar Nov 26 '18 02:11 danielnorberg

Do you think for the moment until we do a better fix 4% is acceptable? If yes I can do a PR or you can send me a PR. Thanks

bogdandrutu avatar Nov 26 '18 03:11 bogdandrutu

I've been running a custom build that replaced SleepingWaitStrategy with BlockingWaitStrategy, which completely removes the CPU overhead of the disruptor thread. However is does add a very slight overhead to the actual tracing calls (since they now need to synchronize/notify the disruptor). For my use-case this overhead was acceptable and better than burning a lot of CPU.

steveniemitz avatar Nov 26 '18 16:11 steveniemitz

@danielnorberg @steveniemitz that would be another option. we can do that by having a manifest file to configure the "low-overhead" version which uses the BlockingWaitStrategy (or an environment variable). Do you prefer this? If so what do you think is better manifest or env variable? Also I will work to build a better implementation that will not require this.

bogdandrutu avatar Nov 26 '18 20:11 bogdandrutu

I do like the idea of being able to configure this more easily. It seems like the SleepingWaitStrategy does have some kinks in it that could be worked out though with some more tuning? It seems like even 4% is a very high overhead for this that most service owners wouldn't want. It also seems like things running in GKE (or AppEngine) are significantly more effected than things that aren't, are there specific kernel modifications there that are causing these issues?

EDIT: To directly address your question, I think I did see a PR around here that allowed configuring the strategy via a ServiceLoader (and manifest). That'd probably be the best solution.

steveniemitz avatar Nov 26 '18 20:11 steveniemitz

Filed an issue on disruptor to allow us to support a better waiting strategy (this is for my long term solution) https://github.com/LMAX-Exchange/disruptor/issues/246. This is mostly an FYI and to keep track of the progress here.

@steveniemitz that PR has not made too much progress. Happy to start a new PR for this if you think that can be a good short term solution (maybe can still be useful after I can implement my long term solution).

bogdandrutu avatar Nov 26 '18 23:11 bogdandrutu

Allowing users to choose the appropriate wait strategy seems like it would be appreciated by some, although I suspect many application developers might not be familiar enough with disruptor and busy-wait/park/yield strategies, contention costs etc to be comfortable with making this choice. I guess most users will want a non-surprising default that performs well enough in most cases.

Defaulting to BlockingWaitStrategy seems like a good choice that would mitigate the unexpected cpu usage of the consumer thread for applications with low-volume workloads. The need for producer threads to synchronize is unfortunate though. Seeing as we have a single consumer thread, perhaps we can get away with a cheaper signaling mechanism. Took a stab at that here https://github.com/census-instrumentation/opencensus-java/pull/1618.

danielnorberg avatar Nov 28 '18 04:11 danielnorberg

We've recently managed to hit a very similar issue to this one at my company. The difference is that we're running java8, the issue showed up when we updated from jdk:8u232 to jdk:8u242.

We're going to try to fix by updating the opencensus library.

xocasdashdash avatar Apr 03 '20 11:04 xocasdashdash