java-dogstatsd-client

Errors in NonBlockingStatsDClient.QueueConsumer are not recoverable

Open deadok22 opened this issue 6 years ago • 1 comment

QueueConsumer does not recover from java.lang.Error instances, and there is no API to schedule a replacement QueueConsumer. As a result, the message queue fills up and no metrics are emitted.

I had an application instance in which an OutOfMemoryError was thrown in QueueConsumer. Here is the stack trace of the thread that was supposed to be running QueueConsumer:

 StatsD-pool-1-thread-1 tid=23 [WAITING] [DAEMON]
sun.misc.Unsafe.park(boolean, long) Unsafe.java
java.util.concurrent.locks.LockSupport.park(Object) LockSupport.java:175
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() AbstractQueuedSynchronizer.java:2039
java.util.concurrent.LinkedBlockingQueue.take() LinkedBlockingQueue.java:442
java.util.concurrent.ThreadPoolExecutor.getTask() ThreadPoolExecutor.java:1067
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1127
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:617
java.lang.Thread.run() Thread.java:745
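
In the trace above, the pool thread is parked in ThreadPoolExecutor.getTask(). It is idle and waiting for a new task, so QueueConsumer is no longer running on it. One way to end up in this state is sketched below. It assumes the consumer was started with ExecutorService.submit(), in which case the Error is captured by the returned Future and the worker thread survives. The class name, queue capacity, and message strings are hypothetical and are not taken from the library.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the client's internals, not the library's actual code.
class StalledConsumerDemo {

    /** Returns how many of 10 messages are rejected once the consumer has failed. */
    static int droppedAfterFailure() throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(4); // small bounded queue
        ExecutorService executor = Executors.newSingleThreadExecutor();

        Runnable consumer = () -> {
            try {
                queue.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            // Simulated failure while handling the first message.
            throw new OutOfMemoryError("simulated");
        };
        // Assumption: submit() is used, so the Error ends up inside the Future.
        Future<?> result = executor.submit(consumer);

        queue.put("first:1|c");
        try {
            result.get(5, TimeUnit.SECONDS);
        } catch (ExecutionException e) {
            // e.getCause() is the OutOfMemoryError. The pool thread survives and
            // goes back to waiting in getTask(), but no consumer is running.
        }

        // Nothing drains the queue any more, so it fills up and rejects messages.
        int dropped = 0;
        for (int i = 0; i < 10; i++) {
            if (!queue.offer("metric:" + i + "|c")) {
                dropped++;
            }
        }
        executor.shutdownNow();
        return dropped;
    }
}
```

With a capacity of 4, the first four offers succeed and the remaining six are rejected, which matches the symptom of a full queue and no emitted metrics.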

Here are some things I think we could do to mitigate that:

  • Minimize the number of allocations in QueueConsumer#run. In particular, packet encoding could be performed in the client threads.
  • Add an API for re-scheduling a failed QueueConsumer.
  • Handle OutOfMemoryError in QueueConsumer (are there other recoverable errors?).
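
The re-scheduling idea could look roughly like the sketch below: a wrapper that re-submits itself to the executor when the wrapped consumer dies with an OutOfMemoryError. ReschedulingRunner and its demo method are hypothetical names used for illustration only. The library exposes no such API, and a real version would probably need a restart limit or a back-off.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper; neither the class nor its methods exist in the library.
class ReschedulingRunner implements Runnable {
    private final Runnable delegate;
    private final ExecutorService executor;
    final AtomicInteger restarts = new AtomicInteger();

    ReschedulingRunner(Runnable delegate, ExecutorService executor) {
        this.delegate = delegate;
        this.executor = executor;
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (OutOfMemoryError e) {
            restarts.incrementAndGet();
            try {
                // Queue a fresh run; it starts once this one returns.
                executor.execute(this);
            } catch (RejectedExecutionException rejected) {
                // The executor is shutting down, so there is nothing to restart.
            }
        }
    }

    /** Runs a consumer that fails twice before succeeding; returns the restart count. */
    static int demo() throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        AtomicInteger calls = new AtomicInteger();
        CountDownLatch succeeded = new CountDownLatch(1);
        Runnable flaky = () -> {
            if (calls.incrementAndGet() <= 2) {
                throw new OutOfMemoryError("simulated");
            }
            succeeded.countDown();
        };
        ReschedulingRunner runner = new ReschedulingRunner(flaky, executor);
        executor.execute(runner);
        boolean ok = succeeded.await(5, TimeUnit.SECONDS);
        executor.shutdownNow();
        return ok ? runner.restarts.get() : -1;
    }
}
```

Because the wrapper is started with execute() and handles the error itself, the failure is no longer swallowed silently, and the restart counter gives callers something to monitor.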

deadok22, Sep 11 '18 08:09

I ran into a very similar problem with com.timgroup.statsd.NonBlockingStatsDClient.StatsDSender. Just catching OutOfMemoryError in the run method would have handled my case.
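
A minimal sketch of that approach, assuming a sender loop that takes messages from a blocking queue and hands them to some transport. ResilientSender, its transport parameter, and the dropped counter are hypothetical. This is not the actual StatsDSender code, and catching OutOfMemoryError does not guarantee that the JVM is in a usable state afterwards.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sender loop, not the library's StatsDSender.
class ResilientSender implements Runnable {
    private final BlockingQueue<String> queue;
    private final Consumer<String> transport; // stand-in for the socket write
    final AtomicInteger dropped = new AtomicInteger();

    ResilientSender(BlockingQueue<String> queue, Consumer<String> transport) {
        this.queue = queue;
        this.transport = transport;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                transport.accept(queue.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            } catch (OutOfMemoryError e) {
                // Drop the current message and keep the loop alive.
                // The handler avoids allocating, since memory is scarce here.
                dropped.incrementAndGet();
            }
        }
    }

    /** Sends "a", "b", "c" where "b" triggers a simulated error; returns what was sent. */
    static List<String> demo() throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("a");
        queue.add("b");
        queue.add("c");
        List<String> sent = new CopyOnWriteArrayList<>();
        CountDownLatch delivered = new CountDownLatch(2);
        ResilientSender sender = new ResilientSender(queue, message -> {
            if (message.equals("b")) {
                throw new OutOfMemoryError("simulated");
            }
            sent.add(message);
            delivered.countDown();
        });
        Thread thread = new Thread(sender);
        thread.setDaemon(true);
        thread.start();
        delivered.await(5, TimeUnit.SECONDS);
        thread.interrupt();
        thread.join(5000);
        return sent;
    }
}
```

In the demo, the message that triggers the simulated error is lost, but the messages queued after it are still delivered, so a single failure no longer stops the sender.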

aburgoyne, Mar 19 '19 20:03