Hystrix
Does Hystrix acquire some resources to start up its thread pool?
We have set the minimum thread pool size to 20 and the maximum to 100,
but at the start of request processing we are getting the exceptions below.
The first few exceptions said: Task java.util.concurrent.FutureTask@1ad6194b rejected from java.util.concurrent.ThreadPoolExecutor@6a085fce[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
The next few exceptions were about GroupService:GetGroupMembers fallback execution rejected.
And finally, the circuit was opened and we started getting these errors: Hystrix circuit short-circuited and is OPEN.
We are confused because, even though the core size is 20, the actual pool size was 1 in the exception logs.
The first command to execute in a threadpool will set it up - there shouldn't be a period where it's set up improperly.
That log makes it appear like the threadpool is only configured with a single thread. How are you configuring it? If you have the config stream enabled (https://github.com/Netflix/Hystrix/wiki/Metrics-and-Monitoring#configuration-stream), can you post the relevant output of how this threadpool is configured?
We have configured the thread pool with multi-threaded behaviour. For now I am providing the hystrix.stream data for the thread pool; please let me know if this is sufficient for looking into this issue.
data: {"type":"HystrixThreadPool","name":"group-service-thread-pool","currentTime":1493747155308,"currentActiveCount":0,"currentCompletedTaskCount":97065375,"currentCorePoolSize":20,"currentLargestPoolSize":100,"currentMaximumPoolSize":100,"currentPoolSize":20,"currentQueueSize":0,"currentTaskCount":97065375,"rollingCountThreadsExecuted":1616,"rollingMaxActiveThreads":5,"rollingCountCommandRejections":0,"propertyValue_queueSizeRejectionThreshold":5,"propertyValue_metricsRollingStatisticalWindowInMilliseconds":10000,"reportingHosts":1}
@JagmohanSharma, are you initiating a significant number of requests before the thread pools are "warmed up"?
@bltb Yes, we are handling around ~150 rps in a Hystrix-protected way. This issue occurred when we deployed a new version of our service and activated it, so it received a good number of requests at that moment and threw these errors for a period of about 4 seconds; after that the thread pool behaved normally.
@JagmohanSharma, when you say "good numbers", how many requests are you receiving in 4 seconds?
@bltb It was around ~150 rps on a single node when we activated the new version to start receiving requests.
I just ran a test which generally simulates your case: a thread pool with coreSize = 20 / maximumSize = 100, and then I gave it 100 concurrent calls.
1494370006900 : main constructing the pool to launch commands from concurrently
1494370006902 : main about to launch the 100 commands concurrently
1494370006920 : main about to start the await
1494370007150 : hystrix-Unit-7 starting
1494370007150 : hystrix-Unit-2 starting
...
1494370007164 : pool-1-thread-99 done with 98
1494370007162 : pool-1-thread-92 done with 91
1494370007164 : main done with await
That is some output from a sample run on my MacBook Pro.
The implication is that it took around 250ms to get all of the Hystrix data structures set up for the first command execution, but there was no cost beyond the first request.
If you're able to provide a similar test that demonstrates your issue, that'd be valuable, as I can't currently replicate the symptoms you've described.
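A minimal sketch of that kind of burst test, in case it helps as a starting point (the "Unit" group key, sizes, and printouts are illustrative rather than the exact test code used above):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class ColdStartBurstTest {

    // Trivial command running on a Hystrix thread pool sized coreSize=20 / maximumSize=100.
    static class UnitCommand extends HystrixCommand<String> {
        UnitCommand() {
            super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("Unit"))
                    .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                            .withCoreSize(20)
                            .withAllowMaximumSizeToDivergeFromCoreSize(true)
                            .withMaximumSize(100)));
        }

        @Override
        protected String run() {
            System.out.println(System.currentTimeMillis() + " : " + Thread.currentThread().getName() + " starting");
            return "ok";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(100);
        CountDownLatch done = new CountDownLatch(100);
        System.out.println(System.currentTimeMillis() + " : main about to launch the 100 commands concurrently");
        for (int i = 0; i < 100; i++) {
            final int n = i;
            // Each caller thread executes one command against the cold Hystrix thread pool.
            pool.submit(() -> {
                try {
                    new UnitCommand().execute();
                    System.out.println(System.currentTimeMillis() + " : " + Thread.currentThread().getName() + " done with " + n);
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        System.out.println(System.currentTimeMillis() + " : main done with await");
        pool.shutdown();
    }
}
```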
@mattrjacobs this can be reproduced in any Hystrix-enabled service: start the application and, immediately afterwards, hit an endpoint protected by Hystrix with 100 concurrent calls, using these thread pool properties for the command key:
hystrix.threadpool.getData.coreSize=20
hystrix.threadpool.getData.allowMaximumSizeToDivergeFromCoreSize=true
hystrix.threadpool.getData.maximumSize=100
I got exceptions like the ones below.
2017-05-25 18:05:09.282 ERROR 20543 --- [io-8080-exec-26] o.a.c.c.C.[.[.[/].[dispatcherServlet] : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is com.netflix.hystrix.exception.HystrixRuntimeException: getData could not be queued for execution and fallback failed.] with root cause
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@5aedf205 rejected from java.util.concurrent.ThreadPoolExecutor@189572bb[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
and the next few requests failed with this:
2017-05-25 18:05:09.283 ERROR 20543 --- [io-8080-exec-16] o.a.c.c.C.[.[.[/].[dispatcherServlet] : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is com.netflix.hystrix.exception.HystrixRuntimeException: getData could not be queued for execution and fallback failed.] with root cause
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@448b9a6e rejected from java.util.concurrent.ThreadPoolExecutor@189572bb[Running, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 0]
But this also got weird, as the next exception had:
java.util.concurrent.ThreadPoolExecutor@189572bb[Running, pool size = 8, active threads = 8, queued tasks = 0, completed tasks = 0]
and then the next exception had:
java.util.concurrent.ThreadPoolExecutor@189572bb[Running, pool size = 5, active threads = 5, queued tasks = 0, completed tasks = 0]
that is, with a reduced pool size.
I think that when we start hitting a Hystrix-protected endpoint right after starting the application under very high load (200+ req/sec), the deployed service instance doesn't get enough time to initialise its threads, which leads to this issue.
In our case we face this issue when we switch (activate) the updated service version under very high load.
@spencergibb @mattrjacobs This looks more like an issue with ThreadPoolExecutor, since the thread pool only creates threads as tasks are submitted to it.
java.util.concurrent.ThreadPoolExecutor gives us the option to pre-start all core threads. So we could override HystrixConcurrencyStrategy.getThreadPool() and call ThreadPoolExecutor.prestartAllCoreThreads() after creating the ThreadPoolExecutor there; this starts all core threads, causing them to idly wait for work.
But the remaining issue is that the Hystrix thread pool is only initialised when the first user request arrives, and only then would prestartAllCoreThreads() be called; there is currently no provision to pre-start the Hystrix thread pool before the first user request. Can we take this as an enhancement, so that the Hystrix thread pool can be pre-started before the first user request arrives?
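A minimal sketch of that idea, assuming Hystrix 1.5.x (the class name is hypothetical; the strategy delegates to the default pool creation and then pre-starts the core threads, which of course only helps once Hystrix has actually created the pool):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import com.netflix.hystrix.HystrixThreadPoolKey;
import com.netflix.hystrix.strategy.concurrency.HystrixConcurrencyStrategy;
import com.netflix.hystrix.strategy.properties.HystrixProperty;

// Sketch: pre-start all core threads as soon as Hystrix creates the executor.
public class PrestartingConcurrencyStrategy extends HystrixConcurrencyStrategy {

    @Override
    public ThreadPoolExecutor getThreadPool(HystrixThreadPoolKey threadPoolKey,
                                            HystrixProperty<Integer> corePoolSize,
                                            HystrixProperty<Integer> maximumPoolSize,
                                            HystrixProperty<Integer> keepAliveTime,
                                            TimeUnit unit,
                                            BlockingQueue<Runnable> workQueue) {
        ThreadPoolExecutor executor = super.getThreadPool(
                threadPoolKey, corePoolSize, maximumPoolSize, keepAliveTime, unit, workQueue);
        // Create all core threads up front instead of lazily, one per submitted task.
        executor.prestartAllCoreThreads();
        return executor;
    }
}

// Registered once at startup, before any command executes:
// HystrixPlugins.getInstance().registerConcurrencyStrategy(new PrestartingConcurrencyStrategy());
```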
@mattrjacobs In situations like these bursts of requests, besides using a queue or increasing the maximum pool size, do we have any other option in the Hystrix thread pool to accommodate this?
Queues can be used when a burst of requests needs to be absorbed, and I understand that is also done at Netflix for these kinds of use cases. But is there any other option, or a planned enhancement, for pre-initialising the Hystrix thread pool where required?
Can you please suggest an approach here, as we are facing this situation every week?
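For reference, a queue-based configuration along those lines would look roughly like the sketch below (the sizes are only illustrative; maxQueueSize is fixed once the thread pool has been created, while queueSizeRejectionThreshold can be tuned dynamically):

```java
import com.netflix.hystrix.HystrixThreadPoolProperties;

class QueueBackedPoolConfig {
    // Illustrative sizing: a small queue to absorb short bursts beyond coreSize.
    static final HystrixThreadPoolProperties.Setter BURST_TOLERANT =
            HystrixThreadPoolProperties.Setter()
                    .withCoreSize(20)
                    .withMaxQueueSize(50)                   // fixed at pool creation time
                    .withQueueSizeRejectionThreshold(50);   // dynamic; requests beyond this are rejected
}
```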
We use queues very sparingly (I believe in 1 thread pool out of ~100) in the application I help operate.
One idea would be to build a version of Hystrix that prestarts all core threads and see how that performs.
This discussion seems similar to #1596 and also https://github.com/Netflix/Hystrix/issues/1554#issuecomment-298792781.
Having done some reasonably in-depth experiments using queues and Hystrix, we found that, in order to handle bursty traffic, the Apache Tomcat ThreadPoolExecutor implementation worked better than the Java default. We just extended HystrixConcurrencyStrategy.
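Roughly, the idea looked like the sketch below (assuming Tomcat 8.x/9.x, where org.apache.tomcat.util.threads.ThreadPoolExecutor extends the JDK executor and, together with TaskQueue, prefers creating threads up to maximumSize before queueing; the class name is illustrative and this is not the exact code we run):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

import com.netflix.hystrix.HystrixThreadPoolKey;
import com.netflix.hystrix.strategy.concurrency.HystrixConcurrencyStrategy;
import com.netflix.hystrix.strategy.properties.HystrixProperty;

import org.apache.tomcat.util.threads.TaskQueue;
import org.apache.tomcat.util.threads.ThreadPoolExecutor;

// Sketch: back Hystrix commands with Tomcat's ThreadPoolExecutor instead of the JDK default.
public class TomcatPoolConcurrencyStrategy extends HystrixConcurrencyStrategy {

    @Override
    public java.util.concurrent.ThreadPoolExecutor getThreadPool(HystrixThreadPoolKey threadPoolKey,
                                                                 HystrixProperty<Integer> corePoolSize,
                                                                 HystrixProperty<Integer> maximumPoolSize,
                                                                 HystrixProperty<Integer> keepAliveTime,
                                                                 TimeUnit unit,
                                                                 BlockingQueue<Runnable> workQueue) {
        // The passed-in workQueue is ignored in this sketch; TaskQueue cooperates with
        // Tomcat's executor so that new threads are preferred over queueing during a burst.
        TaskQueue queue = new TaskQueue();
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                corePoolSize.get(), maximumPoolSize.get(), keepAliveTime.get(), unit, queue);
        queue.setParent(executor);
        return executor;
    }
}
```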
Thanks for the tip @bltb! Will point people towards that if it comes up again.
You can look at the JDK ThreadPoolExecutor.addWorker(Runnable firstTask, boolean core) method:
- the rejection decision is based on ctl (the internal worker count),
- but the ThreadPoolExecutor.toString() that is printed in the message uses workers.size().
The ctl increment happens before workers.add, so the surprisingly small "pool size" is just a concurrency artifact of printing the ThreadPoolExecutor, not its real state.
I think the essence of the problem is that the application performs poorly when it is first started, because various resources are not ready yet. You could try warming up the application.
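A warm-up could be as simple as executing a cheap command against each Hystrix thread pool before the instance starts taking real traffic, so the pool and metrics structures are built in advance. A sketch, where WarmUpCommand and the group key are hypothetical:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Hypothetical no-op command that shares the same group (and therefore the same
// thread pool) as the real commands, so executing it initialises that pool.
class WarmUpCommand extends HystrixCommand<Void> {
    WarmUpCommand() {
        super(HystrixCommandGroupKey.Factory.asKey("GroupService"));
    }

    @Override
    protected Void run() {
        return null;   // does nothing; its only purpose is to trigger pool initialisation
    }
}

// Run once during startup, before the load balancer routes traffic to the instance:
// new WarmUpCommand().execute();
```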
@mattrjacobs, we had similar concerns about how Hystrix handles bursty traffic beyond coreSize threads without a queue/buffer. In a scenario where all coreSize (10) threads are busy, maximumSize (100) is large, and there is no queue (queueSize 0), does Hystrix gracefully handle all requests up to maximumSize, or does it reject some while the new threads are being created? I was looking for a test case around this scenario.