Hystrix
Circuit breaker does not close even after remote service has recovered
We've had the following problem occur three times in about a month:
- Remote service has issues and does not respond within the set Hystrix timeout period
- Circuit breaker opens
- Remote service recovers
- Circuit breaker does not close as expected
We're running a service on two nodes. A case occurred where one node's circuit breaker properly closed, and one remained open, while both nodes are talking to the exact same remote service:
At the same time, the circuit breaker for a call to another endpoint on that same remote service remained open, with no requests at all bypassing the circuit breaker:
We have hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds set to 1000, which I would expect to let one request per second bypass the circuit breaker.
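To make the expectation about the sleep window concrete, here is a minimal sketch of an open circuit that lets a single trial request through once the window has elapsed. It is a simplified model, not Hystrix's implementation: the class `SleepWindowModel`, its methods, and the explicit `nowMs` timestamps are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of a circuit breaker sleep window.
// NOT Hystrix source code; all names are illustrative.
class SleepWindowModel {
    enum Status { CLOSED, OPEN, HALF_OPEN }

    private final long sleepWindowMs;
    private final AtomicReference<Status> status = new AtomicReference<>(Status.OPEN);
    private final AtomicLong openedAtMs;

    SleepWindowModel(long sleepWindowMs, long openedAtMs) {
        this.sleepWindowMs = sleepWindowMs;
        this.openedAtMs = new AtomicLong(openedAtMs);
    }

    // Returns true if a request arriving at nowMs may bypass the breaker.
    boolean allowRequest(long nowMs) {
        if (status.get() == Status.CLOSED) {
            return true;
        }
        if (nowMs <= openedAtMs.get() + sleepWindowMs) {
            return false; // still inside the sleep window
        }
        // Only the first caller after the window wins the single trial slot.
        return status.compareAndSet(Status.OPEN, Status.HALF_OPEN);
    }

    // Trial request failed: reopen and restart the sleep window.
    void markFailure(long nowMs) {
        if (status.compareAndSet(Status.HALF_OPEN, Status.OPEN)) {
            openedAtMs.set(nowMs);
        }
    }

    // Trial request succeeded: close the circuit.
    void markSuccess() {
        status.compareAndSet(Status.HALF_OPEN, Status.CLOSED);
    }

    Status status() {
        return status.get();
    }
}
```

With a 1000 ms window, this model admits one trial request roughly once per second while the remote service keeps failing, and closes as soon as a trial succeeds.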
I've tried to reproduce this in an integration test, where I simulated the remote service timing out/producing errors, and every single time the circuit breaker successfully opens and closes. So unfortunately at this moment I am unable to provide an exact reproduction path.
Some other information
- This occurred in a medium-traffic environment, where the remote services were called with a frequency of about 25 to 300 requests/sec
- Using Hystrix 1.5.12, Spring-Cloud Netflix Dalston.SR1 and Javanica Hystrix Annotations.
We might be experiencing the exact same issue but it's really hard to reproduce (it only took place once in PROD and twice under "extreme" load testing).
I'll try to find the time for more digging
(we're currently using 1.5.13)
We're having the same issue. Hystrix version is 1.5.12 and we're also using Javanica.
I've spent a little more time trying to reproduce the case, without any success. Having said that, going back to the logs, I've seen that there's been an important refactor around HystrixCircuitBreaker introduced in version 1.5.12.
Now, I'm going to take a wild guess here, but is it possible that there's a race condition between markSuccess and line 192, which is part of the metrics.getHealthCountsStream() subscriber? There's a small window where:
- A transition HALF-OPEN -> CLOSE happens
- In line 192, the status trips from CLOSE -> OPEN
- The metrics stream gets reset (line 205)
Why am I mentioning this? Because in my specific scenario, the CB gets stuck cycling through OPEN -> HALF-OPEN -> CLOSE, only to immediately trip to OPEN again, and so on.
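The suspected interleaving can be made concrete with a small deterministic model. This is only a sketch of the hypothesis, not Hystrix's code: the class `RaceSketch`, its fields, and its method names are invented for illustration, and the stale failure counts are hard-coded.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of the suspected interleaving; NOT Hystrix source code.
class RaceSketch {
    enum Status { CLOSED, OPEN, HALF_OPEN }

    final AtomicReference<Status> status = new AtomicReference<>(Status.HALF_OPEN);

    // Stale failures still sitting in the rolling window from the outage.
    int windowErrors = 20;
    int windowTotal = 20;

    // First half of a successful trial: HALF_OPEN -> CLOSED.
    boolean closeAfterTrialSuccess() {
        return status.compareAndSet(Status.HALF_OPEN, Status.CLOSED);
    }

    // Health-counts callback: trips the circuit if the window looks unhealthy.
    boolean onHealthCounts(int requestVolumeThreshold, int errorThresholdPercentage) {
        if (windowTotal < requestVolumeThreshold) {
            return false;
        }
        if (windowErrors * 100 / windowTotal < errorThresholdPercentage) {
            return false;
        }
        return status.compareAndSet(Status.CLOSED, Status.OPEN);
    }

    // Second half of a successful trial: the metrics window is reset.
    void resetMetrics() {
        windowErrors = 0;
        windowTotal = 0;
    }
}
```

In this model, calling `closeAfterTrialSuccess()`, then `onHealthCounts(10, 50)`, then `resetMetrics()` leaves the status at OPEN even though the trial succeeded. If the reset runs before the callback, the status stays CLOSED. Whether the real code permits the first ordering is exactly the open question.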
Hi, we're having the same issue (version 1.5.12): the circuit randomly OPENS AND NEVER CLOSES AGAIN, even if the backend service works properly.
This is our config (we know it is very conservative):
- metrics.rollingStats.timeInMilliseconds = 10000
- hystrix.command.default.circuitBreaker.requestVolumeThreshold = 10
- hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds = 1000
- hystrix.command.default.circuitBreaker.errorThresholdPercentage = 50
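As a rough illustration of how sensitive these values are, the sketch below models the trip decision as documented for these properties: the circuit trips once the rolling window holds at least requestVolumeThreshold requests and the error percentage reaches errorThresholdPercentage. The class `TripRule` and the method `shouldTrip` are hypothetical names, and this is a simplified model rather than the library's code.

```java
// Hypothetical, simplified model of the trip decision; NOT Hystrix source code.
class TripRule {
    // total and errors are the counts observed in the current rolling window.
    static boolean shouldTrip(long total, long errors,
                              int requestVolumeThreshold,
                              int errorThresholdPercentage) {
        if (total < requestVolumeThreshold) {
            return false; // too little traffic to judge health
        }
        // Integer percentage, truncated toward zero.
        int errorPercentage = (int) ((double) errors / total * 100);
        return errorPercentage >= errorThresholdPercentage;
    }
}
```

Under this model and the values listed above, 5 failures out of 10 requests inside one 10-second window are enough to open the circuit, for example `TripRule.shouldTrip(10, 5, 10, 50)`.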
No more "success", "badRequest" or "failures". In fact, the command no longer gets executed (we can see this on other dashboards such as RestClient metrics).
Suggestion: it would be useful for debugging if the circuit "open" and "close" events could be logged via HystrixEventNotifier.markEvent(...)
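Until something like that exists, one possible workaround is to poll the breaker state and log the transitions yourself. The helper below is a hypothetical sketch, not part of Hystrix: `CircuitTransitionLogger` and its methods are invented names, and the probe is a plain `BooleanSupplier`. In a real application the probe might wrap `HystrixCircuitBreaker.isOpen()`, but that wiring is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical helper, NOT part of Hystrix: records OPEN/CLOSED transitions
// observed by polling a caller-supplied "is the circuit open?" probe.
class CircuitTransitionLogger {
    private final String name;
    private final BooleanSupplier isOpen;
    private final List<String> events = new ArrayList<>();
    private Boolean lastOpen; // null until the first poll sets a baseline

    CircuitTransitionLogger(String name, BooleanSupplier isOpen) {
        this.name = name;
        this.isOpen = isOpen;
    }

    // Call periodically, e.g. from a scheduled task.
    void poll() {
        boolean open = isOpen.getAsBoolean();
        if (lastOpen != null && lastOpen != open) {
            String event = name + ": " + label(lastOpen) + " -> " + label(open);
            events.add(event);
            System.out.println(event);
        }
        lastOpen = open;
    }

    List<String> events() {
        return events;
    }

    private static String label(boolean open) {
        return open ? "OPEN" : "CLOSED";
    }
}
```

Polling only sees the state at each sample, so transitions shorter than the polling interval can be missed.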
For now, we're relaxing the configuration, downgrading to 1.5.11 and monitoring whether the issue persists.
Update: with the downgrade we no longer experience the circuit blocked in the OPEN state.
However, randomly, when a backend service begins to fail, once the circuit transitions to CLOSE, the behavior changes and the circuit begins to OPEN and CLOSE repeatedly (in a "twitchy" state, as if the stats were not reset). We have to re-deploy the servers for the circuits to stay CLOSED with the same backend error rate.
There are a lot of open issues and PRs addressing similar issues.
Are there some guidelines (besides the documentation) about safe configuration values for Circuit Breakers?
I've seen this problem occur as well and downgrading to 1.5.11 fixed it.
Also seems to happen when the backend service is slow (I simulated with a fixed delay of ~1500ms)
I'm also having this problem in production with version 1.5.13.
In our logs I can see the following:
- A dependency became unhealthy and all the requests to it started returning a 503
- For 10 minutes I can see hystrix trying to test if the dependency is back to normal. That is, after several instances of the "short-circuited and no fallback available" error I see one instance of the 503 error. The time between the 503 errors is approx. the value of the circuitBreaker.sleepWindowInMilliseconds setting (5 seconds).
- After 10 minutes of this behavior we only see the "short-circuited and no fallback available" error. It's almost like hystrix gave up trying to see if the dependency was healthy and the circuit stayed permanently open.
After all that, we had to restart the service to see the circuit closed again.
We haven't tried downgrading to 1.5.11 but we plan to do that soon.
Probably a regression in 1.5.13 caused by https://github.com/Netflix/Hystrix/commit/a8203446b90333f005c88cacf91a5dc8faf07c1c#diff-82a974c5de99c7b7fa59df2c2b823ae1R385. Also see https://github.com/Netflix/Hystrix/issues/1723
Is there a solution for this issue? Right now my only option is to disable circuitBreaker :(
Sounds like this: https://github.com/Netflix/Hystrix/pull/1640
Are there any maintainers looking into this?
I have the same issue. I would like to see the circuit breaker closed again. Will that feature be implemented?
@davidvara Just to let you know, I decided to downgrade to 1.5.11 (1.5.12 introduced a big refactor of the circuit breaker) and see what happens.
We ran into this. This is a very serious issue. It basically is a showstopper for using Hystrix. Hopefully it will be addressed soon (or the bad code backed out).
+1
Update: after downgrading to 1.5.11, I haven't seen this issue so far.
Does anyone know if this is fixed under the last release, 1.5.18 of 16 Nov 2018?
So far, it seems the best approach is to downgrade to 1.5.11, isn't it?
1.5.11 was rereleased as 1.5.18, so they're the same: https://github.com/Netflix/Hystrix/releases/tag/v1.5.18
And Hystrix is no longer in active development after this release: https://github.com/Netflix/Hystrix#hystrix-status
Hello,
We are using hystrix-core:jar:1.5.6 and experiencing the same issue.
Error: com.netflix.hystrix.exception.HystrixRuntimeException:xx.xx short-circuited and fallback disabled
So it doesn't look like the problem is specific to 1.5.12, does it?
This problem occurs with older versions of Hystrix too. Unfortunately, I am unable to reproduce the issue in my local environment.