
Circuit breaker does not close even after remote service has recovered

utwyko opened this issue 7 years ago • 18 comments

We've had the following problem occur three times in about a month:

  • Remote service has issues and does not respond within the set Hystrix timeout period
  • Circuit breaker opens
  • Remote service recovers
  • Circuit breaker does not close as expected

We're running the service on two nodes. In one case, one node's circuit breaker closed properly while the other's remained open, even though both nodes were talking to the exact same remote service:

[screenshot]

At the same time, the circuit breaker for a call to another endpoint on that same remote service remained open, with no requests at all bypassing the circuit breaker:

[screenshot]

We have hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds set to 1000, which I would expect to let one request per second bypass the circuit breaker.
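
For illustration, this is roughly how our Javanica-annotated commands are wired up (a simplified sketch with made-up class and method names, not our actual client code; the sleep window is really set globally through the hystrix.command.default property above and is shown per-command here only to make the value explicit):

    import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
    import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;

    public class RemoteServiceClient {

        // Property names are the standard Hystrix keys, without the
        // "hystrix.command.default." prefix used in the configuration files.
        @HystrixCommand(
            fallbackMethod = "statusFallback",
            commandProperties = {
                @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "1000")
            })
        public String fetchStatus() {
            // call to the remote endpoint goes here
            return "OK";
        }

        private String statusFallback() {
            return "FALLBACK";
        }
    }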

I've tried to reproduce this in an integration test, where I simulated the remote service timing out/producing errors, and every single time the circuit breaker opened and closed successfully. So unfortunately I cannot provide an exact reproduction path at the moment.

Some other information

  • This occurred in a medium-traffic environment, where the remote services were called with a frequency of about 25 to 300 requests/sec
  • Using Hystrix 1.5.12, Spring-Cloud Netflix Dalston.SR1 and Javanica Hystrix Annotations.

utwyko avatar Jul 19 '17 08:07 utwyko

We might be experiencing the exact same issue but it's really hard to reproduce (it only took place once in PROD and twice under "extreme" load testing).

I'll try to find the time for more digging

(we're currently using 1.5.13)

gszeliga avatar Oct 04 '17 15:10 gszeliga

We're having the same issue. Hystrix version is 1.5.12 and we're also using Javanica.

jgaribaldi avatar Oct 31 '17 13:10 jgaribaldi

I've spent a little more time trying to reproduce the case, without any success. Having said that, going back to the logs, I've seen that an important refactor of HystrixCircuitBreaker was introduced in version 1.5.12.

Now, I'm going to take a wild guess here, but is it possible that there's a race condition between markSuccess and line 192 in the metrics.getHealthCountsStream() subscriber? There's a small window where:

  • A HALF-OPEN -> CLOSED transition happens
  • At line 192, the status trips from CLOSED -> OPEN
  • The metrics stream gets reset (line 205)

Why am I mentioning this? Because in my specific scenario, the CB gets stuck cycling OPEN -> HALF-OPEN -> CLOSED and then immediately trips back to OPEN, and so on.
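
To make the window I mean concrete, here is a heavily simplified sketch of the two code paths as I understand them (this is not the real HystrixCircuitBreaker source; names and structure are approximate):

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.AtomicReference;

    // Heavily simplified sketch of the suspected race, NOT the real HystrixCircuitBreaker.
    public class CircuitBreakerRaceSketch {

        enum Status { CLOSED, OPEN, HALF_OPEN }

        private final AtomicReference<Status> status = new AtomicReference<>(Status.CLOSED);
        private final AtomicLong circuitOpened = new AtomicLong(-1);

        // Roughly what the metrics.getHealthCountsStream() subscriber does around "line 192":
        // it only looks at the (possibly stale) error percentage and the CLOSED status.
        void onHealthCounts(double errorPercentage, double threshold) {
            if (errorPercentage >= threshold) {
                // If markSuccess() has just moved HALF_OPEN -> CLOSED but the stream still
                // carries pre-recovery counts, this trips the breaker straight back to OPEN.
                if (status.compareAndSet(Status.CLOSED, Status.OPEN)) {
                    circuitOpened.set(System.currentTimeMillis());
                }
            }
        }

        // Roughly what markSuccess() does after a successful trial call in HALF_OPEN.
        void markSuccess() {
            if (status.compareAndSet(Status.HALF_OPEN, Status.CLOSED)) {
                // The reset of the metrics stream happens here ("line 205"); until the next
                // emission, a stale unhealthy HealthCounts may still be in flight.
                circuitOpened.set(-1);
            }
        }
    }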

gszeliga avatar Nov 07 '17 16:11 gszeliga

Hi, we're having the same issue (version 1.5.12): the circuit randomly OPENS AND NEVER CLOSES AGAIN, even though the backend service works properly.

This is our config (we know it's very conservative); a Setter-style sketch of the same values follows the list:

  • metrics.rollingStats.timeInMilliseconds = 10000
  • hystrix.command.default.circuitBreaker.requestVolumeThreshold = 10
  • hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds = 1000
  • hystrix.command.default.circuitBreaker.errorThresholdPercentage = 50
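
(Sketch only: the same values expressed through the programmatic Setter API. We actually set them through the hystrix.command.default properties listed above, and the group key below is made up.)

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandProperties;

    // Sketch: the same conservative values expressed via the Setter API.
    public class ConservativeCommandConfig {
        public static HystrixCommand.Setter conservativeSetter() {
            return HystrixCommand.Setter
                    .withGroupKey(HystrixCommandGroupKey.Factory.asKey("RemoteService")) // hypothetical group key
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                            .withMetricsRollingStatisticalWindowInMilliseconds(10000)
                            .withCircuitBreakerRequestVolumeThreshold(10)
                            .withCircuitBreakerSleepWindowInMilliseconds(1000)
                            .withCircuitBreakerErrorThresholdPercentage(50));
        }
    }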

No more "success", "badRequest" or "failures". In fact, the command no longer get's executed (we can view this on other dashboards such as RestClient metrics).

Suggestion: it would be useful for debugging if the circuit "open" and "close" events could be logged via HystrixEventNotifier.markEvent(...).
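
In the meantime, here is a sketch of what a workaround could look like: markEvent(...) does not receive explicit open/close transitions today, but logging SHORT_CIRCUITED events at least shows when a breaker is rejecting calls (class and logger names are made up, and SLF4J is assumed to be on the classpath):

    import com.netflix.hystrix.HystrixCommandKey;
    import com.netflix.hystrix.HystrixEventType;
    import com.netflix.hystrix.strategy.HystrixPlugins;
    import com.netflix.hystrix.strategy.eventnotifier.HystrixEventNotifier;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch of a notifier that logs short-circuit events so we can see when a breaker is open.
    public class LoggingEventNotifier extends HystrixEventNotifier {

        private static final Logger log = LoggerFactory.getLogger(LoggingEventNotifier.class);

        @Override
        public void markEvent(HystrixEventType eventType, HystrixCommandKey key) {
            if (eventType == HystrixEventType.SHORT_CIRCUITED) {
                log.warn("Circuit open for command {}", key.name());
            }
        }

        // Register once at application startup, before any command runs.
        public static void register() {
            HystrixPlugins.getInstance().registerEventNotifier(new LoggingEventNotifier());
        }
    }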

For now, we're relaxing the configuration, downgrading to 1.5.11, and monitoring whether the issue persists.

[screenshot]

francovitali avatar Dec 04 '17 14:12 francovitali

Update: with the downgrade we no longer see the circuit stuck in the OPEN state.

However, randomly, when a backend service begins to fail and the circuit later transitions back to CLOSED, the behavior changes and the circuit begins to OPEN and CLOSE repeatedly (a "twitchy" state, as if the stats were not reset). We have to redeploy the servers to get the circuits to stay CLOSED at the same backend error rate.

There are a lot of open issues and PRs addressing similar problems.

Are there some guidelines (besides the documentation) about safe configuration values for Circuit Breakers?

francovitali avatar Dec 28 '17 16:12 francovitali

I've seen this problem occur as well, and downgrading to 1.5.11 fixed it. It also seems to happen when the backend service is slow (I simulated this with a fixed delay of ~1500ms).

JayeshS avatar Jan 03 '18 02:01 JayeshS

I'm also having this problem in production with version 1.5.13. In our logs I can see the following:

  • A dependency became unhealthy and all the requests to it started returning a 503
  • For 10 minutes I can see Hystrix trying to test whether the dependency is back to normal. That is, after several instances of the short-circuited and no fallback available error, I see one instance of the 503 error. The time between the 503s is approximately the value of the circuitBreaker.sleepWindowInMilliseconds setting (5 seconds).
  • After 10 minutes of this behavior, we only see the short-circuited and no fallback available error. It's almost as if Hystrix gave up checking whether the dependency was healthy, and the circuit stayed permanently open.

After all that, we had to restart the service to see the circuit closed again. We haven't tried downgrading to 1.5.11 but we plan to do that soon.

asolanaruiz avatar Feb 10 '18 23:02 asolanaruiz

Probably a regression in 1.5.13 caused by https://github.com/Netflix/Hystrix/commit/a8203446b90333f005c88cacf91a5dc8faf07c1c#diff-82a974c5de99c7b7fa59df2c2b823ae1R385. Also see https://github.com/Netflix/Hystrix/issues/1723

bedrin avatar Feb 14 '18 10:02 bedrin

Is there a solution for this issue? Right now my only option is to disable circuitBreaker :(

litalk avatar Mar 20 '18 07:03 litalk

Sounds like this: https://github.com/Netflix/Hystrix/pull/1640

Are there any maintainers looking into this?

jiacai2050 avatar Jun 11 '18 10:06 jiacai2050

I have the same issue. I would like to see the circuit breaker close again. Will that be fixed?

davidvara avatar Jun 18 '18 05:06 davidvara

@davidvara Just to let you know, I decided to downgrade to 1.5.11 (1.5.12 introduced a big refactor of the circuit breaker) and will see what happens.

jiacai2050 avatar Jun 18 '18 13:06 jiacai2050

We ran into this. This is a very serious issue. It basically is a showstopper for using Hystrix. Hopefully it will be addressed soon (or the bad code backed out).

petropolis avatar Jul 11 '18 17:07 petropolis

+1

godofwharf avatar Nov 12 '18 05:11 godofwharf

Update: after downgrading to 1.5.11, I haven't seen this issue so far.

jiacai2050 avatar Nov 12 '18 07:11 jiacai2050

Does anyone know if this is fixed in the latest release, 1.5.18 of 16 Nov 2018?

So far, it seems the best approach is to downgrade to 1.5.11, isn't it?

antdavidl avatar Feb 23 '19 07:02 antdavidl

1.5.11 was re-released as 1.5.18, so they're the same: https://github.com/Netflix/Hystrix/releases/tag/v1.5.18

And Hystrix is no longer in active development after this release: https://github.com/Netflix/Hystrix#hystrix-status

breun avatar Feb 23 '19 08:02 breun

Hello,

We are using hystrix-core:jar:1.5.6 and experiencing the same issue.

Error: com.netflix.hystrix.exception.HystrixRuntimeException:xx.xx short-circuited and fallback disabled

So it doesn't look like the problem was introduced in 1.5.12, does it?

This problem exists with older versions of Hystrix too. Unfortunately, I am unable to reproduce the issue in my local environment.

rx091v avatar Dec 11 '20 11:12 rx091v