cf-abacus

Bridge healthcheck stays in failed state until a successful event is received

Open amhuber opened this issue 5 years ago • 3 comments

The bridge healthcheck logic at https://github.com/cloudfoundry-incubator/cf-abacus/blob/master/lib/utils/bridge/src/healthchecker.js sets isFailing any time a failure event is received, but that state is only ever cleared by a subsequent success event. If a failure has occurred and no success event is received for a lengthy period (for example, no apps have been stopped or started), then the healthcheck stays in a failed state until the bridge is restarted.

It would make more sense to reset isFailing after the threshold has expired.
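A minimal sketch of one way to read that suggestion, not the actual cf-abacus code. The names `createExpiringMonitor`, `thresholdMs`, the injected `now` clock and the `'usage.success'` event name are all assumptions made for illustration. Instead of a boolean flag, the monitor remembers when the last failure happened and treats it as stale once the threshold has elapsed:

```javascript
const { EventEmitter } = require('events');

// Hypothetical monitor: a failure only counts while it is younger than
// thresholdMs. The `now` parameter is injected so the clock can be faked.
const createExpiringMonitor = (emitter, thresholdMs, now = Date.now) => {
  let lastFailureAt = null;

  emitter.on('usage.failure', () => {
    lastFailureAt = now();
  });

  // Assumed event name; a success still clears the failure immediately.
  emitter.on('usage.success', () => {
    lastFailureAt = null;
  });

  return {
    healthy: () =>
      lastFailureAt === null || now() - lastFailureAt >= thresholdMs
  };
};

// Usage with a fake clock: one failure, then no further events at all.
let currentTime = 0;
const bridge = new EventEmitter();
const monitor = createExpiringMonitor(bridge, 60000, () => currentTime);

bridge.emit('usage.failure');
const duringWindow = monitor.healthy(); // unhealthy while the failure is recent

currentTime = 60000;
const afterWindow = monitor.healthy(); // healthy again without a success event
```

With this shape a quiet environment recovers on its own, at the cost of also hiding a failure that is still being retried once the threshold passes, so the threshold would need to be chosen with the retry behavior in mind.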

amhuber • Aug 23 '18 23:08

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/160008906

The labels on this GitHub issue will be updated when the story is started.

cf-gitbot • Aug 23 '18 23:08

Hello @amhuber, looking at the code, I can tell that a failure can occur only when 'usage.failure' is emitted. This happens when a particular event from Cloud Controller is not accepted by Abacus. In this case the bridge keeps retrying the same event, and no other events from Cloud Controller are taken into account. The healthcheck stays in a failed state and turns healthy once the event is successfully accepted. The bridge will then read other events (if there are any) from Cloud Controller. The code has been refactored since the issue was created. Can you please describe how you reproduced your scenario?

denicaM • Nov 15 '18 08:11

The relevant code was just moved in the refactor but it doesn't appear to have changed significantly. As far as I can see, this is what is happening:

  • In https://github.com/cloudfoundry-incubator/cf-abacus/blob/master/lib/utils/healthmonitor/src/index.js#L10-L17 any failure will set isFailing to true
  • The only way to change isFailing to false is for the onSuccess event to fire (https://github.com/cloudfoundry-incubator/cf-abacus/blob/master/lib/utils/healthmonitor/src/index.js#L19-L22)
  • The health check will report as failed if isFailing is true (https://github.com/cloudfoundry-incubator/cf-abacus/blob/master/lib/utils/healthmonitor/src/index.js#L29)
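The three points above can be sketched roughly as follows. This is a simplified stand-in, not the code in the linked file: `createLatchingMonitor` is a hypothetical name, `'usage.success'` is an assumed event name, and the threshold handling of the real monitor is left out.

```javascript
const { EventEmitter } = require('events');

// Hypothetical, simplified monitor that latches on failure.
const createLatchingMonitor = (emitter) => {
  let isFailing = false;

  // Any failure sets the flag.
  emitter.on('usage.failure', () => {
    isFailing = true;
  });

  // Only a success event (assumed name) clears it again.
  emitter.on('usage.success', () => {
    isFailing = false;
  });

  // The health check reports failed for as long as the flag is set.
  return { healthy: () => !isFailing };
};

const bridge = new EventEmitter();
const monitor = createLatchingMonitor(bridge);

bridge.emit('usage.failure');
// If no success event ever follows, monitor.healthy() keeps returning false.
```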

This becomes an issue in environments where we don't have any services in CF. If there is an issue with the CC then a failure event can be triggered in the abacus-services-bridge, but since there are no services there will never be an onSuccess event, so the bridge healthcheck reports as failed until the bridge is restarted. The only resolution on our end is to not monitor the abacus-services-bridge healthcheck in environments that don't have any services, but it still seems like the healthcheck logic could be improved.

amhuber • Nov 15 '18 16:11