traefik icon indicating copy to clipboard operation
traefik copied to clipboard

Adding passive health checks

Open Nelwhix opened this issue 1 year ago • 5 comments

What does this PR do?

Adding Passive Health Checks

Motivation

Fixes #11306

More

  • [ x] Added/updated tests
  • [ ] Added/updated documentation

Additional Notes

Nelwhix avatar Dec 17 '24 03:12 Nelwhix

My goal for how this will work will be

http:
  services:
    my-service:
      loadBalancer:
        servers:
        - url: "http://<private-ip-server-1>:<private-port-server-1>/"
           healthCheck:
             maxFails: 3
             failTimeout: 30

I added the implementation to just http services for now.

Nelwhix avatar Dec 17 '24 03:12 Nelwhix

Hello @Nelwhix and thanks for your contribution.

Before reviewing the content of the PR I wanted to discuss how this feature is configured. First of all, I really like the failTimeout and maxFails, it's very easy to understand. In the future, if needed, we could even add an additional recoverTimeout to define how long to wait before recovering a server, and this without breaking change.

Though, I'm wondering if we really want this to be configured at the server level. This would be the right spot if we wanted each server to have a different passive healthcheck config, or a mix of server with and without passive healthcheck under the same loadbalancer. I might be wrong but I have the feeling that most of the time we want to have a uniform config on the loadbalancer.

Did you considered moving this configuration one layer up in the loadbalancer?

jspdown avatar Mar 14 '25 09:03 jspdown

Hi @jspdown, I was trying to achieve nginx style passive health checks https://docs.nginx.com/nginx/admin-guide/load-balancer/http-health-check/

Nelwhix avatar Mar 14 '25 10:03 Nelwhix

Hello @Nelwhix and thanks for this contribution,

After some consideration, we think that the passive health check mechanism should be configured at the load balancer level, in line with the active health checker configuration (see https://doc.traefik.io/traefik/routing/services/#health-check). This means that, once configured, the passive health check will be enabled for all servers.

We also think that it should reuse the current status updater mechanism. This would make the passive health checker responsible for propagating the server status to its parent (the load balancer). It should also be responsible for updating the ServiceInfo, which is needed to update the server status on the UI, as well as the corresponding ServerUpGauge metric (see https://github.com/traefik/traefik/blob/master/pkg/healthcheck/healthcheck.go#L135). This means that nothing needs to be changed in the WRR code, as the passive health checker will be integrated in the same way as the active health checker.

Regarding the passive health check algorithm, having read the documentation (https://docs.nginx.com/nginx/admin-guide/load-balancer/http-health-check/), we propose that, once a server is down, we should wait for the FailureWindow duration to expire before sending another request to the server. This means that, once a server is considered down, the passive health checker will have to clear its failures slice, after which a timer should be set to update the server status to 'up' once the FailureWindow duration has expired.

Does it make sense? Are you ok with updating this pull request in that direction?

kevinpollet avatar Jun 04 '25 09:06 kevinpollet

@kevinpollet Okay Kevin, I will rework

Nelwhix avatar Jun 04 '25 23:06 Nelwhix

Hi @kevinpollet , I was able to implement most of your suggestions. except for the passive health checker to update the ServiceInfo and serverupguage. I tried to pass in the metrics and serviceinfo to the wrr.Balancer and then use the healthchecker to do p.balancer.info.update but that will lead to cyclic import errors. Please could you point me in a better direction?

Nelwhix avatar Jun 22 '25 19:06 Nelwhix

Hello @Nelwhix, and apologies for the delay.

It's a bit difficult to reason through the cyclic import issue without working directly with the code. Would it be okay with you if we push review commits while conducting the pull request review?

kevinpollet avatar Jul 24 '25 07:07 kevinpollet

Okay @kevinpollet

Nelwhix avatar Jul 24 '25 08:07 Nelwhix