gloo icon indicating copy to clipboard operation
gloo copied to clipboard

UpstreamGroup continues to send traffic to unhealthy upstream.

Open AkshayAdsul opened this issue 2 years ago • 12 comments

Gloo Edge Product

Open Source

Gloo Edge Version

V1.13.6 (But encountered this in other versions as well)

Kubernetes Version

v1.25

Describe the bug

We have an UpstreamGroup where healthchecks are configured on Upstream. But UpstreamGroup continues to send traffic to unhealthy upstream.

Expected Behavior

Expectation is that if the upstream is not healthy then upstream group will not route the request through that upstream. But it seems that upstream group is not aware of the upstream health.

Steps to reproduce the bug

  1. Create 2 services eg kubectl create ns echo kubectl -n echo apply -f https://raw.githubusercontent.com/solo-io/workshops/master/gloo-edge/data/echo-service.yaml kubectl -n echo apply -f https://raw.githubusercontent.com/solo-io/workshops/master/gloo-edge/data/echo-v2-service.yaml
  2. Create Upstream Group and Upstream as below
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  labels:
    discovered_by: kubernetesplugin
  name: echo-echo-v2-8080
  namespace: gloo-system
spec:
  discoveryMetadata:
    labels:
      app: echo-v2
  healthChecks:
  - alwaysLogHealthCheckFailures: true
    eventLogPath: /dev/stdout
    healthyThreshold: 3
    httpHealthCheck:
      path: /health
    interval: 10s
    reuseConnection: false
    timeout: 10s
    unhealthyThreshold: 1
  kube:
    selector:
      app: echo-v2
    serviceName: echo-v2
    serviceNamespace: echo
    servicePort: 8080
  loadBalancerConfig:
    healthyPanicThreshold: 0
    ringHash: {}


apiVersion: gloo.solo.io/v1
kind: UpstreamGroup
metadata:
  name: my-service-group
  namespace: gloo-system
spec:
  destinations:
  - destination:
      upstream:
        name: echo-echo-v1-8080
        namespace: gloo-system
    weight: 5
  - destination:
      upstream:
        name: echo-echo-v2-8080
        namespace: gloo-system
    weight: 5
  1. Update Virtual service configuration
 routes:
   - delegateAction:
       ref:
         name: echo-routetable
         namespace: echo
     matchers:
     - headers:
       - name: :authority
         value: echo.solo.io
       prefix: /
  1. Update Route Table config
spec:
  routes:
  - matchers:
    - headers:
      - name: :authority
        value: echo.solo.io
      prefix: /
    routeAction:
      upstreamGroup:
        name: my-service-group
        namespace: gloo-system
  1. Make one of the upstreams unhealthy( eg. scale deployment replica to 1 of echo app)

  2. Send traffic for i in {1..10}; do curl -H "Host: echo.solo.io" $(glooctl proxy url)/ done

  3. Observe No healthy upstream errors on half the requests.

Additional Environment Detail

No response

Additional Context

No response

AkshayAdsul avatar Sep 25 '23 15:09 AkshayAdsul

Could that be a dupe of https://github.com/solo-io/gloo/issues/6647 @AkshayAdsul ?

SantoDE avatar Sep 27 '23 12:09 SantoDE

In this bug we get a no healthy upstream error rather than upstream connect error or disconnect/reset before headers. reset reason: connection termination . So it detects the upstream as unhealthy but UpStreamGroup still continues to send traffic to it.

AkshayAdsul avatar Sep 28 '23 07:09 AkshayAdsul

@AkshayAdsul did that already work in a previous version? Or is that new?

SantoDE avatar Sep 28 '23 14:09 SantoDE

I tried this on 1.13.6 and 1.15.x versions and I saw the same issue. I don't think this ever worked. Kas looked at the code and he mentioned we incorrectly write empty endpoints for the upstream. And it is not an issue when managing subsets in a single upstream. So it's the UpstreamGroup.

AkshayAdsul avatar Sep 29 '23 05:09 AkshayAdsul

I'd guess healthy_panic_threshold: {} might have something to do with this in their env. In your sample steps what happens if you turn off UDS and manually define v1 service upstream with the setting explicitly set as v2 has shown above?

Healthy panic threshold needs to be zero to properly get the no_healthy_upstream respone flag/details (link)

jbohanon avatar Sep 29 '23 11:09 jbohanon

In this bug we get a no healthy upstream error rather than upstream connect error or disconnect/reset before headers. reset reason: connection termination . So it detects the upstream as unhealthy but UpStreamGroup still continues to send traffic to it.

If you're receiving this response details, then Envoy should never have sent the upstream request (link)

jbohanon avatar Sep 29 '23 11:09 jbohanon

When we set healthyPanicThreshold: 0 on an Upstream I see in the config dump the value is set as "healthy_panic_threshold": {} rather than 0. Which means its setting to default value which is 50 and not 0.

AkshayAdsul avatar Oct 03 '23 00:10 AkshayAdsul

I dont know about the healthyPanicThreshold but have a look at the endpoints this creates. I think when multiple upstreams are involved with health checks it doesnt quite manage those endpoints correctly.

This works fine with a single upstream with multiple subsets by the way.

day0ops avatar Oct 04 '23 23:10 day0ops

After more investigation I think we may have a fundamental mismatch in use case here. I believe what we are looking at is Upstreamgroup -> weighted_cluster on routing which explicitly is fine with some requests failing. While what is trying to be done here is to have some load balancing plus not to send to unhealthy desitnations.

I am still confirming what the actual behavior is fully but I think we may have to introduce a new concept or flag to actually provide this type of functionality. For example instead of just having the route action with weights actually have the destinations of those weights be aggregate clusters with the other upstream group members as secondary/tertiary locations.

Disclaimer this is just the result of a quick initial investigation.

nfuden avatar Oct 09 '23 18:10 nfuden

Alright have confirmed the prior statements that Upstream groups dont respect health between the upstream clusters as clusters contain their concept of health and an Upstream group is a route paradigm rather than a cluster level configuration.

@SantoDE with follow up with the currently affected user and we will determine whether there is a configuration that would be acceptable to reconfigure the upstreams to be a single upstream with some load balancing setup (such as by locality based) or if we should approach adding aggregate clusters as a concept in Gloo Edge.

nfuden avatar Oct 10 '23 13:10 nfuden

This is a valuable learning that we should capture somewhere. User-facing docs may be a good place to clarify that this translates to routing and not cluster config.

On Tue, Oct 10, 2023 at 09:20 Nathan Fudenberg @.***> wrote:

Alright have confirmed the prior statements that Upstream groups dont respect health between the upstream clusters as clusters contain their concept of health and an Upstream group is a route paradigm rather than a cluster level configuration.

@SantoDE https://github.com/SantoDE with follow up with the currently affected user and we will determine whether there is a configuration that would be acceptable to reconfigure the upstreams to be a single upstream with some load balancing setup (such as by locality based) or if we should approach adding aggregate clusters as a concept in Gloo Edge.

— Reply to this email directly, view it on GitHub https://github.com/solo-io/gloo/issues/8715#issuecomment-1755416391, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANTAA56OQQVYLALGWGNTIV3X6VDQVAVCNFSM6AAAAAA5GHZUB6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONJVGQYTMMZZGE . You are receiving this because you commented.Message ID: @.***>

jbohanon avatar Oct 10 '23 13:10 jbohanon

Yep thats one of the steps we are taking :P

nfuden avatar Oct 10 '23 13:10 nfuden