UpstreamGroup continues to send traffic to unhealthy upstream.
Gloo Edge Product
Open Source
Gloo Edge Version
v1.13.6 (but encountered this in other versions as well)
Kubernetes Version
v1.25
Describe the bug
We have an UpstreamGroup whose member Upstreams have health checks configured, but the UpstreamGroup continues to send traffic to an unhealthy upstream.
Expected Behavior
The expectation is that if an upstream is unhealthy, the UpstreamGroup will not route requests through it. However, the UpstreamGroup appears to be unaware of upstream health.
Steps to reproduce the bug
- Create 2 services, e.g.:

```bash
kubectl create ns echo
kubectl -n echo apply -f https://raw.githubusercontent.com/solo-io/workshops/master/gloo-edge/data/echo-service.yaml
kubectl -n echo apply -f https://raw.githubusercontent.com/solo-io/workshops/master/gloo-edge/data/echo-v2-service.yaml
```
- Create the Upstream and UpstreamGroup as below:
```yaml
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  labels:
    discovered_by: kubernetesplugin
  name: echo-echo-v2-8080
  namespace: gloo-system
spec:
  discoveryMetadata:
    labels:
      app: echo-v2
  healthChecks:
  - alwaysLogHealthCheckFailures: true
    eventLogPath: /dev/stdout
    healthyThreshold: 3
    httpHealthCheck:
      path: /health
    interval: 10s
    reuseConnection: false
    timeout: 10s
    unhealthyThreshold: 1
  kube:
    selector:
      app: echo-v2
    serviceName: echo-v2
    serviceNamespace: echo
    servicePort: 8080
  loadBalancerConfig:
    healthyPanicThreshold: 0
    ringHash: {}
---
apiVersion: gloo.solo.io/v1
kind: UpstreamGroup
metadata:
  name: my-service-group
  namespace: gloo-system
spec:
  destinations:
  - destination:
      upstream:
        name: echo-echo-v1-8080
        namespace: gloo-system
    weight: 5
  - destination:
      upstream:
        name: echo-echo-v2-8080
        namespace: gloo-system
    weight: 5
```
- Update the Virtual Service configuration:
```yaml
routes:
- delegateAction:
    ref:
      name: echo-routetable
      namespace: echo
  matchers:
  - headers:
    - name: ':authority'
      value: echo.solo.io
    prefix: /
```
- Update the Route Table config:
```yaml
spec:
  routes:
  - matchers:
    - headers:
      - name: ':authority'
        value: echo.solo.io
      prefix: /
    routeAction:
      upstreamGroup:
        name: my-service-group
        namespace: gloo-system
```
- Make one of the upstreams unhealthy (e.g. scale the echo app's deployment to 1 replica).
- Send traffic:

```bash
for i in {1..10}; do curl -H "Host: echo.solo.io" $(glooctl proxy url)/; done
```

- Observe `no healthy upstream` errors on half the requests.
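As a quick check, here is a variant of the loop above that prints only HTTP status codes, which makes the failure pattern easy to see (in this scenario you would expect a mix of 200s and 503s):

```bash
# Print one status code per request; requests routed to the unhealthy
# member come back as 503 with Envoy's "no healthy upstream" local reply.
for i in {1..10}; do
  curl -s -o /dev/null -w "%{http_code}\n" -H "Host: echo.solo.io" "$(glooctl proxy url)/"
done
```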
Additional Environment Detail
No response
Additional Context
No response
Could that be a dupe of https://github.com/solo-io/gloo/issues/6647, @AkshayAdsul?
In this bug we get a `no healthy upstream` error rather than `upstream connect error or disconnect/reset before headers. reset reason: connection termination`. So Envoy detects the upstream as unhealthy, but the UpstreamGroup still continues to send traffic to it.
@AkshayAdsul did that already work in a previous version? Or is that new?
I tried this on 1.13.6 and 1.15.x and saw the same issue; I don't think this ever worked. Kas looked at the code and mentioned that we incorrectly write empty endpoints for the upstream. It is not an issue when managing subsets in a single upstream, so it's specific to the UpstreamGroup.
I'd guess `healthy_panic_threshold: {}` might have something to do with this in their environment. In your sample steps, what happens if you turn off UDS and manually define the v1 service upstream with the setting explicitly set, as shown for v2 above?
Healthy panic threshold needs to be zero to properly get the `no_healthy_upstream` response flag/details (link).
> In this bug we get a `no healthy upstream` error rather than `upstream connect error or disconnect/reset before headers. reset reason: connection termination`. So it detects the upstream as unhealthy but UpstreamGroup still continues to send traffic to it.
If you're receiving these response details, then Envoy should never have sent the upstream request (link).
When we set `healthyPanicThreshold: 0` on an Upstream, I see in the config dump that the value is set as `"healthy_panic_threshold": {}` rather than 0, which means it's using the default value of 50, not 0.
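For illustration, this is roughly the comparison being described (both fragments are sketches rather than verbatim dumps from this environment):

```yaml
# What the Upstream spec requests (Gloo Edge):
loadBalancerConfig:
  healthyPanicThreshold: 0

# What shows up in the Envoy config dump (sketch): an empty Percent
# message instead of an explicit value, so Envoy's 50% default applies.
common_lb_config:
  healthy_panic_threshold: {}
```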
I don't know about the healthyPanicThreshold, but have a look at the endpoints this creates. I think when multiple upstreams with health checks are involved, it doesn't quite manage those endpoints correctly.
This works fine with a single upstream with multiple subsets by the way.
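For comparison, here is a minimal sketch of that single-upstream subset approach (the upstream name, labels, and route fragment are hypothetical; see the Gloo Edge subset-routing docs for the exact fields):

```yaml
# One Upstream covering both versions, subdivided by a `version` label (sketch):
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: echo-8080
  namespace: gloo-system
spec:
  kube:
    serviceName: echo
    serviceNamespace: echo
    servicePort: 8080
    subsetSpec:
      selectors:
      - keys:
        - version
---
# Weighted route across subsets of that single upstream (sketch):
routeAction:
  multi:
    destinations:
    - destination:
        upstream:
          name: echo-8080
          namespace: gloo-system
        subset:
          values:
            version: v1
      weight: 5
    - destination:
        upstream:
          name: echo-8080
          namespace: gloo-system
        subset:
          values:
            version: v2
      weight: 5
```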
After more investigation, I think we may have a fundamental mismatch in use case here. I believe what we are looking at is UpstreamGroup -> weighted_cluster on routing, which is explicitly fine with some requests failing, while what is being attempted here is load balancing that also avoids sending to unhealthy destinations.
I am still confirming what the actual behavior fully is, but I think we may have to introduce a new concept or flag to actually provide this type of functionality. For example, instead of just having the route action with weights, the destinations of those weights could be aggregate clusters with the other upstream group members as secondary/tertiary locations.
Disclaimer: this is just the result of a quick initial investigation.
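To make the aggregate-cluster idea concrete, here is a sketch of what one such destination might look like in raw Envoy config (the cluster names are hypothetical, and Gloo Edge does not generate this today):

```yaml
# Envoy aggregate cluster (sketch): traffic prefers the first listed
# cluster and fails over to the next when it has no healthy endpoints.
clusters:
- name: echo-v2-with-fallback
  connect_timeout: 1s
  lb_policy: CLUSTER_PROVIDED   # required for the aggregate cluster type
  cluster_type:
    name: envoy.clusters.aggregate
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig
      clusters:
      - echo-echo-v2-8080_gloo-system
      - echo-echo-v1-8080_gloo-system
```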
Alright, I have confirmed the prior statements: upstream groups don't respect health between the upstream clusters, because clusters contain their own concept of health and an UpstreamGroup is a routing paradigm rather than a cluster-level configuration.
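In other words, the UpstreamGroup above translates to roughly the following route-level split (a sketch; the actual generated cluster names and fields may differ):

```yaml
# Envoy route fragment generated from the UpstreamGroup (sketch): the
# weighted split picks a cluster before that cluster's health is consulted,
# so a request assigned to an all-unhealthy cluster fails rather than
# failing over to the other member.
route:
  weighted_clusters:
    total_weight: 10
    clusters:
    - name: echo-echo-v1-8080_gloo-system
      weight: 5
    - name: echo-echo-v2-8080_gloo-system
      weight: 5
```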
@SantoDE will follow up with the currently affected user, and we will determine whether an acceptable option is to reconfigure the upstreams into a single upstream with some load-balancing setup (such as locality-based), or whether we should add aggregate clusters as a concept in Gloo Edge.
This is a valuable learning that we should capture somewhere. User-facing docs may be a good place to clarify that this translates to routing and not cluster config.
Yep, that's one of the steps we are taking :P