bug: Upstream health check issue - unhealthy upstream node is not temporarily excluded
Current Behavior
Same as apache/apisix-ingress-controller#2176.
Traffic is still routed to the unhealthy external service; it is not excluded from the routing targets.
Expected Behavior
Two external services (each behind an ALB) are configured as upstream nodes. With the health check configuration, a node should be temporarily excluded from routing when it returns a 5XX error.
Error Logs
No response
Steps to Reproduce
To reproduce the issue, deploy one service and leave the other one undeployed (only the ALBs are set up).
apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: route
  namespace: apisix
spec:
  http:
    - match:
        hosts:
          - kubernetes.corp.com
        methods:
          - POST
        paths:
          - /svc/*
      name: route
      plugins:
        - config:
            regex_uri:
              - ^\/svc\/(.+)$
              - /$1
          enable: true
          name: proxy-rewrite
      upstreams:
        - name: upstream
          weight: 100
---
apiVersion: apisix.apache.org/v2
kind: ApisixUpstream
metadata:
  name: upstream
  namespace: apisix
spec:
  externalNodes:
    - name: svc01.corp.com
      port: 443
      type: Domain
      weight: 50
    - name: svc02.corp.com
      port: 443
      type: Domain
      weight: 50
  healthCheck:
    active:
      healthy:
        httpCodes:
          - 200
          - 404
        interval: 3s
        successes: 1
      httpPath: /
      type: https
      unhealthy:
        httpCodes:
          - 500
          - 501
          - 502
          - 503
          - 504
        httpFailures: 1
        tcpFailures: 1
        interval: 3s
        timeouts: 3
    passive:
      healthy:
        httpCodes:
          - 200
          - 404
        successes: 1
      type: https
      unhealthy:
        httpCodes:
          - 500
          - 501
          - 502
          - 503
          - 504
        httpFailures: 1
        tcpFailures: 1
        timeouts: 3
  loadbalancer:
    type: roundrobin
  passHost: node
  scheme: https
Environment
Runs on an AWS EKS cluster (Kubernetes v1.25). Uses the APISIX Helm chart (chart 1.11.0, app 3.8.0).
Can you retrieve the health check information?
curl -i http://127.0.0.1:9090/v1/healthcheck
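The output of that endpoint is easier to compare against the expected state once it is flattened to one row per node. A minimal Python sketch, assuming the control API returns a JSON list of checkers, each with a `name` and a `nodes` list whose items carry `ip`, `port` and `status`; those field names and the helper names `summarize` and `fetch` are assumptions, not something confirmed in this thread.

```python
import json
import urllib.request

# Control API address taken from the curl command above.
CONTROL_API = "http://127.0.0.1:9090/v1/healthcheck"


def summarize(checkers):
    """Flatten the health check payload into (checker, address, status) rows.

    The field names ("name", "nodes", "ip", "port", "status") are an
    assumption about the response shape, so every lookup is defensive.
    """
    rows = []
    for checker in checkers:
        for node in checker.get("nodes", []):
            addr = f"{node.get('ip', '?')}:{node.get('port', '?')}"
            rows.append((checker.get("name", "?"), addr, node.get("status", "unknown")))
    return rows


def fetch(url=CONTROL_API, timeout=5):
    """Fetch and decode the control API response (run it where the API is reachable)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `summarize(fetch())` from inside the APISIX pod would list each checked node with its reported status, which makes it quick to see whether the node behind the undeployed service is reported as healthy.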
Does this bug exist even if you don't use the ingress controller?
@hanqingwu The node that should be Unhealthy is marked Healthy.
The log shows the following (failed SSL handshake):
2024/03/11 06:46:31 [error] 50#50: *4510567 [lua] healthcheck.lua:1383: log(): [healthcheck] (upstream#/apisix/upstreams/23eb23c7) failed SSL handshake with 'X.X.X.X (X.X.X.X:443)', using server name (sni) 'svc02.corp.com': 19: self-signed certificate in certificate chain, context: ngx.timer, client: X.X.X.X, server: 0.0.0.0:9080
@shreemaan-abhishek The same symptom appears even when the ingress controller is not deployed.
@kworkbee Please share the repro steps for APISIX.
@shreemaan-abhishek This is the setup I would like to use.

APISIX is installed in the tools cluster with the Helm chart, and the ApisixRoute/ApisixUpstream objects are deployed as written in the description above.
I want to route traffic 50:50 and, when one of the clusters fails, shift its weight to the remaining cluster.
However, despite the upstream health check settings, the upstream node that is currently returning 503 is not excluded automatically.
The relevant parts of the APISIX log are as follows.
2024/03/18 11:46:03 [error] 49#49: *14808 [lua] healthcheck.lua:1383: log(): [healthcheck] (upstream#/apisix/upstreams/32eb11c7) failed SSL handshake with 'X.X.X.X (X.X.X.X:443)', using server name (sni) 'svc01.corp.com': 19: self-signed certificate in certificate chain, context: ngx.timer, client: X.X.X.X, server: 0.0.0.0:9080
2024/03/18 11:46:06 [warn] 49#49: *14846 [lua] balancer.lua:82: fetch_health_nodes(): failed to get health check target status, addr: X.X.X.X:443, host: nil, err: target not found, client: X.X.X.X, server: _, request: "POST /feature-flags/flagd.evaluation.v1.Service/ResolveBoolean HTTP/1.1", host: "kubernetes.corp.com"
Does the "19: self-signed certificate in certificate chain" error matter?
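One way to find out is to repeat the handshake with and without certificate verification from a host that resolves the upstream domains the same way the APISIX pod does. A minimal sketch using only the Python standard library; the helper names `make_context` and `handshake` are hypothetical, and the default trust store of the machine running it may differ from the one APISIX uses.

```python
import socket
import ssl


def make_context(verify=True):
    """Build a client TLS context, with or without certificate verification."""
    ctx = ssl.create_default_context()
    if not verify:
        # check_hostname must be disabled before verify_mode can be relaxed.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def handshake(host, port=443, verify=True, timeout=5):
    """Return None if the TLS handshake succeeds, else the error text."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with make_context(verify).wrap_socket(sock, server_hostname=host):
                return None
    except (ssl.SSLError, OSError) as exc:
        return str(exc)
```

If `handshake("svc01.corp.com")` returns a certificate error while `handshake("svc01.corp.com", verify=False)` returns `None`, the active probe is likely failing at the TLS layer before it ever sees an HTTP status code. In that case the certificate verification setting of the active check (`https_verify_certificate` in APISIX's upstream `checks.active` schema; whether it is the cause here is an assumption) would be worth reviewing.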
@shreemaan-abhishek Can you please take a look?