apisix bug: Upstream health check issue - unhealthy upstream is not temporarily excluded

Current Behavior

Same as apache/apisix-ingress-controller#2176.

An unhealthy external service keeps receiving traffic as-is instead of being excluded from the routing targets.

Expected Behavior

Two external services (each with an ALB in front of it) are configured as upstream nodes. Based on the health check configuration, a node should be temporarily excluded from routing when it returns a 5XX error.

Error Logs

No response

Steps to Reproduce

To reproduce the issue, one service is deployed and the other is not (only the ALBs are set up for both).

apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: route
  namespace: apisix
spec:
  http:
    - match:
        hosts:
          - kubernetes.corp.com
        methods:
          - POST
        paths:
          - /svc/*
      name: route
      plugins:
        - config:
            regex_uri:
              - ^\/svc\/(.+)$
              - /$1
          enable: true
          name: proxy-rewrite
      upstreams:
        - name: upstream
          weight: 100

apiVersion: apisix.apache.org/v2
kind: ApisixUpstream
metadata:
  name: upstream
  namespace: apisix
spec:
  externalNodes:
    - name: svc01.corp.com
      port: 443
      type: Domain
      weight: 50
    - name: svc02.corp.com
      port: 443
      type: Domain
      weight: 50
  healthCheck:
    active:
      healthy:
        httpCodes:
          - 200
          - 404
        interval: 3s
        successes: 1
      httpPath: /
      type: https
      unhealthy:
        httpCodes:
          - 500
          - 501
          - 502
          - 503
          - 504
        httpFailures: 1
        tcpFailures: 1
        interval: 3s
        timeouts: 3
    passive:
      healthy:
        httpCodes:
          - 200
          - 404
        successes: 1
      type: https
      unhealthy:
        httpCodes:
          - 500
          - 501
          - 502
          - 503
          - 504
        httpFailures: 1
        tcpFailures: 1
        timeouts: 3
  loadbalancer:
    type: roundrobin
  passHost: node
  scheme: https
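To make the expected behavior concrete, here is a minimal sketch of how the thresholds above would be expected to classify a node. It is not APISIX's implementation: the names NodeState and observe are hypothetical, only the HTTP status counters are modeled, and tcpFailures and timeouts are left out.

```python
from dataclasses import dataclass

# Status code sets copied from the ApisixUpstream healthCheck section above.
HEALTHY_CODES = {200, 404}
UNHEALTHY_CODES = {500, 501, 502, 503, 504}


@dataclass
class NodeState:
    """Hypothetical per-node counters; not an APISIX data structure."""
    healthy: bool = True
    successes: int = 0
    http_failures: int = 0


def observe(state, status, successes_needed=1, http_failures_needed=1):
    """Update the node state with one observed HTTP status and return its health."""
    if status in UNHEALTHY_CODES:
        state.successes = 0
        state.http_failures += 1
        if state.http_failures >= http_failures_needed:
            state.healthy = False
    elif status in HEALTHY_CODES:
        state.http_failures = 0
        state.successes += 1
        if state.successes >= successes_needed:
            state.healthy = True
    # Any other status leaves the counters and the health flag unchanged.
    return state.healthy


if __name__ == "__main__":
    node = NodeState()
    for code in (200, 503, 503, 200):
        print(code, "healthy" if observe(node, code) else "unhealthy")
```

With httpFailures: 1 and successes: 1, a single 503 should be enough to take the node out of rotation under this model, and a single 200 or 404 should bring it back.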

Environment

Runs on an AWS EKS cluster (Kubernetes v1.25). Uses the APISIX Helm chart (1.11.0, App 3.8.0).

Mar 08 '24 07:03 kworkbee

Can you retrieve the health check information?
curl -i http://127.0.0.1:9090/v1/healthcheck

Mar 11 '24 02:03 hanqingwu
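For anyone who wants to capture that output from a script, the following is a small sketch of the same request in Python. It assumes the control API is reachable at the address used in the curl command, and it makes no assumption about the response layout: the JSON is only pretty-printed. The function names are hypothetical.

```python
import json
import urllib.request

# Address taken from the curl command above; adjust if the control API listens elsewhere.
CONTROL_API = "http://127.0.0.1:9090"


def fetch_healthcheck(base_url=CONTROL_API, timeout=5.0):
    """GET /v1/healthcheck and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/healthcheck", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def render(payload):
    """Pretty-print the payload with stable key order so two snapshots can be diffed."""
    return json.dumps(payload, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(fetch_healthcheck()))
```

Taking one snapshot before and one after the backend starts failing makes it easier to see whether the node status ever changes.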

Does this bug exist even if you don't use the ingress controller?

Mar 11 '24 04:03 shreemaan-abhishek

@hanqingwu The node that should be Unhealthy is marked Healthy.

The log shows the following (failed SSL handshake):

2024/03/11 06:46:31 [error] 50#50: *4510567 [lua] healthcheck.lua:1383: log(): [healthcheck] (upstream#/apisix/upstreams/23eb23c7) failed SSL handshake with 'X.X.X.X (X.X.X.X:443)', using server name (sni) 'svc02.corp.com': 19: self-signed certificate in certificate chain, context: ngx.timer, client: X.X.X.X, server: 0.0.0.0:9080

@shreemaan-abhishek The same symptom appears even when the ingress controller is not deployed.

Mar 11 '24 06:03 kworkbee

@kworkbee Please share repro steps for APISIX.

Mar 13 '24 15:03 shreemaan-abhishek

@shreemaan-abhishek This is the setup I am trying to use.

APISIX is installed in the tools cluster with the Helm chart, and the ApisixRoute/ApisixUpstream objects are deployed as written in the description above.

I want traffic to be split 50:50, and when one of the clusters fails, I want its share to shift to the remaining cluster.

However, despite the upstream health check settings, the upstream node that is currently returning 503 is not excluded automatically.

The relevant entries in the APISIX log are as follows.

2024/03/18 11:46:03 [error] 49#49: *14808 [lua] healthcheck.lua:1383: log(): [healthcheck] (upstream#/apisix/upstreams/32eb11c7) failed SSL handshake with 'X.X.X.X (X.X.X.X:443)', using server name (sni) 'svc01.corp.com': 19: self-signed certificate in certificate chain, context: ngx.timer, client: X.X.X.X, server: 0.0.0.0:9080
2024/03/18 11:46:06 [warn] 49#49: *14846 [lua] balancer.lua:82: fetch_health_nodes(): failed to get health check target status, addr: X.X.X.X:443, host: nil, err: target not found, client: X.X.X.X, server: _, request: "POST /feature-flags/flagd.evaluation.v1.Service/ResolveBoolean HTTP/1.1", host: "kubernetes.corp.com"

Mar 18 '24 02:03 kworkbee

Does the "19: self-signed certificate in certificate chain" error matter?

Mar 18 '24 06:03 kworkbee
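One way to look into the question above is to attempt a verified TLS handshake against the same host from outside APISIX. The sketch below uses Python's ssl module with its default trust store, which may differ from the CA bundle APISIX uses, so the result is only indicative. The function name check_tls is hypothetical; the host name comes from the log above.

```python
import socket
import ssl


def check_tls(host, port=443, timeout=5.0):
    """Try a verified TLS handshake and return (ok, detail)."""
    ctx = ssl.create_default_context()  # verifies the chain and the host name
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except ssl.SSLCertVerificationError as exc:
        # verify_code is the OpenSSL verification error number.
        return False, f"{exc.verify_code}: {exc.verify_message}"
    except OSError as exc:
        # Covers refused connections, timeouts, DNS failures and other TLS errors.
        return False, f"connection error: {exc}"


if __name__ == "__main__":
    print(check_tls("svc01.corp.com"))
```

If this also reports verification error 19, the certificate chain presented by the ALB is not trusted by the client's CA bundle, independent of the backend's HTTP status.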

@shreemaan-abhishek Can you please take a look?

Mar 28 '24 04:03 kworkbee