
At some point, the connection between the ingress controller and the proxy sidecar disappears, and HTTPRoute reconciliation fails forever.

Open parkjeongryul opened this issue 2 years ago • 8 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Kong version ($ kong version)

Kong 3.2 and Kong Ingress Controller v2.9.2

Current Behavior

We installed KIC using the Helm chart with values like the following.

          image:
            repository: reg.navercorp.com/clous-public/kong-ubuntu
            tag: "3.2.2"
          ingressController:
            image:
              repository: reg.navercorp.com/clous-public/kong-ingress-controller
              tag: "v2.9.2"
            env:
              ingress_class: clous-api
              gateway_api_controller_name: konghq.com/clous-api-gateway-controller
            ingressClass: clous-api
          proxy:
            nameOverride: test-gateway
          replicaCount: 30
          podAnnotations:
          # TODO: Attach linkerd after investigation 
            # linkerd.io/inject: ingress
            # config.linkerd.io/proxy-memory-limit: 2Gi
            # To add the x-forwarded-* headers, skip the service mesh for inbound traffic.
            # config.linkerd.io/skip-inbound-ports: 8000,8443
          serviceMonitor:
            enabled: true

We use the Gateway API and HTTPRoute.

Everything works fine at first, but after some time (roughly 1~3 days after the KIC pod is created), HTTPRoute reconciliation fails forever.

Routing for existing HTTPRoutes still works, but newly created HTTPRoutes cannot be routed and return this response:

// After that point, only newly created HTTPRoutes get this response.
{
  "message": "no Route matched with those values"
}

So we investigated.

  1. First, the connection to the Kong Admin API (localhost:8444) disappeared.
// netstat in kong ingress controller pod.
tcp        0      0 localhost:8444          localhost:53612         ESTABLISHED 
tcp        0      0 localhost:53612         localhost:8444          ESTABLISHED 

A healthy ingress controller pod shows this connection in netstat, but when the ingress controller is broken (unable to reconcile), the connection is no longer present in the pod.

  2. Here are the logs for the proxy container and the ingress-controller container.
  • The GET /status HTTP/2.0 logs stopped.
// proxy container
127.0.0.1 - - [22/May/2023:11:39:48 +0900] "GET /status HTTP/2.0" 200 1226 "-" "Go-http-client/2.0"
127.0.0.1 - - [22/May/2023:11:39:51 +0900] "GET /status HTTP/2.0" 200 1226 "-" "Go-http-client/2.0"
127.0.0.1 - - [22/May/2023:11:39:54 +0900] "GET /status HTTP/2.0" 200 1226 "-" "Go-http-client/2.0"
127.0.0.1 - - [22/May/2023:11:39:57 +0900] "GET /status HTTP/2.0" 200 1226 "-" "Go-http-client/2.0"
127.0.0.1 - - [22/May/2023:11:40:00 +0900] "GET /status HTTP/2.0" 200 1226 "-" "Go-http-client/2.0"
2023/05/22 11:40:03 [notice] 1198#0: *92268 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, client: 127.0.0.1, server: kong_admin, request: "POST /config?check_hash=1&flatten_errors=1 HTTP/2.0", host: "localhost:8444"
2023/05/22 11:40:03 [notice] 1198#0: *92268 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, client: 127.0.0.1, server: kong_admin, request: "POST /config?check_hash=1&flatten_errors=1 HTTP/2.0", host: "localhost:8444"
127.0.0.1 - - [22/May/2023:11:40:03 +0900] "POST /config?check_hash=1&flatten_errors=1 HTTP/2.0" 201 94728 "-" "Go-http-client/2.0"
2023/05/22 11:40:03 [notice] 1199#0: *8 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/22 11:40:03 [notice] 1199#0: *8 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/22 11:40:03 [notice] 1198#0: *4 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/22 11:40:03 [notice] 1198#0: *4 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
// ingress controller
time="2023-05-22T11:40:01+09:00" level=info msg="successfully synced configuration to kong" kong_url="https://localhost:8444"
time="2023-05-22T11:40:01+09:00" level=error msg="Reconciler error" error="Operation cannot be fulfilled on httproutes.gateway.networking.k8s.io \"clous-api\": the object has been modified; please apply your changes to the latest version and try again" logger=controllers.HTTPRoute reconcileID="\"4f7db20a-3163-4c26-aac3-1a5c66cecc1e\""
time="2023-05-22T11:40:01+09:00" level=info msg="httproute has been configured on the data-plane" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-ngs-mem-00001404\",\"Name\":\"clous-api\"}" logger=controllers.HTTPRoute name=clous-api namespace=clous-ngs-mem-00001404
time="2023-05-22T11:40:03+09:00" level=error msg="Reconciler error" error="Operation cannot be fulfilled on httproutes.gateway.networking.k8s.io \"clous-api\": the object has been modified; please apply your changes to the latest version and try again" logger=controllers.HTTPRoute reconcileID="\"d904ce6e-26c4-4610-92db-cb2cae6b5463\""
time="2023-05-22T11:40:04+09:00" level=warning msg="no active endpoints" service_name=ras service_namespace=clous-jykim-2305111602 service_port="&ServicePort{Name:ras,Protocol:TCP,Port:80,TargetPort:{1 0 http},NodePort:0,AppProtocol:nil,}"
time="2023-05-22T11:40:04+09:00" level=info msg="no targets could be found for kubernetes service clous-jykim-2305111602/ras" service_name=httproute.clous-jykim-2305111602.clous-api-ras.0
time="2023-05-22T11:40:04+09:00" level=info msg="no targets found to create upstream" service_name=httproute.clous-jykim-2305111602.clous-api-ras.0
time="2023-05-22T11:40:04+09:00" level=warning msg="no active endpoints" service_name=ginad service_namespace=clous-ngs-storage-00001404 service_port="&ServicePort{Name:ginad,Protocol:TCP,Port:80,TargetPort:{1 0 http},NodePort:31837,AppProtocol:nil,}"
time="2023-05-22T11:40:04+09:00" level=info msg="no targets could be found for kubernetes service clous-ngs-storage-00001404/ginad" service_name=httproute.clous-ngs-storage-00001404.clous-api.0
time="2023-05-22T11:40:04+09:00" level=info msg="no targets found to create upstream" service_name=httproute.clous-ngs-storage-00001404.clous-api.0
time="2023-05-22T11:40:04+09:00" level=info msg="successfully synced configuration to kong" kong_url="https://localhost:8444"
time="2023-05-22T11:40:04+09:00" level=info msg="httproute has been configured on the data-plane" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-ngs-storage-00001404\",\"Name\":\"clous-api\"}" logger=controllers.HTTPRoute name=clous-api namespace=clous-ngs-storage-00001404
time="2023-05-22T11:40:05+09:00" level=info msg="httproute has been configured on the data-plane" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-ngs-storage-00001404\",\"Name\":\"clous-api\"}" logger=controllers.HTTPRoute name=clous-api namespace=clous-ngs-storage-00001404
time="2023-05-22T11:40:06+09:00" level=error msg="Reconciler error" error="Operation cannot be fulfilled on httproutes.gateway.networking.k8s.io \"clous-api\": the object has been modified; please apply your changes to the latest version and try again" logger=controllers.HTTPRoute reconcileID="\"7236d421-e56b-49a0-bad4-f782dc7c9efd\""
time="2023-05-22T11:40:08+09:00" level=error msg="Reconciler error" error="Operation cannot be fulfilled on httproutes.gateway.networking.k8s.io \"clous-api\": the object has been modified; please apply your changes to the latest version and try again" logger=controllers.HTTPRoute reconcileID="\"a8c7d6cc-61d9-43c8-b777-483a49314a6e\""

Expected Behavior

The ingress controller reconciles HTTPRoutes consistently.

Steps To Reproduce

  1. Install the Kong ingress controller with Helm.
  2. Create a GatewayClass and a Gateway.
  3. Create HTTPRoutes; they work.
  4. After 1~3 days, HTTPRoutes suddenly can no longer be reconciled.

Anything else?

We can work around it by restarting the ingress controller pods, but the behavior eventually reproduces.

parkjeongryul avatar May 22 '23 12:05 parkjeongryul

Here are additional logs, with the log level changed to debug.

// proxy container log

127.0.0.1 - - [17/May/2023:01:12:29 +0900] "GET /status HTTP/2.0" 200 1224 "-" "Go-http-client/2.0"
2023/05/17 01:12:32 [notice] 1198#0: *33914 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, client: 127.0.0.1, server: kong_admin, request: "POST /config?check_hash=1&flatten_errors=1 HTTP/2.0", host: "localhost:8444"
2023/05/17 01:12:32 [notice] 1198#0: *33914 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, client: 127.0.0.1, server: kong_admin, request: "POST /config?check_hash=1&flatten_errors=1 HTTP/2.0", host: "localhost:8444"
127.0.0.1 - - [17/May/2023:01:12:32 +0900] "POST /config?check_hash=1&flatten_errors=1 HTTP/2.0" 201 63571 "-" "Go-http-client/2.0"
2023/05/17 01:12:32 [notice] 1199#0: *8 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/17 01:12:32 [notice] 1199#0: *8 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/17 01:12:32 [notice] 1198#0: *4 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/17 01:12:32 [notice] 1198#0: *4 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/17 01:12:33 [error] 1198#0: *34149 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 01:43:38 [error] 1199#0: *34788 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 02:43:38 [error] 1198#0: *35832 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 03:43:38 [error] 1199#0: *36869 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 04:43:38 [error] 1198#0: *37833 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 05:43:38 [error] 1199#0: *38983 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 06:43:38 [error] 1198#0: *39841 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 08:43:38 [error] 1198#0: *41917 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 09:43:38 [error] 1198#0: *42962 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 10:43:38 [error] 1198#0: *43966 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 11:43:38 [error] 1199#0: *45193 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 13:43:38 [error] 1199#0: *47249 peer closed connection in SSL handshake, context: ngx.timer

// ingress-controller log

time="2023-05-17T14:27:32+09:00" level=info msg="processing httproute" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
time="2023-05-17T14:27:32+09:00" level=info msg="checking deletion timestamp" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
time="2023-05-17T14:27:32+09:00" level=info msg="retrieving GatewayClass and Gateway for route" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
time="2023-05-17T14:27:32+09:00" level=info msg="checking if the httproute's gateways are ready" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
time="2023-05-17T14:27:32+09:00" level=info msg="ensuring status contains Gateway associations" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
time="2023-05-17T14:27:32+09:00" level=info msg="ensuring status contains Programmed condition" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
time="2023-05-17T14:27:32+09:00" level=info msg="httproute not configured,requeueing" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402

parkjeongryul avatar May 22 '23 12:05 parkjeongryul

I think this loop has stopped.

https://github.com/Kong/kubernetes-ingress-controller/blob/5d29975d22006311fd5b94e721f2799f2e7b22e6/internal/dataplane/synchronizer.go#L184-L192
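
For context, a minimal sketch of what a ticker-driven sync loop like that generally looks like (illustrative only, not the actual KIC synchronizer; all names below are made up): if a single iteration blocks and never returns, no further ticks are handled and configuration stops being pushed to the proxy, which matches the observed behavior.

// Illustrative sketch only -- not KIC's actual synchronizer code.
package synchronizer

import (
	"context"
	"time"
)

// runSyncLoop pushes configuration to the proxy on every tick. If a single
// call to syncOnce blocks forever (for example on a channel that nothing ever
// drains), no further ticks are processed and reconciliation looks frozen.
func runSyncLoop(ctx context.Context, interval time.Duration, syncOnce func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = syncOnce(ctx)
		}
	}
}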

parkjeongryul avatar May 23 '23 00:05 parkjeongryul

@randmonkey Can you help take a look?

oowl avatar May 24 '23 08:05 oowl

We did some more digging and found a reproduction path: while running our E2E tests heavily, the phenomenon reproduced.

Our E2E tests create a namespace, deploy an HTTPRoute and our app into it, and then delete the namespace, all very rapidly. The same namespace name can be reused for another run, and runs can happen in parallel across multiple namespaces. (These E2E tests are part of our CI.)

We haven't found where in the code this test breaks the ingress controller, but the ingress controller should not fail even when this test is run heavily.

I think the key clue is this log line. It is the single log entry from the ingress controller container at the moment the problem occurs.

time="2023-05-24T11:17:12+09:00" level=error msg="Reconciler error" error="Operation cannot be fulfilled on httproutes.gateway.networking.k8s.io \"clous-api\": StorageError: invalid object, Code: 4, Key: /registry/gateway.networking.k8s.io/httproutes/clous-ngs-storage-00001415/clous-api, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 4739b314-c874-4953-a024-12d2662fbfdc, UID in object meta: " logger=controllers.HTTPRoute reconcileID="\"6cbb6736-5ca6-43a8-b900-cc4cf977a83f\""

Does anyone have any ideas?

parkjeongryul avatar May 24 '23 09:05 parkjeongryul

I'm not sure, but could this be related? https://github.com/kubernetes-sigs/controller-runtime/issues/1881

parkjeongryul avatar May 26 '23 04:05 parkjeongryul

This is our destroy-kong.sh. The behavior reproduces near the end of the script.

user@AL01974378 vks % cat destroy-kong.sh 
#!/bin/bash

# Function to create and delete HttpRoute
create_and_delete_httproute() {
  for j in {1..100}
  do
    # Define the Namespace, HTTPRoute, and Service YAML content as a bash variable.
    RESOURCES_YAML=$(cat <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: echo-ns-$1
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: echo-tmp-$1
  namespace: echo-ns-$1
  annotations:
    konghq.com/strip-path: 'true'
spec:
  parentRefs:
  - name: clous-api
    namespace: clous-api
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /echo-tmp-$1
    backendRefs:
    - name: echo
      kind: Service
      port: 1027
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: echo
  name: echo
  namespace: echo-ns-$1
spec:
  ports:
  - port: 1025
    name: tcp
    protocol: TCP
    targetPort: 1025
  - port: 1026
    name: udp
    protocol: TCP
    targetPort: 1026
  - port: 1027
    name: http
    protocol: TCP
    targetPort: 1027
  selector:
    app: echo
EOF
    )

    # Create the Namespace, HTTPRoute, and Service
    echo "Creating resources in echo-ns-$1... $j steps"
    echo "${RESOURCES_YAML}" | kubectl apply -f -

    # Delete Namespace
    echo "Deleting namespace echo-ns-$1... $j steps"
    kubectl delete ns echo-ns-$1
  done
}

# Start 40 parallel tasks that create and delete the namespaces, HTTPRoutes, and Services
for i in {1..40}
do
  create_and_delete_httproute $i &
done

# Wait for all
wait

parkjeongryul avatar Jun 01 '23 17:06 parkjeongryul

I think I found the reason:

https://github.com/parkjeongryul/kubernetes-ingress-controller/commit/e0d7145dc99d93810f7d738482a05af83e68137b

There is a kubernetesObjectStatusQueue (with a buffer size of 8192).

But there is only a PUBLISHER (pushing to the channel) for that queue and no SUBSCRIBER (popping from the channel)... (And why do we need this queue at all?)

So it seems that once 8192 events (creations or deletions) have accumulated for a given GVK, the Kong ingress controller is broken forever.
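
A minimal sketch of the failure mode described above (illustrative only; the type and variable names are stand-ins, not the actual KIC implementation): a buffered channel with a publisher but no subscriber accepts 8192 sends and then blocks the publishing goroutine on the next send, wedging whatever loop performs the publish.

// Illustrative sketch only: a buffered channel that is written to but never
// read from fills up, and the next blocking send never returns.
package main

import "fmt"

type statusEvent struct {
	kind, namespace, name string
}

func main() {
	// Publisher-only queue: nothing ever receives from it.
	queue := make(chan statusEvent, 8192)

	// The first 8192 sends fit in the buffer.
	for i := 0; i < cap(queue); i++ {
		queue <- statusEvent{kind: "HTTPRoute", name: fmt.Sprintf("route-%d", i)}
	}

	// The 8193rd send would block forever; a non-blocking send shows why.
	select {
	case queue <- statusEvent{kind: "HTTPRoute", name: "one-more"}:
		fmt.Println("sent")
	default:
		fmt.Println("queue full: a blocking send here would wedge the publishing goroutine forever")
	}
}

If that blocked send happens inside the synchronizer's update path, it would explain why the loop referenced earlier stops making progress.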

cc. @randmonkey

parkjeongryul avatar Jun 01 '23 17:06 parkjeongryul

@rainest @randmonkey Can you take a look at this issue? If you can confirm that this is a controller issue, I'll try to fix it.

parkjeongryul avatar Jun 21 '23 07:06 parkjeongryul

https://github.com/Kong/kubernetes-ingress-controller/issues/4260

Resolved after v2.10.2.

parkjeongryul avatar Jul 20 '23 05:07 parkjeongryul