At some point, the connection between the ingress controller and the proxy sidecar disappears, and HTTPRoute reconciliation fails forever.
Is there an existing issue for this?
- [X] I have searched the existing issues
Kong version ($ kong version)
Kong 3.2.2 and Kong Ingress Controller v2.9.2
Current Behavior
We installed KIC using the Helm chart with the values below.
image:
  repository: reg.navercorp.com/clous-public/kong-ubuntu
  tag: "3.2.2"
ingressController:
  image:
    repository: reg.navercorp.com/clous-public/kong-ingress-controller
    tag: "v2.9.2"
  env:
    ingress_class: clous-api
    gateway_api_controller_name: konghq.com/clous-api-gateway-controller
  ingressClass: clous-api
proxy:
  nameOverride: test-gateway
replicaCount: 30
podAnnotations:
  # TODO: Attach linkerd after investigation
  # linkerd.io/inject: ingress
  # config.linkerd.io/proxy-memory-limit: 2Gi
  # To add the x-forwarded-* header, skip the service mesh for inbound traffic.
  # config.linkerd.io/skip-inbound-ports: 8000,8443
serviceMonitor:
  enabled: true
We use the Gateway API with HTTPRoute resources.
Everything works fine at first, but after a certain time (roughly 1–3 days after the KIC pod is created), HTTPRoute reconciliation fails forever.
Routing for existing HTTPRoutes keeps working, but from that point on, newly created HTTPRoutes cannot be routed; requests to them return this response:
{
  "message": "no Route matched with those values"
}
So we investigated.
- First, the connection to the Kong Admin API (the TLS listener on localhost:8444) had disappeared.
// netstat in kong ingress controller pod.
tcp 0 0 localhost:8444 localhost:53612 ESTABLISHED
tcp 0 0 localhost:53612 localhost:8444 ESTABLISHED
A healthy ingress controller pod shows this connection in netstat. But once the ingress controller breaks (and can no longer reconcile), the connection is gone from the pod.
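As a side note, a quick way to tell "the Admin API listener is down" apart from "the controller simply stopped opening connections" is to probe the listener directly from inside the pod. Here is a minimal Go sketch of such a probe (my own diagnostic, not part of KIC; it assumes the default TLS admin listener on localhost:8444 with a self-signed certificate):

package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	dialer := &net.Dialer{Timeout: 3 * time.Second}
	// Probe the Kong Admin API TLS listener directly. If this handshake
	// succeeds while the controller shows no ESTABLISHED connection to
	// 8444, the listener is fine and the controller side has stalled.
	conn, err := tls.DialWithDialer(dialer, "tcp", "localhost:8444", &tls.Config{
		InsecureSkipVerify: true, // self-signed admin cert; diagnostic only
	})
	if err != nil {
		fmt.Println("admin API not reachable:", err)
		return
	}
	defer conn.Close()
	fmt.Println("admin API reachable, TLS handshake OK")
}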
- Here are the logs from the proxy container and the ingress-controller container.
- The "GET /status HTTP/2.0" access-log entries stopped.
// proxy container
127.0.0.1 - - [22/May/2023:11:39:48 +0900] "GET /status HTTP/2.0" 200 1226 "-" "Go-http-client/2.0"
127.0.0.1 - - [22/May/2023:11:39:51 +0900] "GET /status HTTP/2.0" 200 1226 "-" "Go-http-client/2.0"
127.0.0.1 - - [22/May/2023:11:39:54 +0900] "GET /status HTTP/2.0" 200 1226 "-" "Go-http-client/2.0"
127.0.0.1 - - [22/May/2023:11:39:57 +0900] "GET /status HTTP/2.0" 200 1226 "-" "Go-http-client/2.0"
127.0.0.1 - - [22/May/2023:11:40:00 +0900] "GET /status HTTP/2.0" 200 1226 "-" "Go-http-client/2.0"
2023/05/22 11:40:03 [notice] 1198#0: *92268 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, client: 127.0.0.1, server: kong_admin, request: "POST /config?check_hash=1&flatten_errors=1 HTTP/2.0", host: "localhost:8444"
2023/05/22 11:40:03 [notice] 1198#0: *92268 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, client: 127.0.0.1, server: kong_admin, request: "POST /config?check_hash=1&flatten_errors=1 HTTP/2.0", host: "localhost:8444"
127.0.0.1 - - [22/May/2023:11:40:03 +0900] "POST /config?check_hash=1&flatten_errors=1 HTTP/2.0" 201 94728 "-" "Go-http-client/2.0"
2023/05/22 11:40:03 [notice] 1199#0: *8 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/22 11:40:03 [notice] 1199#0: *8 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/22 11:40:03 [notice] 1198#0: *4 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/22 11:40:03 [notice] 1198#0: *4 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
// ingress controller
time="2023-05-22T11:40:01+09:00" level=info msg="successfully synced configuration to kong" kong_url="https://localhost:8444"
time="2023-05-22T11:40:01+09:00" level=error msg="Reconciler error" error="Operation cannot be fulfilled on httproutes.gateway.networking.k8s.io \"clous-api\": the object has been modified; please apply your changes to the latest version and try again" logger=controllers.HTTPRoute reconcileID="\"4f7db20a-3163-4c26-aac3-1a5c66cecc1e\""
time="2023-05-22T11:40:01+09:00" level=info msg="httproute has been configured on the data-plane" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-ngs-mem-00001404\",\"Name\":\"clous-api\"}" logger=controllers.HTTPRoute name=clous-api namespace=clous-ngs-mem-00001404
time="2023-05-22T11:40:03+09:00" level=error msg="Reconciler error" error="Operation cannot be fulfilled on httproutes.gateway.networking.k8s.io \"clous-api\": the object has been modified; please apply your changes to the latest version and try again" logger=controllers.HTTPRoute reconcileID="\"d904ce6e-26c4-4610-92db-cb2cae6b5463\""
time="2023-05-22T11:40:04+09:00" level=warning msg="no active endpoints" service_name=ras service_namespace=clous-jykim-2305111602 service_port="&ServicePort{Name:ras,Protocol:TCP,Port:80,TargetPort:{1 0 http},NodePort:0,AppProtocol:nil,}"
time="2023-05-22T11:40:04+09:00" level=info msg="no targets could be found for kubernetes service clous-jykim-2305111602/ras" service_name=httproute.clous-jykim-2305111602.clous-api-ras.0
time="2023-05-22T11:40:04+09:00" level=info msg="no targets found to create upstream" service_name=httproute.clous-jykim-2305111602.clous-api-ras.0
time="2023-05-22T11:40:04+09:00" level=warning msg="no active endpoints" service_name=ginad service_namespace=clous-ngs-storage-00001404 service_port="&ServicePort{Name:ginad,Protocol:TCP,Port:80,TargetPort:{1 0 http},NodePort:31837,AppProtocol:nil,}"
time="2023-05-22T11:40:04+09:00" level=info msg="no targets could be found for kubernetes service clous-ngs-storage-00001404/ginad" service_name=httproute.clous-ngs-storage-00001404.clous-api.0
time="2023-05-22T11:40:04+09:00" level=info msg="no targets found to create upstream" service_name=httproute.clous-ngs-storage-00001404.clous-api.0
time="2023-05-22T11:40:04+09:00" level=info msg="successfully synced configuration to kong" kong_url="https://localhost:8444"
time="2023-05-22T11:40:04+09:00" level=info msg="httproute has been configured on the data-plane" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-ngs-storage-00001404\",\"Name\":\"clous-api\"}" logger=controllers.HTTPRoute name=clous-api namespace=clous-ngs-storage-00001404
time="2023-05-22T11:40:05+09:00" level=info msg="httproute has been configured on the data-plane" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-ngs-storage-00001404\",\"Name\":\"clous-api\"}" logger=controllers.HTTPRoute name=clous-api namespace=clous-ngs-storage-00001404
time="2023-05-22T11:40:06+09:00" level=error msg="Reconciler error" error="Operation cannot be fulfilled on httproutes.gateway.networking.k8s.io \"clous-api\": the object has been modified; please apply your changes to the latest version and try again" logger=controllers.HTTPRoute reconcileID="\"7236d421-e56b-49a0-bad4-f782dc7c9efd\""
time="2023-05-22T11:40:08+09:00" level=error msg="Reconciler error" error="Operation cannot be fulfilled on httproutes.gateway.networking.k8s.io \"clous-api\": the object has been modified; please apply your changes to the latest version and try again" logger=controllers.HTTPRoute reconcileID="\"a8c7d6cc-61d9-43c8-b777-483a49314a6e\""
Expected Behavior
The ingress controller reconciles HTTPRoutes consistently.
Steps To Reproduce
- Install the Kong Ingress Controller with Helm.
- Create a GatewayClass and a Gateway.
- Create HTTPRoutes; routing works.
- After 1–3 days, HTTPRoutes suddenly can no longer be reconciled.
Anything else?
Restarting the ingress controller pods fixes it, but the behavior eventually reproduces again.
Here are additional logs, captured with the log level raised to debug.
// proxy container log
127.0.0.1 - - [17/May/2023:01:12:29 +0900] "GET /status HTTP/2.0" 200 1224 "-" "Go-http-client/2.0"
2023/05/17 01:12:32 [notice] 1198#0: *33914 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, client: 127.0.0.1, server: kong_admin, request: "POST /config?check_hash=1&flatten_errors=1 HTTP/2.0", host: "localhost:8444"
2023/05/17 01:12:32 [notice] 1198#0: *33914 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, client: 127.0.0.1, server: kong_admin, request: "POST /config?check_hash=1&flatten_errors=1 HTTP/2.0", host: "localhost:8444"
127.0.0.1 - - [17/May/2023:01:12:32 +0900] "POST /config?check_hash=1&flatten_errors=1 HTTP/2.0" 201 63571 "-" "Go-http-client/2.0"
2023/05/17 01:12:32 [notice] 1199#0: *8 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/17 01:12:32 [notice] 1199#0: *8 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/17 01:12:32 [notice] 1198#0: *4 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/17 01:12:32 [notice] 1198#0: *4 [lua] init.lua:240: purge(): [DB cache] purging (local) cache, context: ngx.timer
2023/05/17 01:12:33 [error] 1198#0: *34149 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 01:43:38 [error] 1199#0: *34788 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 02:43:38 [error] 1198#0: *35832 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 03:43:38 [error] 1199#0: *36869 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 04:43:38 [error] 1198#0: *37833 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 05:43:38 [error] 1199#0: *38983 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 06:43:38 [error] 1198#0: *39841 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 08:43:38 [error] 1198#0: *41917 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 09:43:38 [error] 1198#0: *42962 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 10:43:38 [error] 1198#0: *43966 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 11:43:38 [error] 1199#0: *45193 peer closed connection in SSL handshake, context: ngx.timer
2023/05/17 13:43:38 [error] 1199#0: *47249 peer closed connection in SSL handshake, context: ngx.timer
// ingress-controller log
time="2023-05-17T14:27:32+09:00" level=info msg="processing httproute" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
time="2023-05-17T14:27:32+09:00" level=info msg="checking deletion timestamp" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
time="2023-05-17T14:27:32+09:00" level=info msg="retrieving GatewayClass and Gateway for route" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
time="2023-05-17T14:27:32+09:00" level=info msg="checking if the httproute's gateways are ready" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
time="2023-05-17T14:27:32+09:00" level=info msg="ensuring status contains Gateway associations" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
time="2023-05-17T14:27:32+09:00" level=info msg="ensuring status contains Programmed condition" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
time="2023-05-17T14:27:32+09:00" level=info msg="httproute not configured,requeueing" GatewayV1Beta1HTTPRoute="{\"Namespace\":\"clous-jrpark9-2305171402\",\"Name\":\"clous-search\"}" logger=controllers.HTTPRoute name=clous-search namespace=clous-jrpark9-2305171402
I think this loop has stopped:
https://github.com/Kong/kubernetes-ingress-controller/blob/5d29975d22006311fd5b94e721f2799f2e7b22e6/internal/dataplane/synchronizer.go#L184-L192
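For illustration, that update server is essentially a single goroutine driven by a ticker. Below is a minimal sketch of that shape (hypothetical names and a simplified body, not the actual KIC source): if anything inside the tick handler blocks, no further tick is ever handled, and config pushes stop even though the pod still looks healthy.

package main

import (
	"context"
	"time"
)

type event struct{ kind string }

func applyConfig() {
	// In the real controller this step would POST /config to the Kong Admin API.
}

// Simplified sketch of a ticker-driven sync loop. The whole loop runs on
// one goroutine, so a single blocking send inside the tick handler stalls
// every subsequent tick.
func runSyncLoop(ctx context.Context, tick *time.Ticker, statusCh chan<- event) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-tick.C:
			applyConfig()
			// If statusCh is full and nothing ever receives from it,
			// this send blocks forever and no further tick is handled.
			statusCh <- event{kind: "configured"}
		}
	}
}

func main() {
	statusCh := make(chan event, 8192)
	runSyncLoop(context.Background(), time.NewTicker(3*time.Second), statusCh)
}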
@randmonkey Can you take a look?
We did some more digging and found a reproduction path: running our E2E tests heavily reproduces the phenomenon.
Our E2E tests create a namespace, deploy an HTTPRoute and our app into it, and then delete the namespace, all in very rapid succession. The test can be re-run with the same namespace name and runs in parallel across multiple namespaces. (These E2E tests are part of our CI.)
We haven't yet found where in the code this test breaks the ingress controller, but the controller should not fail no matter how heavily the test is run.
I think the key clue is this log. It is a single entry from the ingress controller container at the moment the problem occurs:
time="2023-05-24T11:17:12+09:00" level=error msg="Reconciler error" error="Operation cannot be fulfilled on httproutes.gateway.networking.k8s.io \"clous-api\": StorageError: invalid object, Code: 4, Key: /registry/gateway.networking.k8s.io/httproutes/clous-ngs-storage-00001415/clous-api, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 4739b314-c874-4953-a024-12d2662fbfdc, UID in object meta: " logger=controllers.HTTPRoute reconcileID="\"6cbb6736-5ca6-43a8-b900-cc4cf977a83f\""
Does anyone have an idea? The empty "UID in object meta" in that StorageError suggests the HTTPRoute had already been deleted while the controller was still trying to update the status of the old object.
I'm not sure, but could this be related? https://github.com/kubernetes-sigs/controller-runtime/issues/1881
This is our destroy-kong.sh; the behavior reproduces near the end of a run.
user@AL01974378 vks % cat destroy-kong.sh
#!/bin/bash

# Function to create and delete HttpRoute
create_and_delete_httproute() {
  for j in {1..100}
  do
    # Define HttpRoute, Service, and Deployment YAML content as a bash variable.
    RESOURCES_YAML=$(cat <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: echo-ns-$1
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: echo-tmp-$1
  namespace: echo-ns-$1
  annotations:
    konghq.com/strip-path: 'true'
spec:
  parentRefs:
  - name: clous-api
    namespace: clous-api
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /echo-tmp-$1
    backendRefs:
    - name: echo
      kind: Service
      port: 1027
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: echo
  name: echo
  namespace: echo-ns-$1
spec:
  ports:
  - port: 1025
    name: tcp
    protocol: TCP
    targetPort: 1025
  - port: 1026
    name: udp
    protocol: TCP
    targetPort: 1026
  - port: 1027
    name: http
    protocol: TCP
    targetPort: 1027
  selector:
    app: echo
EOF
)
    # Create HttpRoute, Service, and Deployment
    echo "Creating resources in echo-ns-$1... $j steps"
    echo "${RESOURCES_YAML}" | kubectl apply -f -

    # Delete Namespace
    echo "Deleting namespace echo-ns-$1... $j steps"
    kubectl delete ns echo-ns-$1
  done
}

# Start 40 parallel tasks to create and delete HttpRoutes, Services, and Deployments
for i in {1..40}
do
  create_and_delete_httproute $i &
done

# Wait for all
wait
I think I found the reason:
https://github.com/parkjeongryul/kubernetes-ingress-controller/commit/e0d7145dc99d93810f7d738482a05af83e68137b
There is a kubernetesObjectStatusQueue (a channel with a buffer size of 8192).
But that queue has only a PUBLISHER (pushing onto the channel) and no SUBSCRIBER (popping from it)... (And why do we need this queue at all?)
So once a given GVK has accumulated 8192 events (creations or deletions), the Kong ingress controller is broken forever, as the sketch below demonstrates.
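To make the failure mode concrete, here is a small self-contained Go sketch (my own illustration, mirroring only the shape of that queue: a buffered channel of capacity 8192 with a publisher and no subscriber). The first 8192 sends return immediately; the next one can never complete, so the publishing goroutine blocks forever:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Stand-in for kubernetesObjectStatusQueue: buffered, never drained.
	queue := make(chan int, 8192)

	for i := 1; ; i++ {
		select {
		case queue <- i:
			// The first 8192 sends fill the buffer and return instantly.
		case <-time.After(2 * time.Second):
			// With no subscriber, send number 8193 can never complete.
			fmt.Printf("send %d blocked: queue full, nobody is receiving\n", i)
			return
		}
	}
}

In the real controller, that blocked goroutine appears to take the sync loop down with it, which matches the symptoms above.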
cc. @randmonkey
@rainest @randmonkey Can you take a look at this issue? If you can confirm that this is a controller issue, I'll try to fix it.
https://github.com/Kong/kubernetes-ingress-controller/issues/4260
Resolved after v2.10.2.