KIC fails to start. All pods down: nginx [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
Is there an existing issue for this?
- [X] I have searched the existing issues
Kong version ($ kong version)
3.8.0
Current Behavior
Hello, we run Kong KIC on GKE clusters. Every night the preemptible nodes are reclaimed in our staging envs, and most of the time this takes down all Kong gateway pods (2 replicas) for hours.
versions
- GKE control plane & node pools: 1.30.4-gke.1348000
- kong ingress chart: 0.14.1
- controller: 3.3.1
- gateway: 3.8.0
Additional info
- db-less mode
- using the Gateway API and Gateway resources only (e.g. HTTPRoutes)
- no istio sidecars (they have been removed to try to narrow down the issue)
It seems that the liveness probe responds OK while the readiness probe remains unhealthy, so the gateway pods just stick around, unable to process traffic.
Error logs
ERROR 2024/10/05 00:06:53 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
ERROR nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
[repeats over and over, yet the pod is not killed]
The controller fails to talk to the gateways with
ERROR 2024-10-07T03:37:26.870415241Z [resource.labels.containerName: ingress-controller] Error: could not retrieve Kong admin root(s): making HTTP request: Get "https://10.163.37.7:8444/": dial tcp 10.163.37.7:8444: connect: connection refused
Kong finds itself in some sort of "deadlock" until the pods are deleted manually. Any insights?
Below is the values.yaml file configuring kong
ingress:
  deployment:
    test:
      enabled: false
  controller:
    enabled: true
    proxy:
      nameOverride: "{{ .Release.Name }}-gateway-proxy"
    postgresql:
      enabled: false
    env:
      database: "off"
    deployment:
      kong:
        enabled: false
    ingressController:
      enabled: true
      image:
        repository: kong/kubernetes-ingress-controller
        tag: "3.3.1"
        pullPolicy: IfNotPresent
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          memory: 1G
      ingressClass: kong-green
      env:
        log_format: json
        log_level: error
        ingress_class: kong-green
        gateway_api_controller_name: konghq.com/kong-green
      gatewayDiscovery:
        enabled: true
        generateAdminApiService: true
    podAnnotations:
      sidecar.istio.io/inject: "false"
  gateway:
    enabled: true
    deployment:
      kong:
        enabled: true
    image:
      repository: kong
      tag: "3.8.0"
      pullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 250m
        memory: 500Mi
      limits:
        memory: 2G
    replicaCount: 6
    podAnnotations:
      sidecar.istio.io/inject: "false"
    proxy:
      enabled: true
      type: ClusterIP
      annotations:
        konghq.com/protocol: "https"
        cloud.google.com/neg: '{"exposed_ports": {"80":{"name": "neg-kong-green"}}}'
      http:
        enabled: true
        servicePort: 80
        containerPort: 8000
        parameters: []
      tls:
        enabled: true
        servicePort: 443
        containerPort: 8443
        parameters:
          - http2
        appProtocol: ""
    ingressController:
      enabled: false
    postgresql:
      enabled: false
    env:
      role: traditional
      database: "off"
      proxy_access_log: "off"
      # proxy_error_log: "off"
      proxy_stream_access_log: "off"
      # proxy_stream_error_log: "off"
      admin_access_log: "off"
      # admin_error_log: "off"
      status_access_log: "off"
      # status_error_log: "off"
      log_level: warn
      headers: "off"
      request_debug: "off"
Expected Behavior
Kong gateway pods should either:
- not fail with the error above (`bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)`), or
- at least be able to recover from it by failing the liveness probe (or some other mechanism)
Steps To Reproduce
I could reproduce the error by killing the nodes (kubectl delete nodes) on which the kong pods were running. After killing the nodes, KIC fails to restart as it enters the deadlock situation described above. See screenshot:
Anything else?
dump of a failing gateway pod: kubectl describe:
k -n kong-dbless describe po kong-green-gateway-68f467ff98-qztm5
Name: kong-green-gateway-68f467ff98-qztm5
Namespace: kong-dbless
Priority: 0
Service Account: kong-green-gateway
Node: ---
Start Time: Mon, 07 Oct 2024 13:49:02 +0200
Labels: app=kong-green-gateway
app.kubernetes.io/component=app
app.kubernetes.io/instance=kong-green
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=gateway
app.kubernetes.io/version=3.6
helm.sh/chart=gateway-2.41.1
pod-template-hash=68f467ff98
version=3.6
Annotations: cni.projectcalico.org/containerID: 13864002653403e75b1ddb3ef661b5665f69e3b97c266b5833042f8dc4a4f39b
cni.projectcalico.org/podIP: 10.163.33.135/32
cni.projectcalico.org/podIPs: 10.163.33.135/32
kuma.io/gateway: enabled
kuma.io/service-account-token-volume: kong-green-gateway-token
sidecar.istio.io/inject: false
traffic.sidecar.istio.io/includeInboundPorts:
Status: Running
IP: 10.163.33.135
IPs:
IP: 10.163.33.135
Controlled By: ReplicaSet/kong-green-gateway-68f467ff98
Init Containers:
clear-stale-pid:
Container ID: containerd://ed0b35719cd87e11e849b42f20f1f328b1e2d63612d004b313ba981eda0bd790
Image: kong:3.8.0
Image ID: docker.io/library/kong@sha256:616b2ab5a4c7b6c14022e8a1495ff34930ced76f25f3d418e76758717fec335f
Port: <none>
Host Port: <none>
SeccompProfile: RuntimeDefault
Command:
rm
-vrf
$KONG_PREFIX/pids
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 07 Oct 2024 13:49:20 +0200
Finished: Mon, 07 Oct 2024 13:49:21 +0200
Ready: True
Restart Count: 0
Limits:
memory: 2G
Requests:
cpu: 250m
memory: 500Mi
Environment:
KONG_ADMIN_ACCESS_LOG: /dev/stdout
KONG_ADMIN_ERROR_LOG: /dev/stderr
KONG_ADMIN_GUI_ACCESS_LOG: /dev/stdout
KONG_ADMIN_GUI_ERROR_LOG: /dev/stderr
KONG_ADMIN_LISTEN: 0.0.0.0:8444 http2 ssl, [::]:8444 http2 ssl
KONG_CLUSTER_LISTEN: off
KONG_DATABASE: off
KONG_LUA_PACKAGE_PATH: /opt/?.lua;/opt/?/init.lua;;
KONG_NGINX_WORKER_PROCESSES: 2
KONG_PLUGINS: ---
KONG_PORTAL_API_ACCESS_LOG: /dev/stdout
KONG_PORTAL_API_ERROR_LOG: /dev/stderr
KONG_PORT_MAPS: 80:8000, 443:8443
KONG_PREFIX: /kong_prefix/
KONG_PROXY_ACCESS_LOG: /dev/stdout
KONG_PROXY_ERROR_LOG: /dev/stderr
KONG_PROXY_LISTEN: 0.0.0.0:8000, [::]:8000, 0.0.0.0:8443 http2 ssl, [::]:8443 http2 ssl
KONG_PROXY_STREAM_ACCESS_LOG: /dev/stdout basic
KONG_PROXY_STREAM_ERROR_LOG: /dev/stderr
KONG_ROLE: traditional
KONG_ROUTER_FLAVOR: traditional
KONG_STATUS_ACCESS_LOG: off
KONG_STATUS_ERROR_LOG: /dev/stderr
KONG_STATUS_LISTEN: 0.0.0.0:8100, [::]:8100
KONG_STREAM_LISTEN: off
Mounts:
/kong_prefix/ from kong-green-gateway-prefix-dir (rw)
/opt/kong/plugins/---
/opt/kong/plugins/---
/opt/kong/plugins/---
/opt/kong/plugins/---
/opt/kong/plugins/---
/tmp from kong-green-gateway-tmp (rw)
Containers:
proxy:
Container ID: containerd://0ed944478d25423c08c85146ed1528ae668d128f13bddaf6402990701e2ea3a1
Image: kong:3.8.0
Image ID: docker.io/library/kong@sha256:616b2ab5a4c7b6c14022e8a1495ff34930ced76f25f3d418e76758717fec335f
Ports: 8444/TCP, 8000/TCP, 8443/TCP, 8100/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP
SeccompProfile: RuntimeDefault
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 07 Oct 2024 13:59:39 +0200
Finished: Mon, 07 Oct 2024 13:59:49 +0200
Ready: False
Restart Count: 7
Limits:
memory: 2G
Requests:
cpu: 250m
memory: 500Mi
Liveness: http-get http://:status/status delay=5s timeout=5s period=10s #success=1 #failure=3
Readiness: http-get http://:status/status/ready delay=5s timeout=5s period=10s #success=1 #failure=3
Environment:
KONG_ADMIN_ACCESS_LOG: /dev/stdout
KONG_ADMIN_ERROR_LOG: /dev/stderr
KONG_ADMIN_GUI_ACCESS_LOG: /dev/stdout
KONG_ADMIN_GUI_ERROR_LOG: /dev/stderr
KONG_ADMIN_LISTEN: 0.0.0.0:8444 http2 ssl, [::]:8444 http2 ssl
KONG_CLUSTER_LISTEN: off
KONG_DATABASE: off
KONG_LUA_PACKAGE_PATH: /opt/?.lua;/opt/?/init.lua;;
KONG_NGINX_WORKER_PROCESSES: 2
KONG_PLUGINS: ---
KONG_PORTAL_API_ACCESS_LOG: /dev/stdout
KONG_PORTAL_API_ERROR_LOG: /dev/stderr
KONG_PORT_MAPS: 80:8000, 443:8443
KONG_PREFIX: /kong_prefix/
KONG_PROXY_ACCESS_LOG: /dev/stdout
KONG_PROXY_ERROR_LOG: /dev/stderr
KONG_PROXY_LISTEN: 0.0.0.0:8000, [::]:8000, 0.0.0.0:8443 http2 ssl, [::]:8443 http2 ssl
KONG_PROXY_STREAM_ACCESS_LOG: /dev/stdout basic
KONG_PROXY_STREAM_ERROR_LOG: /dev/stderr
KONG_ROLE: traditional
KONG_ROUTER_FLAVOR: traditional
KONG_STATUS_ACCESS_LOG: off
KONG_STATUS_ERROR_LOG: /dev/stderr
KONG_STATUS_LISTEN: 0.0.0.0:8100, [::]:8100
KONG_STREAM_LISTEN: off
KONG_NGINX_DAEMON: off
Mounts:
/kong_prefix/ from kong-green-gateway-prefix-dir (rw)
/opt/kong/plugins/---
/opt/kong/plugins/---
/opt/kong/plugins/---
/opt/kong/plugins/---
/opt/kong/plugins/---
/tmp from kong-green-gateway-tmp (rw)
Readiness Gates:
Type Status
cloud.google.com/load-balancer-neg-ready True
Conditions:
Type Status
cloud.google.com/load-balancer-neg-ready True
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kong-green-gateway-prefix-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: 256Mi
kong-green-gateway-tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: 1Gi
kong-green-gateway-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 11m kubelet Container proxy failed liveness probe, will be restarted
Warning FailedPreStopHook 10m kubelet PreStopHook failed
Normal Pulled 10m (x2 over 11m) kubelet Container image "kong:3.8.0" already present on machine
Normal Created 10m (x2 over 11m) kubelet Created container proxy
Normal Started 10m (x2 over 11m) kubelet Started container proxy
Warning Unhealthy 10m (x4 over 11m) kubelet Liveness probe failed: Get "http://10.163.33.135:8100/status": dial tcp 10.163.33.135:8100: connect: connection refused
Warning Unhealthy 10m (x9 over 11m) kubelet Readiness probe failed: Get "http://10.163.33.135:8100/status/ready": dial tcp 10.163.33.135:8100: connect: connection refused
Warning BackOff 114s (x26 over 7m19s) kubelet Back-off restarting failed container proxy in pod kong-green-gateway-68f467ff98-qztm5_kong-dbless(ab152a94-7ef0-4de0-b84c-1eb419327b88)
and logs
2024/10/07 12:05:10 [warn] 1#0: the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /kong_prefix/nginx.conf:7
nginx: [warn] the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /kong_prefix/nginx.conf:7
2024/10/07 12:05:14 [notice] 1#0: [lua] init.lua:791: init(): [request-debug] token for request debugging: ccbb05a0-6e76-4cb7-9e5d-346690a3c69f
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: still could not bind()
nginx: [emerg] still could not bind()
FYI: the "deadlock" is removed by restarting the pods manually
kubectl -n kong-dbless delete pods --selector=app.kubernetes.io/instance=kong-green
then kong KIC pods (controller and gateways) restart normally.
It seems your issue has been resolved. Feel free to reopen if you have any further concern.
Thanks @StarlightIbuki for taking this issue. However, your answer doesn't help much. Could you please point me to the resolution? How has this issue been solved? And what's the fix?
Thank you in advance
Sorry I thought you had found the solution. @randmonkey Could you also take a look into this?
hi @randmonkey,
we are getting the issue above multiple times per day and it's getting very frustrating. Do you have any insights to share?
On my side, I've also been searching for solutions, and a closer look at the clear-stale-pid initContainer might reveal a bug: it seems that there are two slashes (//) in the rm command (see screenshot below).
Behaviour-wise, the liveness probe is failing, which only restarts the container, and restarting the container doesn't help. Kong is able to start only when the pod is deleted (manually), which points me towards cleaning up the PID files.
the issue seems to be the same as https://github.com/Kong/kubernetes-ingress-controller/issues/5324
I have the following hypothesis on what is happening:
1. a gateway pod starts (fresh new pod)
2. the controller fails to push the config (due to discovery and potentially DNS issues in GKE)
3. the gateway pod is therefore not healthy
4. the liveness probe fails
5. GKE restarts the container
6. the PID is not cleaned up (or whatever else)
7. the container fails with "address already in use"
8. the container never recovers unless killed manually with a kubectl delete
/kong_prefix/sockets/we is the path of the worker-events socket. The old socket may not get cleared because clear-stale-pid does not touch the path /kong_prefix/sockets/.
> 1. a gateway pod starts (fresh new pod)
> 2. the controller fails to push the config (due to discovery and potentially DNS issues in GKE)
> 3. the gateway pod is therefore not healthy
> 4. the liveness probe fails
> 5. GKE restarts the container
> 6. the PID is not cleaned up (or whatever else)
> 7. the container fails with "address already in use"
> 8. the container never recovers unless killed manually with a kubectl delete
Steps 5-8 would be the possible cause of the issue. As for steps 1-4, KIC failing to push the config will not make the liveness probe fail and then restart the gateway pod.
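For anyone wanting to verify this on a stuck pod, a rough sketch (the pod name is a placeholder, and this assumes the proxy container stays up long enough between crashes to exec into):

```sh
# List whatever is left in the prefix socket directory; a leftover "we" (worker events)
# socket from the previous nginx master is what the bind() error points at.
kubectl -n kong-dbless exec <gateway-pod> -c proxy -- ls -la /kong_prefix/sockets/

# Removing the stale directory lets nginx bind again on the next container restart;
# the docker entrypoint runs `kong prepare`, which recreates it.
kubectl -n kong-dbless exec <gateway-pod> -c proxy -- rm -rf /kong_prefix/sockets
```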
:wave: I think I have some insight on this. In 3.8, we relocated Kong's internal sockets into a subdirectory in the prefix tree (#13409).
There is some code that runs as part of kong start that cleans up dangling sockets that might be left over from an unclean shutdown of Kong.
This logic is unfortunately duplicated in our docker entrypoint script because it circumvents kong start and invokes nginx directly.
The docker entrypoint code was not updated to point to the new socket directory that Kong is using as of 3.8 (an oversight). I've opened a PR to remedy this, which I think should resolve the issue.
For those using the clear-stale-pid init container pattern that I see in some of the comments, it can be updated to remove $KONG_PREFIX/sockets (in addition to $KONG_PREFIX/pids) to mitigate the issue in the meantime.* The docker entrypoint runs kong prepare, so it will recreate the sockets directory as needed.
*In fact, enabling this kind of ops pattern in 3.8 was part of the underlying intent of #13409: establishing more segregation between persistent and transient data so that lifecycle management doesn't require non-trivial amounts of scripting (like what is found in the aforementioned docker entrypoint).
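As a rough sketch of that mitigation (not the chart's stock init container definition; how you wire it in depends on your chart version), the pre-start cleanup amounts to:

```sh
# Clear both the stale PID files and (as of Kong 3.8) the relocated socket directory
# before the proxy container starts; `kong prepare`, run by the docker entrypoint,
# recreates /kong_prefix/sockets as needed.
rm -vrf "$KONG_PREFIX/pids" "$KONG_PREFIX/sockets"
```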
Hello @flrgh, thanks for looking into it. In my KIC helm chart, I added the following:
ingress:
  controller:
    # controller config
  gateway:
    enabled: true
    deployment:
      kong:
        enabled: true
      initContainers:
        - command:
            - rm
            - '-vrf'
            - ${KONG_PREFIX}/sockets
          env:
            - name: KONG_PREFIX
              value: /kong_prefix/
          image: kong:3.8.0
          imagePullPolicy: IfNotPresent
          name: clear-stale-pid-custom
          volumeMounts:
            - mountPath: /kong_prefix/
              name: kong-green-gateway-prefix-dir
When our preemptible node got "restarted" just a few minutes ago, Kong was not able to restart properly and crashed again with the errors:
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] still could not bind()
@joran-fonjallaz that's odd.
My k8s knowledge is minimal, so bear with me a little. If /kong_prefix/sockets still exists and contains sockets when starting the primary Kong container, I feel like these are the probable causes to investigate:
- The Kong container is started before the init container has completed
- The init container was not actually executed
- The init container encountered an error
- The volume that is mounted in the init container (kong-green-gateway-prefix-dir) is not the same as the one that is mounted in the Kong container
- Something else (another container?) is creating socket files in kong-green-gateway-prefix-dir after the init container runs but before the Kong container starts
hello @flrgh,
Not sure about the above, although the list does seem exhaustive. I don't actually have a ton of time to dedicate to this issue at the moment. However, reverting the gateway to 3.7 seems to have solved the issue: we haven't gotten the error above (98: Address already in use) since Friday, whereas it would occur at least once a day with 3.8.
So your feeling that the issue might be linked to 3.8 does seem correct.
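For reference, a downgrade along those lines is just a values override on the gateway image tag. A hedged sketch, assuming the kong/ingress chart is installed directly (release name, namespace, and the exact 3.7 patch tag are placeholders; adjust the value path if the chart is nested as a subchart, as in the values.yaml above):

```sh
helm upgrade kong-green kong/ingress \
  --namespace kong-dbless \
  --reuse-values \
  --set gateway.image.tag=3.7.1
```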
Confirmed! Sometimes the application pod can't start up because of this error (Address already in use). For me it happens often on a tightly packed cluster used for app review. What happens here? In my case the database most often becomes unavailable, and (probably) because of that the proxy container is restarted.
- The mentioned volume kong-green-gateway-prefix-dir, which is mounted at /kong_prefix/, is an EmptyDir volume; it is created at pod startup.
- The init container clear-stale-pid is useless in this case; it does nothing, since this path is already clear by design.
- When the proxy container fails for some reason, it is restarted, but the PID files and sockets stored at /kong_prefix/ remain intact (the init container clear-stale-pid runs only once, at pod initialization).
- ~~Profit!~~ Address already in use.
It's probably not that easy, though; in most cases the proxy container just starts up again. It probably only gets into the broken state after several restarts in a row; I will try to figure it out. You can see logs from the container for a few restarts (there is no way to tell which restart they belong to, except maybe by differences in timestamps): proxy.csv
Have you tried Kong 3.9? There was an update to the entrypoint: https://github.com/Kong/docker-kong/pull/724
It should help! I'll try it
I'm on the latest kubernetes-dashboard (7.10.3), and they use Kong as the HTTP dispatcher.
helm.sh/chart: kong-2.46.0
version: 3.8
This issue was driving me crazy. I'll test 3.9, thanks.
We are facing the same issue. We are using the latest kubernetes-dashboard, which in turn uses Kong. Whenever the cluster nodes are restarted (be it my Windows machine running docker-desktop with single-node k8s, or our 3-node dev k8s cluster running on the latest Ubuntu and containerd), I get this same issue:
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
Once I delete the Kong pod, everything comes back to normal. Would appreciate a fix for this.
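In case it helps others hitting this via kubernetes-dashboard, the manual recovery is the same pod deletion as above (the label selector is an assumption based on the kong chart's standard labels):

```sh
# Delete the dashboard's bundled Kong pod so it starts from a fresh prefix volume.
kubectl -n kubernetes-dashboard delete pod -l app.kubernetes.io/name=kong
```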
Hi @gerardnico, from your comments it looked like you were about to try it, but you didn't mention whether it worked. Thanks for clarifying, and I'm relieved that it works. I'm hopeful that k8s-dashboard will soon release a new version that uses Kong 3.9, which should fix this issue.
We also tried k8s-dashboard with Kong 3.9 and can confirm that the issue doesn't occur anymore.
helm upgrade --install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard \
--create-namespace \
--namespace kubernetes-dashboard \
--set kong.image.repository=kong \
--set kong.image.tag="3.9.0"
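A quick way to confirm the override took effect (the namespace and label selector are assumptions, as above):

```sh
# Print the image used by the dashboard's Kong pod; it should now be kong:3.9.0.
kubectl -n kubernetes-dashboard get pods -l app.kubernetes.io/name=kong \
  -o jsonpath='{.items[*].spec.containers[*].image}'
```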
Have created an issue with k8s-dashboard:
https://github.com/kubernetes/dashboard/issues/9955
Do we have any workaround for version 3.8.0? It is not possible for me to update to 3.9 as of now.
@gerardnico @brokenjacobs @flrgh @baznikin @randmonkey
Hey, do we have any update on this? Do we have any workaround for version 3.8.0? It is not possible for me to update to 3.9 as of now.
@gerardnico @brokenjacobs @flrgh @baznikin @randmonkey @KongGuide
@shashank-altan can you stop bothering us. If you can't find the answer yourself, go find another job.
@shashank-altan I'm sorry to hear that you are having issues. Unfortunately the kind of support that we can provide on the open source project is limited. In general, we can't provide backports for fixes/features to older versions of the gateway. Even on the Enterprise version of the gateway, backports are rare. Very often our suggestion is for the user to do an update. If you are not capable of updating, there's very little we can do to help you.
Please refrain from pinging others repeatedly on issues. Especially if they are community users like yourself. (@flrgh and @randmonkey are Kong employees, the other people that you are pinging are not).
Closing as fixed in 3.9 by https://github.com/Kong/docker-kong/pull/724.