[bitnami/apisix] failed on restart of container
Name and Version
bitnami/apisix:3.5.0
What architecture are you using?
arm64
What steps will reproduce the bug?
1. Install the chart using the values below.
2. Simulate a restart (keep the memory requirement low so the container gets OOMKilled and recreated within the pod).
3. When the container gets recreated (not in a new pod), it fails to start with the following error:
"2024/10/06 13:26:22 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)"
Are you using any custom parameters or values?
Default values, with the coraza WASM plugin enabled on the control plane and data plane:
dataPlane:
  extraConfig:
    wasm:
      plugins:
        - name: coraza-filter
          priority: 7999
          file: /tmp/wasm-plugins/coraza-proxy-wasm.wasm
  extraVolumes:
    - name: wasm-plugins
      emptyDir: {}
  extraVolumeMounts:
    - name: wasm-plugins
      mountPath: "/tmp/wasm-plugins"
  initContainers:
    - name: attach-wasm-plugins
      image: busybox
      securityContext:
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsUser: 1001
        runAsGroup: 1001
        runAsNonRoot: true
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        seccompProfile:
          type: RuntimeDefault
      volumeMounts:
        - name: wasm-plugins
          mountPath: "/tmp/wasm-plugins"
      command:
        - "sh"
        - "-c"
        - |
          cd /tmp/wasm-plugins ;
          wget https://github.com/corazawaf/coraza-proxy-wasm/releases/download/0.5.0/coraza-proxy-wasm-0.5.0.zip ;
          unzip coraza-proxy-wasm-0.5.0.zip ;
          rm coraza-proxy-wasm-0.5.0.zip
controlPlane: <same_as_above>
What is the expected behavior?
On recreation of the container, it should start normally.
What do you see instead?
"2024/10/06 13:26:22 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)"
Additional information
The same issue exists on the APISIX GitHub repo, here.
Possible fix:
In APISIX's official Helm chart they use a lifecycle hook:
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - "sleep 30"
Maybe we should also use a postStart hook that runs "rm /usr/local/apisix/logs/worker_events.sock", or use the same approach as the APISIX chart.
Thank you for bringing this issue to our attention. We appreciate your involvement! If you're interested in contributing a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.
Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.
Hey @carrodher,
Which approach should I use: preStop "sleep 30", postStart "rm /usr/local/apisix/logs/worker_events.sock", or preStop "rm /usr/local/apisix/logs/worker_events.sock"? I'll be happy to contribute that small code segment.
Currently I am testing preStop "sleep 20" on my installation...
Drop a note on the preferred approach and I'll submit a PR.
Thank you
Hi @maipal-c,
I have a similar problem with my APISIX installation.
Reviewing the chart, it currently supports adding lifecycle hooks on the data-plane and on the control-plane via the .Values.dataPlane|controlPlane.lifecycleHooks parameter:
lifecycleHooks for the APISIX container(s) to automate configuration before or after startup.
Could it fit your needs?
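For example, the preStop approach from the upstream chart could be wired through those values like this (a sketch, not tested here):
dataPlane:
  lifecycleHooks:
    preStop:
      exec:
        command:
          - /bin/sh
          - -c
          - sleep 30
controlPlane:
  lifecycleHooks:
    preStop:
      exec:
        command:
          - /bin/sh
          - -c
          - sleep 30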
Experiencing the same issue. After helm install, everything seems to be fine. But if, for any reason, the control-plane pod has to restart, it is never able to come back up again.
kubectl logs of the control-plane pod:
Defaulted container "apisix" out of: apisix, wait-for-etcd (init), prepare-apisix (init)
2024/10/21 07:42:00 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
2024/10/21 07:42:00 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
2024/10/21 07:42:00 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
2024/10/21 07:42:00 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
2024/10/21 07:42:00 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
2024/10/21 07:42:00 [emerg] 1#1: still could not bind()
nginx: [emerg] still could not bind()
There appears to be a memory leak leading up to the crash. Then the pod is not able to restart properly due to the socket binding issue.
Measurements:
2024-10-21 15:38:27  138Mi
2024-10-21 15:38:33  138Mi
2024-10-21 15:38:39  138Mi
2024-10-21 15:38:45  138Mi
2024-10-21 15:38:55  138Mi
2024-10-21 15:39:12  138Mi
2024-10-21 15:49:12  145Mi
2024-10-21 16:14:02  169Mi
2024-10-21 16:14:17  169Mi
2024-10-21 18:01:09  178Mi
2024-10-21 18:11:09  179Mi
2024-10-21 18:21:09  182Mi
2024-10-21 18:27:13  185Mi
2024-10-21 18:28:13  185Mi
2024-10-21 18:29:13  186Mi
2024-10-21 18:30:13  186Mi
2024-10-21 18:31:13  187Mi
2024-10-21 18:32:13  188Mi
2024-10-21 18:33:13  188Mi
2024-10-21 18:34:14  189Mi
2024-10-21 18:35:14  189Mi
2024-10-21 18:36:14  190Mi
2024-10-21 18:37:14  190Mi
2024-10-21 18:38:14  190Mi
2024-10-21 18:39:14  191Mi
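(For reference, a timestamped series like the above can be collected with a simple loop; a sketch, assuming metrics-server is installed and using a placeholder pod name:)
while true; do
  echo "$(date '+%F %T') $(kubectl top pod apisix-control-plane-xxxxx -n apisix --no-headers | awk '{print $3}')"
  sleep 60
done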
A solution that relies on lifecycle hooks did not work for me.
A working solution to get the pod out of the crash loop:
I used the controlPlane.command and controlPlane.args properties instead, to modify the container's startup command.
Inside values.yaml for the Bitnami Apisix Helm Chart:
controlPlane:
  command: ["/bin/sh", "-c"]
  args:
    - |
      if [ -e /usr/local/apisix/logs/worker_events.sock ]; then
        echo "Socket file exists. Removing socket file."
        rm -f /usr/local/apisix/logs/worker_events.sock
      fi
      exec openresty -p /usr/local/apisix -g "daemon off;"
Note that this does not prevent the memory leak itself. It only ensures that the container can be restarted and avoids the situation where it stays in the crash loop.
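For what it's worth, I apply the modified values like this (release name and namespace are assumptions; the OCI repository is the one the chart is published to):
helm upgrade --install apisix oci://registry-1.docker.io/bitnamicharts/apisix \
  -n apisix -f values.yaml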
Closing this as lifecycle hooks are working fine (tested for more than 2 weeks):
postStart:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        sleep 5;
        rm /usr/local/apisix/logs/worker_events.sock
Thank you everyone. Please let me know if there are any fixes for the memory leak issue.
Problem
Using the solution proposed by @maipal-c, the error is still appearing. The bitnami/apisix chart version used is 3.5.2:
Working environment
Cluster nodes:
NAME    STATUS   ROLES           AGE   VERSION
node1   Ready    control-plane   31d   v1.28.14
node2   Ready    <none>          31d   v1.28.14
- Nodes' operating system: Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-200-generic x86_64)
- Kubernetes version installed: v1.28.14
- Tutorial followed in order to deploy the cluster: https://www.linuxtechi.com/install-kubernetes-on-ubuntu-22-04/?-utm_content=cmp-true
- Installation tool: kubeadm
- Cluster Network Plugin: Calico
- Installation method:
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.2/manifests/calico.yaml
Helm chart configuration
bitnami/apisix chart used:
  - name: apisix
    condition: apisix.enabled
    version: 3.3.9
    repository: oci://registry-1.docker.io/bitnamicharts
values.yaml
apisix:
  ...
  controlPlane:
    enabled: true
    lifecycleHooks:
      postStart:
        exec:
          command:
            - /bin/sh
            - -c
            - |
              sleep 5;
              rm /usr/local/apisix/logs/worker_events.sock
The hook is present, as the control-plane deployment contains it:
$ kubectl get deploy apisix-control-plane -n apisix -o yaml
apiVersion: apps/v1
kind: Deployment
...
spec:
  containers:
    - args:
        - -p
        - /usr/local/apisix
        - -g
        - daemon off;
      command:
        - openresty
      image: docker.io/bitnami/apisix:3.11.0-debian-12-r0
      imagePullPolicy: IfNotPresent
      lifecycle:
        postStart:
          exec:
            command:
              - /bin/sh
              - -c
              - |
                sleep 5;
                rm /usr/local/apisix/logs/worker_events.sock
...
Status of the pods after helm chart deployment:
$ kubectl get pod -n apisix
---
NAME                                         READY   STATUS             RESTARTS      AGE
apisix-control-plane-9588f78df-jhkrh         0/1     CrashLoopBackOff   1 (12s ago)   88s
apisix-dashboard-66b87d67d6-qtvkp            1/1     Running            0             88s
apisix-data-plane-5869c9d7b9-6t787           0/1     Init:0/2           1 (15s ago)   88s
apisix-etcd-0                                1/1     Running            0             88s
apisix-ingress-controller-5bb7556955-kgltn   0/1     Init:0/2           1 (15s ago)   88s
Logs of the crashing control-plane
$ kubectl logs -n apisix -f pod/apisix-control-plane-9588f78df-jhkrh -c wait-for-etcd
curl: (7) Failed to connect to apisix-etcd port 2379 after 1029 ms: Couldn't connect to server
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 45 100 45 0 0 19247 0 --:--:-- --:--:-- --:--:-- 22500
{"etcdserver":"3.5.16","etcdcluster":"3.5.0"}
Connected to http://apisix-etcd:2379
Connection success
$ kubectl logs -n apisix -f pod/apisix-control-plane-9588f78df-jhkrh -c apisix
2024/11/14 09:09:55 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
2024/11/14 09:09:55 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
...
Hey @cgonzalezITA, I know the error is the same, but the reason causing it is not.
In my scenario I faced this error when Kubernetes restarted the container (usually after an OOM kill). The solution I proposed worked because the previous container left the socket (unix:/usr/local/apisix/logs/worker_events.sock) behind, so the postStart hook is supposed to clean it up if it exists.
In your case the pods and containers were created for the very first time, so there is no way you would have an existing unix socket (unix:/usr/local/apisix/logs/worker_events.sock) left open.
One more thing: I also hit the same error on the very first spin-up when running Kubernetes on AWS EC2 t-series (t4g) instances with the coraza proxy filter enabled. Either removing the coraza proxy filter or switching to other AWS instance types worked well for me.
One thing you can try is executing "rm /usr/local/apisix/logs/worker_events.sock" via kubectl exec; that will confirm whether different root causes are producing the same error.
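A sketch of that check (release name, namespace, and the apisix container name are assumptions; it only works while the container is still running):
kubectl exec -n apisix deploy/apisix-control-plane -c apisix -- \
  rm -f /usr/local/apisix/logs/worker_events.sock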
The problem still persists. Perhaps we should add rm /usr/local/apisix/logs/worker_events.sock to the chart? Right now the health check is pretty much useless without extra configuration.
It seems like lifecycle hooks don't solve the problem for me as well.
However, the solution proposed by @bradib0y does work.
In my environment, only the data-plane was dying, while the control-plane wasn't.
As proposed by @james-mchugh in this comment, I've used pkill -f -9 apisix to trigger a failure manually.
It should be noted that pkill -f -9 apisix kills both the data-plane and the control-plane. For me, killing both is a bit excessive. Still, it's better to account for this scenario as well.
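For completeness, a sketch of how that kill can be issued against a running container (namespace and workload name are assumptions; pkill must be available in the image):
kubectl exec -n apisix deploy/apisix-data-plane -c apisix -- pkill -f -9 apisix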
I am working around the issue with Helm values like this:
dataPlane:
  command:
    - bash
  args:
    - '-ec'
    - |-
      #!/bin/bash
      if [ -e /usr/local/apisix/logs/worker_events.sock ]; then
        echo "Socket file exists. Removing ..."
        rm /usr/local/apisix/logs/worker_events.sock
      fi
      openresty -p /usr/local/apisix -g "daemon off;"
controlPlane:
  command:
    - bash
  args:
    - '-ec'
    - |-
      #!/bin/bash
      if [ -e /usr/local/apisix/logs/worker_events.sock ]; then
        echo "Socket file exists. Removing ..."
        rm /usr/local/apisix/logs/worker_events.sock
      fi
      openresty -p /usr/local/apisix -g "daemon off;"
With this workaround in place, the following 3 scenarios seem to be handled well:
- only the data-plane dying, while the control-plane remains alive
- only the control-plane dying, while the data-plane remains alive
- both the data-plane and the control-plane dying
Is there a reason not to include this as part of the main chart?