
[bitnami/apisix] failed on restart of container

maipal-c opened this issue

Name and Version

bitnami/apisix:3.5.0

What architecture are you using?

arm64

What steps will reproduce the bug?

1. Install the chart using the values below.
2. Simulate a restart (keep the memory requirement low so the container gets OOMKilled and recreated within the pod).
3. When the container gets recreated (not in a new pod), it fails to start with the following error:

"2024/10/06 13:26:22 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)"

Are you using any custom parameters or values?

Default values with the coraza WASM plugin enabled on the control plane & data plane:

dataPlane:
  extraConfig:
    wasm:
      plugins:
        - name: coraza-filter
          priority: 7999
          file: /tmp/wasm-plugins/coraza-proxy-wasm.wasm
  extraVolumes:
    - name: wasm-plugins
      emptyDir: {}
  extraVolumeMounts:
    - name: wasm-plugins
      mountPath: "/tmp/wasm-plugins"
  initContainers:
    - name: attach-wasm-plugins
      image: busybox
      securityContext:
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsUser: 1001
        runAsGroup: 1001
        runAsNonRoot: true
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        seccompProfile:
          type: RuntimeDefault
      volumeMounts:
        - name: wasm-plugins
          mountPath: "/tmp/wasm-plugins"
      command:
        - "sh"
        - "-c"
        - |
          cd /tmp/wasm-plugins ;
          wget https://github.com/corazawaf/coraza-proxy-wasm/releases/download/0.5.0/coraza-proxy-wasm-0.5.0.zip ;
          unzip coraza-proxy-wasm-0.5.0.zip ;
          rm coraza-proxy-wasm-0.5.0.zip

controlPlane: <same_as_above>

What is the expected behavior?

On recreation of the container, it should start normally.

What do you see instead?

"2024/10/06 13:26:22 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)"

Additional information

The same issue exists on the apisix GitHub repo, here.

Possible fix:

In APISIX's official Helm chart they have a lifecycle hook:

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - "sleep 30"

Maybe we should also use a postStart hook that runs "rm /usr/local/apisix/logs/worker_events.sock", or use the same approach as APISIX.

maipal-c avatar Oct 06 '24 14:10 maipal-c

Thank you for bringing this issue to our attention. We appreciate your involvement! If you're interested in contributing a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.

Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.

carrodher avatar Oct 07 '24 10:10 carrodher

hey @carrodher,

Which approach should I use: preStop "sleep 30", postStart "rm /usr/local/apisix/logs/worker_events.sock", or maybe preStop "rm /usr/local/apisix/logs/worker_events.sock"? I'll be happy to contribute that small code segment.

Currently I am testing preStop "sleep 20" for my installation...

Drop in the preferred approach and I'll submit a PR.

Thank you

maipal-c avatar Oct 07 '24 12:10 maipal-c

Hi @maipal-c, I have a similar problem with my APISIX installation. Reviewing the chart, it already supports lifecycle hooks on the data-plane and on the control-plane via the .Values.dataPlane|controlPlane.lifecycleHooks parameter: "lifecycleHooks for the APISIX container(s) to automate configuration before or after startup". Could that fit your needs?
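
For illustration, a postStart cleanup via that parameter might look like this in the values file (a sketch only; it assumes the lifecycleHooks value is rendered into the container's lifecycle field unchanged):

controlPlane:
  lifecycleHooks:
    postStart:
      exec:
        command:
          - /bin/sh
          - -c
          - rm -f /usr/local/apisix/logs/worker_events.sock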

jorgenll avatar Oct 14 '24 07:10 jorgenll

Experiencing the same issue. After "helm apply", everything seems to be fine. But if, for any reason, the control-plane pod has to restart, it will never be able to come back up again.

kubectl logs of the control-plane pod:

Defaulted container "apisix" out of: apisix, wait-for-etcd (init), prepare-apisix (init)
2024/10/21 07:42:00 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
2024/10/21 07:42:00 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
2024/10/21 07:42:00 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
2024/10/21 07:42:00 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
2024/10/21 07:42:00 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
2024/10/21 07:42:00 [emerg] 1#1: still could not bind()
nginx: [emerg] still could not bind()

bradib0y avatar Oct 21 '24 07:10 bradib0y

There is a potential memory leak, leading up to the crash. Then, the pod is not able to restart properly due to the socket binding issue.

(image: memory usage graph over time)

Measurements:

2024-10-21 15:38:27  138Mi
2024-10-21 15:38:33  138Mi
2024-10-21 15:38:39  138Mi
2024-10-21 15:38:45  138Mi
2024-10-21 15:38:55  138Mi
2024-10-21 15:39:12  138Mi
2024-10-21 15:49:12  145Mi
2024-10-21 16:14:02  169Mi
2024-10-21 16:14:17  169Mi
2024-10-21 18:01:09  178Mi
2024-10-21 18:11:09  179Mi
2024-10-21 18:21:09  182Mi
2024-10-21 18:27:13  185Mi
2024-10-21 18:28:13  185Mi
2024-10-21 18:29:13  186Mi
2024-10-21 18:30:13  186Mi
2024-10-21 18:31:13  187Mi
2024-10-21 18:32:13  188Mi
2024-10-21 18:33:13  188Mi
2024-10-21 18:34:14  189Mi
2024-10-21 18:35:14  189Mi
2024-10-21 18:36:14  190Mi
2024-10-21 18:37:14  190Mi
2024-10-21 18:38:14  190Mi
2024-10-21 18:39:14  191Mi
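
A simple polling loop along these lines (hypothetical, not from the original comment; it assumes metrics-server is installed and the pod runs in the apisix namespace) can produce such measurements:

while true; do
  # kubectl top prints NAME CPU MEMORY; the third column is the memory usage
  mem=$(kubectl top pod -n apisix --no-headers | awk '/control-plane/ {print $3}')
  echo "$(date '+%Y-%m-%d %H:%M:%S') ${mem}"
  sleep 60
done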

bradib0y avatar Oct 21 '24 16:10 bradib0y

A solution that relies on lifecycle hooks did not work for me.

A working solution to get the pod out of the crash loop

I used the controlPlane.command and controlPlane.args properties instead, to modify the container's startup command.

Inside values.yaml for the Bitnami Apisix Helm Chart:

controlPlane:
  command: ["/bin/sh", "-c"]
  args:
    - |
      if [ -e /usr/local/apisix/logs/worker_events.sock ]; then
        echo "Socket file exists. Removing socket file."
        rm -f /usr/local/apisix/logs/worker_events.sock
      fi
      exec openresty -p /usr/local/apisix -g "daemon off;"

Note that this does not prevent the memory leak itself. It only ensures that the container can be restarted, avoiding the situation where it stays in a crash loop.
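
As a side note on this workaround: the exec on the last line replaces the shell with openresty, so the nginx master runs as PID 1 and receives the pod's termination signals directly instead of through an intermediate shell.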

bradib0y avatar Oct 21 '24 23:10 bradib0y

Closing this as the lifecycle hooks are working fine (tested for more than 2 weeks):

postStart:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        sleep 5;
        rm /usr/local/apisix/logs/worker_events.sock

Thank you everyone. Please let me know if there are any fixes for the memory leak issue.

maipal-c avatar Oct 22 '24 14:10 maipal-c

Problem

Using the solution proposed by @maipal-c, the error is still appearing. The Bitnami apisix chart used is v3.5.2.

Working environment

Cluster nodes:
NAME       STATUS                        ROLES           AGE   VERSION
node1   Ready                         control-plane   31d   v1.28.14
node2   Ready                         <none>          31d   v1.28.14
  • Nodes' operating system: Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-200-generic x86_64)
  • Kubernetes version installed: v1.28.14
  • Tutorial followed in order to deploy the cluster: https://www.linuxtechi.com/install-kubernetes-on-ubuntu-22-04/?-utm_content=cmp-true
  • Installation tool: kubeadm
  • Cluster Network Plugin: Calico
  • Installation method:
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.2/manifests/calico.yaml

Helm chart configuration

bitnami/apisix chart used:

name: apisix
condition: apisix.enabled
version: 3.3.9
repository: oci://registry-1.docker.io/bitnamicharts

values.yaml

apisix:
  ...
  controlPlane:
    enabled: true
    lifecycleHooks:
      postStart:
        exec:
          command:
            - /bin/sh
            - -c
            - |
              sleep 5;
              rm /usr/local/apisix/logs/worker_events.sock

The hook is present, as the control-plane deployment contains it:

$ kubectl get deploy apisix-control-plane -n apisix -o yaml
apiVersion: apps/v1
kind: Deployment
...
spec:
  containers:
    - args:
      - -p
      - /usr/local/apisix
      - -g
      - daemon off;
      command:
      - openresty
      image: docker.io/bitnami/apisix:3.11.0-debian-12-r0
      imagePullPolicy: IfNotPresent
      lifecycle:
        postStart:
          exec:
            command:
            - /bin/sh
            - -c
            - |
              sleep 5;
              rm /usr/local/apisix/logs/worker_events.sock
...

Status of the pods after helm chart deployment:

$ kubectl get pod  -n apisix
---
NAME                                         READY   STATUS             RESTARTS      AGE
apisix-control-plane-9588f78df-jhkrh         0/1     CrashLoopBackOff   1 (12s ago)   88s
apisix-dashboard-66b87d67d6-qtvkp            1/1     Running            0             88s
apisix-data-plane-5869c9d7b9-6t787           0/1     Init:0/2           1 (15s ago)   88s
apisix-etcd-0                                1/1     Running            0             88s
apisix-ingress-controller-5bb7556955-kgltn   0/1     Init:0/2           1 (15s ago)   88s

Logs of the crashing control-plane

$ kubectl logs -n apisix -f pod/apisix-control-plane-9588f78df-jhkrh -c wait-for-etcd
curl: (7) Failed to connect to apisix-etcd port 2379 after 1029 ms: Couldn't connect to server
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    45  100    45    0     0  19247      0 --:--:-- --:--:-- --:--:-- 22500
{"etcdserver":"3.5.16","etcdcluster":"3.5.0"}
Connected to http://apisix-etcd:2379
Connection success

$ kubectl logs -n apisix -f pod/apisix-control-plane-9588f78df-jhkrh -c apisix
2024/11/14 09:09:55 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
2024/11/14 09:09:55 [emerg] 1#1: bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/usr/local/apisix/logs/worker_events.sock failed (98: Address already in use)
...

cgonzalezITA avatar Nov 14 '24 09:11 cgonzalezITA

Hey @cgonzalezITA, I know the error is the same, but the reason causing it is not.

In my scenario I was facing this error when k8s restarted the container (usually on OOMKilled). The solution I proposed worked because the previous container left the socket (unix:/usr/local/apisix/logs/worker_events.sock) behind, so the postStart hook is supposed to clean it up if it still exists.

Now in your case the pods and containers were created for the very first time, so there is no way you would have an existing unix socket (unix:/usr/local/apisix/logs/worker_events.sock) left open.

One more thing: I also see the same error on the very first spin-up if I run k8s on AWS EC2 t-series (t4g) instances with the coraza proxy filter enabled. Either removing the coraza proxy filter or switching to other AWS instance types worked well for me.

One thing you can try is to execute "rm /usr/local/apisix/logs/worker_events.sock" using kubectl exec. That will confirm whether we are dealing with different causes of the same error.
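
For example, something along these lines (a hypothetical invocation; namespace and deployment name are taken from the output above, and it assumes the container stays up long enough to exec into):

kubectl exec -n apisix deploy/apisix-control-plane -c apisix -- \
  rm -f /usr/local/apisix/logs/worker_events.sock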

maipal-c avatar Nov 14 '24 11:11 maipal-c

The problem still persists. Perhaps we should add rm /usr/local/apisix/logs/worker_events.sock to the chart? Right now the health check is pretty much useless without extra config.

pio2398 avatar Jun 10 '25 04:06 pio2398

It seems like lifecycle hooks don't solve the problem for me either.

However, the solution proposed by @bradib0y does work.


In my environment, only the data-plane was dying, while the control-plane wasn't.

As proposed by @james-mchugh in this comment, I've used pkill -f -9 apisix to trigger a failure manually.

It should be noted that pkill -f -9 apisix kills both the data-plane and the control-plane. For me, killing both is a bit excessive. Still, it's better to account for this scenario as well.
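
If you only want to exercise one component, a scoped variant along these lines may work (hypothetical; it assumes pkill is available inside the container image):

# kill only the data-plane's openresty master so the kubelet restarts that container in place
kubectl exec -n apisix deploy/apisix-data-plane -c apisix -- pkill -9 -f openresty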


I am working around the issue with Helm values like this:

dataPlane:
  command:
  - bash
  args:
  - '-ec'
  - |-
    #!/bin/bash
    if [ -e /usr/local/apisix/logs/worker_events.sock ]; then
        echo "Socket file exists. Removing ..."
        rm /usr/local/apisix/logs/worker_events.sock
    fi
    openresty -p /usr/local/apisix -g "daemon off;"

controlPlane:
  command:
  - bash
  args:
  - '-ec'
  - |-
    #!/bin/bash
    if [ -e /usr/local/apisix/logs/worker_events.sock ]; then
        echo "Socket file exists. Removing ..."
        rm /usr/local/apisix/logs/worker_events.sock
    fi
    openresty -p /usr/local/apisix -g "daemon off;"

With this workaround in place, the following 3 scenarios seem to be handled well:

  • only the data-plane dying, while the control-plane remains alive
  • only the control-plane dying, while the data-plane remains alive
  • both the data-plane and the control-plane dying

spantaleev avatar Jun 23 '25 10:06 spantaleev

Is there a reason not to put this as part of the main chart?

fredleger avatar Aug 28 '25 06:08 fredleger