
help request: Apisix ETCD going into Crash loop back off

Open Lakshmi2k1 opened this issue 1 year ago • 39 comments

Description

Hello, I have deployed the APISIX 2.7.0 Helm chart, and two of the three etcd pods are going into CrashLoopBackOff, which affects the Ingresses created for other deployments.


The logs show the following details,

Master (the etcd pod in the Running state): "msg":"rejected stream from remote peer because it was removed","local-member-id"

Other pods (the etcd pods in CrashLoopBackOff): "failed to publish local member to cluster through raft","local-member-id":"2c16fb63879f0d98","local-member-attributes":"{Name:apisix-etcd-1 ClientURLs:[http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2379/ http://apisix-etcd.apisix.svc.cluster.local:2379]}","request-path":"/0/members/2c16fb63879f0d98/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"

Currently stuck on this; let me know if anyone has faced this and has a fix for it.

Environment

  • APISIX version (run apisix version): 2.7.0

Lakshmi2k1 avatar Jun 06 '24 05:06 Lakshmi2k1

Do you have a strong requirement to use 2.7.0? I'm using the latest and pods are starting normally.

kayx23 avatar Jun 06 '24 08:06 kayx23

I think you need to try etcdctl member list first. This will help you verify whether the member IDs of the crashing pods match the IDs reported by etcdctl.

flearc avatar Jun 06 '24 08:06 flearc

Do you have a strong requirement to use 2.7.0? I'm using the latest and pods are starting normally.

The most recent version is 2.8.0 (released on Jun 04, 2024), so I was using the version one prior to that, which was released in April. May I know which version of the Helm chart you're using?

Lakshmi2k1 avatar Jun 07 '24 04:06 Lakshmi2k1

I think you need to try etcdctl member list first. This will help you verify whether the member IDs of the crashing pods match the IDs reported by etcdctl.

If it were the cluster's own etcd, we would log into the node and execute the commands. Since here it is running as a pod, I'm not sure where to execute the etcdctl commands, and because the pods are in CrashLoopBackOff, I can't even exec into them.

Lakshmi2k1 avatar Jun 07 '24 04:06 Lakshmi2k1

There is one etcd pod still running; run etcdctl member list after exec-ing into that pod. Also check the logs of the crashed etcd pods; they normally show the member ID they used.

BTW, I think it's more likely an etcd problem.
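Assuming the pod and namespace names seen in the logs above, the check (and one common, destructive recovery for a member the cluster has removed) can be sketched like this:

```shell
# Exec into the one healthy etcd pod (name assumed from the logs above)
kubectl -n apisix exec -it apisix-etcd-2 -- etcdctl member list

# If a crashing pod's local-member-id is NOT in that list, its on-disk data
# is stale (the cluster removed it). One common recovery, which discards that
# replica's data so it rejoins fresh, e.g. for apisix-etcd-1:
kubectl -n apisix delete pvc data-apisix-etcd-1
kubectl -n apisix delete pod apisix-etcd-1
```

The PVC name `data-apisix-etcd-1` follows the usual StatefulSet convention but may differ per chart version; verify with `kubectl -n apisix get pvc` first.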

flearc avatar Jun 07 '24 05:06 flearc

There is one etcd pod still running; run etcdctl member list after exec-ing into that pod. Also check the logs of the crashed etcd pods; they normally show the member ID they used.

BTW, I think it's more likely an etcd problem.

Hello, I have tried it; this is what I got:

I have no name!@apisix-etcd-2:/opt/bitnami/etcd$ etcdctl member list
3ff1b5cd453a87df, started, apisix-etcd-2, http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380, http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2379,http://apisix-etcd.apisix.svc.cluster.local:2379, false

And this is the member ID I found in the logs of the crashing pods: [local-member-id":"2c16fb63879f0d98"]. I also tried disabling the chart's etcd and using an external etcd, but it was not able to integrate with the running etcd pod. I'm trying to fix that; please share anything you know that could help.
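The comparison above can be scripted. A minimal sketch, using the sample member list from the running pod and the member ID from the crashing pod's logs (both copied from this thread):

```shell
#!/bin/sh
# Save the `etcdctl member list` output from the healthy pod (sample from above)
cat > /tmp/members.txt <<'EOF'
3ff1b5cd453a87df, started, apisix-etcd-2, http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380, http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2379,http://apisix-etcd.apisix.svc.cluster.local:2379, false
EOF

# local-member-id reported by the crashing pod's logs
CRASHING_ID=2c16fb63879f0d98

# If the ID is absent from the member list, the crashing pod's data dir is
# stale: the cluster has removed that member, which matches the leader's
# "rejected stream from remote peer because it was removed" message.
if grep -q "^${CRASHING_ID}," /tmp/members.txt; then
  echo "member ${CRASHING_ID} is known to the cluster"
else
  echo "member ${CRASHING_ID} is NOT in the cluster: stale local state"
fi
```

Here the crashing pod's ID (2c16fb63879f0d98) does not match the listed member (3ff1b5cd453a87df), so the script reports stale local state, which is consistent with the leader's log message.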

Lakshmi2k1 avatar Jun 07 '24 11:06 Lakshmi2k1

Any solutions for this issue? I am also facing the same issue for the last 3 days.

Thilip707 avatar Jun 10 '24 06:06 Thilip707

Any solutions for this issue? I am also facing the same issue for the last 3 days.

I changed the etcd version in Chart.yaml to "10.1.0", and now all pods are in the Running state. I'm checking a few things in the UI to make sure everything works fine. If you are using the Helm chart to deploy APISIX, try this.
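For reference, a sketch of that change, assuming the downloaded apisix chart declares etcd as a Bitnami dependency in its Chart.yaml (field values here are illustrative and may differ between chart versions):

```yaml
# Chart.yaml of the downloaded apisix chart: pin the etcd sub-chart version
dependencies:
  - name: etcd
    version: 10.1.0
    repository: https://charts.bitnami.com/bitnami
    condition: etcd.enabled
```

After editing, run `helm dependency update` in the chart directory before redeploying so the pinned sub-chart is actually fetched.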

Lakshmi2k1 avatar Jun 10 '24 06:06 Lakshmi2k1

Thanks, will try and update, ma'am.

Thilip707 avatar Jun 10 '24 06:06 Thilip707

APISIX is working fine after upgrading the version of etcd in Chart.yaml to "10.1.0", so I'm closing this issue.

Lakshmi2k1 avatar Jun 11 '24 11:06 Lakshmi2k1

Hi @Lakshmi2k1, are you still facing the same issue? I need some suggestions on it. We have upgraded to 10.2.6 but are still facing the same issue.

sudhir649 avatar Jul 15 '24 16:07 sudhir649

Still having the same issue. We downloaded and added the entire chart dir, setting the etcd version in Chart.yaml to "10.1.0" as suggested by @Lakshmi2k1.

Are there any plans to have this fixed?

BadTorro avatar Jul 25 '24 19:07 BadTorro

Hi @BadTorro, try enabling the disaster-recovery cronjob.

sudhir649 avatar Jul 25 '24 19:07 sudhir649

What do you mean? Do you have more specifics on that? We're currently using it in a local development environment; etcd boots with 3 nodes, but 2 always keep failing. From time to time I need to shut down the entire environment and restart it to get it working again.

Using it within https://tilt.dev/

thanks

BadTorro avatar Jul 25 '24 19:07 BadTorro

Hi @BadTorro,

I have found two solutions for it so far.

  1. Intermittent workaround: delete all three PVCs and restart the pods.

  2. Check the README.md in the bitnami/etcd folder, where they explain how to enable the disaster-recovery cronjob. With disaster recovery there is a cronjob that takes backups of the PVC; if more than (n-1)/2 pods are failing, the pods automatically come back to the Running state with the help of the backup PVC. I implemented disaster recovery in my environment and saw that when 2 pods were failing they kept trying to come back (their logs changed too), but unfortunately they were still not able to recover. Once the third pod failed as well, all three pods automatically came back to the Running state from the backup PVC. So extract the etcd chart folder, enable the cronjob in values.yaml, and redeploy.
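In the Bitnami etcd chart this is the disasterRecovery block of values.yaml. A sketch under those assumptions (exact keys may vary by chart version, and the snapshot PVC typically needs a ReadWriteMany-capable storage class):

```yaml
# values.yaml for the etcd sub-chart (keys per the Bitnami etcd chart README)
disasterRecovery:
  enabled: true
  cronjob:
    schedule: "*/30 * * * *"    # how often to snapshot the keyspace
    historyLimit: 1
    snapshotHistoryLimit: 1
  pvc:
    size: 2Gi
    storageClassName: nfs       # must support ReadWriteMany
```

The `nfs` storage class name is an assumption for illustration; substitute whatever RWX-capable class exists in your cluster.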

sudhir649 avatar Jul 25 '24 20:07 sudhir649

@sudhir649 thanks for the tip, I need to verify it. It seems I need an NFS storage provider to get the snapshot volume to work.

BadTorro avatar Jul 26 '24 06:07 BadTorro

@BadTorro yes, Tilt works on the local machine, so you need to deploy an NFS storage class.

By the way, Lakshmi's solution didn't work for me.

sudhir649 avatar Jul 26 '24 08:07 sudhir649

@sudhir649 @BadTorro Deploying a new version of etcd worked for me initially, but whenever the node is scaled down and up again, one of the etcd pods goes into CrashLoopBackOff. Since the other two etcd pods keep running, it hasn't affected route and upstream creation. Still, we are about to use this in a production environment, so I wish there were a permanent fix. After reading the solution @sudhir649 pointed out, I have a few questions. 1. Won't deleting the PVCs cause loss of data that APISIX needs? 2. The second solution seems worth a try. Regarding (n-1)/2: in my case the number of etcd replicas is 3, so even if one pod is crashing, the disaster-recovery cron should run and back up the PVC. But as you mentioned, when two pods were crashing nothing changed; only when the third pod also crashed did all three come back to the Running state. In my case only one, or rarely two, pods crash. If you have any inputs, let me know. Thanks in advance!

Lakshmi2k1 avatar Jul 27 '24 07:07 Lakshmi2k1

@Lakshmi2k1

  1. For deleting the PVCs, it depends on what data you are storing in them. In my case, and generally, we store only the routes, so if I delete them the data will be restored once the new PVCs are created.

  2. In the documentation they mention more than (n-1)/2. It means that when more than 1 pod fails (at least 2 if you have 3 etcd pods), the pods will automatically try to recover. Recently in our QA env all the pods were down, so it's better to implement disaster recovery.

sudhir649 avatar Jul 27 '24 07:07 sudhir649

@sudhir649 Thanks Sudhir, I'll try the same from my end.

Lakshmi2k1 avatar Jul 29 '24 03:07 Lakshmi2k1

@sudhir649, we are facing one more error in APISIX. We use the openid-connect plugin for authentication and authorization in the ApisixPluginConfig. When we hit the Ingress of the application, it returns a 431 (Request Header Fields Too Large) error. We tried removing a few headers, but that broke the application's UI. Is there a way to solve this? Have you come across a similar issue before?

Lakshmi2k1 avatar Jul 29 '24 05:07 Lakshmi2k1

@sudhir649, we are facing one more error in APISIX. We use the openid-connect plugin for authentication and authorization in the ApisixPluginConfig. When we hit the Ingress of the application, it returns a 431 (Request Header Fields Too Large) error. We tried removing a few headers, but that broke the application's UI. Is there a way to solve this? Have you come across a similar issue before?

Did you use nginx? If you use nginx, add this to the nginx config: client_max_body_size 2G;

Thilip707 avatar Jul 29 '24 06:07 Thilip707

@Thilip707

We are using the below configuration in the apisix ConfigMap, as mentioned in the docs.

Lakshmi2k1 avatar Jul 29 '24 06:07 Lakshmi2k1

Just increase the client size and check; it should work.
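Note that a 431 is about request header size, not body size; `client_max_body_size` governs the body (a 413 error). For oversized OIDC session cookies, the relevant nginx directive is `large_client_header_buffers`, which APISIX can inject via a configuration snippet. A sketch for the apisix config.yaml (buffer values are illustrative):

```yaml
# config.yaml (apisix ConfigMap): raise nginx's header buffer limits
nginx_config:
  http_configuration_snippet: |
    large_client_header_buffers 8 64k;   # nginx default is 4 8k
```

This assumes the `nginx_config.http_configuration_snippet` key available in APISIX's config; check your APISIX version's config-default.yaml for the exact snippet keys it supports.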

Thilip707 avatar Jul 29 '24 06:07 Thilip707

@sudhir649 @BadTorro Deploying a new version of etcd worked for me initially, but whenever the node is scaled down and up again, one of the etcd pods goes into CrashLoopBackOff. Since the other two etcd pods keep running, it hasn't affected route and upstream creation. Still, we are about to use this in a production environment, so I wish there were a permanent fix. After reading the solution @sudhir649 pointed out, I have a few questions. 1. Won't deleting the PVCs cause loss of data that APISIX needs? 2. The second solution seems worth a try. Regarding (n-1)/2: in my case the number of etcd replicas is 3, so even if one pod is crashing, the disaster-recovery cron should run and back up the PVC. But as you mentioned, when two pods were crashing nothing changed; only when the third pod also crashed did all three come back to the Running state. In my case only one, or rarely two, pods crash. If you have any inputs, let me know. Thanks in advance!

Regarding that, I managed to get it working by basically:

  • deploying the Longhorn storage solution to the cluster
  • configuring Rancher Desktop based on this guide to have open-iscsi in place and usable
  • changing the storageClass in the dedicated etcd sub-chart and the related values.yaml file to "longhorn":
persistence:
  enabled: true
  storageClass: "longhorn"
  • starting everything with "tilt up"

It currently keeps running and has not crashed since. However, we are now also checking whether the Bitnami chart runs out of the box...

BadTorro avatar Jul 30 '24 20:07 BadTorro

@Lakshmi2k1

  1. For deleting the PVCs, it depends on what data you are storing in them. In my case, and generally, we store only the routes, so if I delete them the data will be restored once the new PVCs are created.
  2. In the documentation they mention more than (n-1)/2. It means that when more than 1 pod fails (at least 2 if you have 3 etcd pods), the pods will automatically try to recover. Recently in our QA env all the pods were down, so it's better to implement disaster recovery.

I enabled disaster recovery and deployed the Helm chart, but this time it was not just etcd crashing: the apisix pod was stuck in its init container, the apisix ingress controller was crashing, and the snapshot pod was also in an Error state. So I rolled back to the previous revision after observing that the pod statuses didn't change for a long time.

Lakshmi2k1 avatar Aug 05 '24 05:08 Lakshmi2k1

@sudhir649 @BadTorro Deploying a new version of etcd worked for me initially, but whenever the node is scaled down and up again, one of the etcd pods goes into CrashLoopBackOff. Since the other two etcd pods keep running, it hasn't affected route and upstream creation. Still, we are about to use this in a production environment, so I wish there were a permanent fix. After reading the solution @sudhir649 pointed out, I have a few questions. 1. Won't deleting the PVCs cause loss of data that APISIX needs? 2. The second solution seems worth a try. Regarding (n-1)/2: in my case the number of etcd replicas is 3, so even if one pod is crashing, the disaster-recovery cron should run and back up the PVC. But as you mentioned, when two pods were crashing nothing changed; only when the third pod also crashed did all three come back to the Running state. In my case only one, or rarely two, pods crash. If you have any inputs, let me know. Thanks in advance!

Regarding that, I managed to get it working by basically:

  • deploying the Longhorn storage solution to the cluster
  • configuring Rancher Desktop based on this guide to have open-iscsi in place and usable
  • changing the storageClass in the dedicated etcd sub-chart and the related values.yaml file to "longhorn":
persistence:
  enabled: true
  storageClass: "longhorn"
  • starting everything with "tilt up"

It currently keeps running and has not crashed since. However, we are now also checking whether the Bitnami chart runs out of the box...

Hi @BadTorro, how was the experience after deploying the disaster recovery? For us it's working fine, so we replicated it in all the envs.

Regards, Sudhir

sudhir649 avatar Aug 09 '24 04:08 sudhir649

@Lakshmi2k1

  1. For deleting the PVCs, it depends on what data you are storing in them. In my case, and generally, we store only the routes, so if I delete them the data will be restored once the new PVCs are created.
  2. In the documentation they mention more than (n-1)/2. It means that when more than 1 pod fails (at least 2 if you have 3 etcd pods), the pods will automatically try to recover. Recently in our QA env all the pods were down, so it's better to implement disaster recovery.

I enabled disaster recovery and deployed the Helm chart, but this time it was not just etcd crashing: the apisix pod was stuck in its init container, the apisix ingress controller was crashing, and the snapshot pod was also in an Error state. So I rolled back to the previous revision after observing that the pod statuses didn't change for a long time.

@Lakshmi2k1 The problem you encountered has nothing to do with disaster recovery. I have not experienced this problem

sudhir649 avatar Aug 09 '24 05:08 sudhir649

We had a very similar issue. All of our etcd pods were going into CrashLoopBackOff with only these warnings:

Cluster not healthy, not adding self to cluster for now, keeping trying...

We created apisix and etcd through the Helm chart, and for us the issue was that even though we re-created the StatefulSet and deleted the PVCs for a fresh start, the ETCD_INITIAL_CLUSTER_STATE env var was still set to existing.

We changed it to new, scaled the StatefulSet to 0 and then back to 3, and it started working for us.
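A sketch of that recovery, assuming the default apisix-etcd StatefulSet name and namespace and the Bitnami env var named in the comment above:

```shell
# Scale etcd down, flip the cluster state to "new", then scale back up
kubectl -n apisix scale statefulset apisix-etcd --replicas=0
kubectl -n apisix set env statefulset/apisix-etcd ETCD_INITIAL_CLUSTER_STATE=new
kubectl -n apisix scale statefulset apisix-etcd --replicas=3
```

Only do this against a cluster you intend to bootstrap fresh: with "new", the members form a new cluster rather than rejoining an existing one.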

minedetector avatar Aug 19 '24 08:08 minedetector

The problem is the same as in https://github.com/bitnami/charts/issues/16069: when a pod present in ETCD_INITIAL_CLUSTER is scheduled on a new node, it starts with an empty PVC, so the pod is no longer able to join the cluster.

pietrogoddibit2win avatar Sep 03 '24 11:09 pietrogoddibit2win