
bug: ingress-controller doesn't recover from failed sync

Open acuteaura opened this issue 2 years ago • 27 comments

Current Behavior

Occasionally, our controller stops doing anything after a failed sync, but it also doesn't exit.

We have logs of this occurring immediately after an upgrade, but we've seen similar behavior elsewhere too.

Expected Behavior

The ingress controller should either retry running its logic or exit.

Error Logs

2023-09-22T17:40:34+08:00	info	ingress/ingress.go:145	start ingress controller
2023-09-22T17:40:34+08:00	info	providers/controller.go:155	start api server
2023-09-22T17:40:34+08:00	info	providers/controller.go:170	start leader election
I0922 17:40:34.168741       1 leaderelection.go:250] attempting to acquire leader lease apisix/ingress-apisix-internal-leader...
2023-09-22T17:40:34+08:00	warn	providers/controller.go:220	found a new leader apisix-internal-ingress-controller-54d9c8d775-9qwkf
2023-09-22T17:40:34+08:00	info	providers/controller.go:222	controller now is running as a candidate	{"namespace": "apisix", "pod": "apisix-internal-ingress-controller-5948cfdc95-hq8cm"}
2023-09-22T17:40:34+08:00	info	providers/controller.go:387	controller tries to leading ...	{"namespace": "apisix", "pod": "apisix-internal-ingress-controller-5948cfdc95-hq8cm"}
2023-09-22T17:40:34+08:00	warn	apisix/cluster.go:423	waiting cluster default to ready, it may takes a while
2023-09-22T17:40:34+08:00	info	apisix/cluster.go:248	syncing cache	{"cluster": "default"}
2023-09-22T17:40:34+08:00	info	apisix/cluster.go:463	syncing schema	{"cluster": "default"}
2023-09-22T17:40:34+08:00	error	apisix/route.go:90	failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:34+08:00	error	apisix/cluster.go:298	failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:36+08:00	error	apisix/route.go:90	failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:36+08:00	error	apisix/cluster.go:298	failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:38+08:00	error	apisix/cluster.go:298	failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:40+08:00	error	apisix/route.go:90	failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:40+08:00	error	apisix/cluster.go:298	failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:42+08:00	error	apisix/route.go:90	failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:42+08:00	error	apisix/cluster.go:298	failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:42+08:00	error	apisix/cluster.go:258	failed to sync cache	{"cost_time": "8.011791225s", "cluster": "default"}
2023-09-22T17:40:42+08:00	error	providers/controller.go:419	failed to wait the default cluster to be ready: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:42+08:00	info	apisix/cluster.go:248	syncing cache	{"cluster": "default"}
2023-09-22T17:40:42+08:00	info	apisix/cluster.go:463	syncing schema	{"cluster": "default"}
2023-09-22T17:40:42+08:00	error	apisix/route.go:90	failed to list routes: Get "http://apisix-admin.apisix.svc.cluster.local:9180/apisix/admin/routes": context canceled
2023-09-22T17:40:42+08:00	error	apisix/plugin.go:46	failed to list plugins' names: Get "http://apisix-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
2023-09-22T17:40:42+08:00	error	apisix/cluster.go:298	failed to list routes in APISIX: Get "http://apisix-admin.apisix.svc.cluster.local:9180/apisix/admin/routes": context canceled
2023-09-22T17:40:42+08:00	error	apisix/cluster.go:483	failed to list plugin names in APISIX: Get "http://apisix-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
2023-09-22T17:40:42+08:00	error	apisix/cluster.go:446	failed to sync schema: Get "http://apisix-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
2023-09-22T17:40:42+08:00	error	apisix/cluster.go:258	failed to sync cache	{"cost_time": "99.599µs", "cluster": "default"}

Do note that I pulled this log on 2023-09-26; the ingress controller has just not logged anything in days.

Steps to Reproduce

Not fully known, but etcd/apisix being unavailable at critical moments seems to trigger it.

Environment

  • APISIX Ingress controller version (run apisix-ingress-controller version --long) apache/apisix-ingress-controller:1.7.0
  • Kubernetes cluster version (run kubectl version) v1.26.6
  • OS version if running APISIX Ingress controller in a bare-metal environment (run uname -a) N/A

acuteaura avatar Sep 26 '23 05:09 acuteaura

I should note that this is running with a single replica. It seems to have just quit here, believing another replica will take over leadership, though that never happens.

acuteaura avatar Sep 26 '23 07:09 acuteaura

> invalid response code

In fact, these responses come from the APISIX Admin API. This means that an unhealthy APISIX instance responded to these admin API requests.

tao12345666333 avatar Sep 26 '23 07:09 tao12345666333

That is to be expected, since this happened during a helm upgrade and APISIX pods were being recreated at the same time.

Either way, the controller getting stuck doing nothing is the unexpected behavior. The controller simply cannot assume that the APISIX API will be reachable at startup. I'd suggest that a failed sync should cause the ingress controller to exit (or retry, if refactored to; right now it seems to cancel an in-flight retry attempt via the deferred cancel). A failed sync is not a good signal for giving up on leader election.

acuteaura avatar Sep 26 '23 07:09 acuteaura

You can check the Helm chart. It has an init container that checks the connection to APISIX.

tao12345666333 avatar Sep 26 '23 12:09 tao12345666333

That only checks whether the TCP socket is up, and the same goes for the readiness probe on APISIX itself (it's not an HTTP check). I'm not sure whether APISIX itself takes a second to boot up or whether etcd being replaced at the same time is the culprit (we run etcd with replicas=1 right now), but the server responding with a 502/503 is not a condition that's checked.

acuteaura avatar Sep 26 '23 12:09 acuteaura

This issue has been marked as stale due to 90 days of inactivity. It will be closed in 30 days if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Dec 26 '23 01:12 github-actions[bot]

/reopen

Still an edge case the ingress controller can't recover from.

acuteaura avatar Dec 26 '23 09:12 acuteaura

Also running into this issue here.

geoffreytran avatar Jan 12 '24 13:01 geoffreytran

This is a critical issue, and checking whether the TCP port is available is absolutely not sufficient, @tao12345666333. We had yet another ingress controller that simply got stuck. Frankly, the entire leader election code seems broken (e.g. it reports "running as leader" in a callback where nothing else happens, and it starts the controller loop once any controller is elected), and I'll look into getting a PR in for this as soon as possible (after #2139). This is unfortunately making us look pretty bad as an infra team for adopting this ingress.

2024-01-19T21:29:00+08:00	info	ingress/ingress.go:121	init apisix ingress controller
2024-01-19T21:29:00+08:00	info	ingress/ingress.go:123	version:
Version: 1.7.0
Git SHA: no-git-module
Go Version: go1.20.8
Building OS/Arch: linux/amd64
Running OS/Arch: linux/amd64

2024-01-19T21:29:00+08:00	info	ingress/ingress.go:133	use configuration
{
  "cert_file": "/etc/webhook/certs/cert.pem",
  "key_file": "/etc/webhook/certs/key.pem",
  "log_level": "info",
  "log_output": "stderr",
  "log_rotate_output_path": "",
  "log_rotation_max_size": 100,
  "log_rotation_max_age": 0,
  "log_rotation_max_backups": 0,
  "http_listen": ":8080",
  "https_listen": ":8443",
  "ingress_publish_service": "apisix/apisix-internal-gateway",
  "ingress_status_address": [],
  "enable_profiling": true,
  "kubernetes": {
    "kubeconfig": "",
    "resync_interval": "6h0m0s",
    "namespace_selector": [],
    "election_id": "ingress-apisix-internal-leader",
    "ingress_class": "apisix-internal",
    "ingress_version": "networking/v1",
    "watch_endpoint_slices": false,
    "api_version": "apisix.apache.org/v2",
    "enable_gateway_api": false,
    "disable_status_updates": false,
    "enable_admission": false
  },
  "apisix": {
    "admin_api_version": "v3",
    "default_cluster_name": "default",
    "default_cluster_base_url": "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin",
    "default_cluster_admin_key": "******"
  },
  "apisix_resource_sync_interval": "1h0m0s",
  "apisix_resource_sync_comparison": true,
  "plugin_metadata_cm": "",
  "etcdserver": {
    "enabled": false,
    "prefix": "/apisix",
    "listen_address": ":12379",
    "ssl_key_encrypt_salt": "edd1c9f0985e76a2"
  }
}
2024-01-19T21:29:00+08:00	info	ingress/ingress.go:145	start ingress controller
2024-01-19T21:29:00+08:00	info	providers/controller.go:170	start leader election
2024-01-19T21:29:00+08:00	info	providers/controller.go:155	start api server
I0119 21:29:00.286860       1 leaderelection.go:250] attempting to acquire leader lease apisix/ingress-apisix-internal-leader...
2024-01-19T21:29:00+08:00	warn	providers/controller.go:220	found a new leader apisix-internal-ingress-controller-578bf88584-jvs59
2024-01-19T21:29:00+08:00	info	providers/controller.go:222	controller now is running as a candidate	{"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
2024-01-19T21:29:00+08:00	info	providers/controller.go:387	controller tries to leading ...	{"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
2024-01-19T21:29:00+08:00	info	apisix/stream_route.go:38	resource stream_routes is disabled
2024-01-19T21:29:00+08:00	info	apisix/cluster.go:248	syncing cache	{"cluster": "default"}
2024-01-19T21:29:00+08:00	warn	apisix/cluster.go:423	waiting cluster default to ready, it may takes a while
2024-01-19T21:29:00+08:00	info	apisix/cluster.go:463	syncing schema	{"cluster": "default"}
2024-01-19T21:29:00+08:00	error	apisix/route.go:90	failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:00+08:00	error	apisix/cluster.go:298	failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:02+08:00	error	apisix/route.go:90	failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:02+08:00	error	apisix/cluster.go:298	failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:03+08:00	info	providers/controller.go:141	LeaderElection	{"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l became leader", "event_type": "Normal"}
I0119 21:29:03.113259       1 leaderelection.go:260] successfully acquired lease apisix/ingress-apisix-internal-leader
2024-01-19T21:29:03+08:00	warn	providers/controller.go:220	found a new leader apisix-internal-ingress-controller-9d96f45cb-4kg8l
2024-01-19T21:29:03+08:00	info	providers/controller.go:212	controller now is running as leader	{"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
2024-01-19T21:29:04+08:00	error	apisix/route.go:90	failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:04+08:00	error	apisix/cluster.go:298	failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:06+08:00	error	apisix/route.go:90	failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:06+08:00	error	apisix/cluster.go:298	failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:08+08:00	error	apisix/route.go:90	failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:08+08:00	error	apisix/cluster.go:298	failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:08+08:00	error	apisix/cluster.go:258	failed to sync cache	{"cost_time": "8.016969049s", "cluster": "default"}
2024-01-19T21:29:08+08:00	error	providers/controller.go:419	failed to wait the default cluster to be ready: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:08+08:00	info	apisix/stream_route.go:38	resource stream_routes is disabled
2024-01-19T21:29:08+08:00	info	apisix/cluster.go:248	syncing cache	{"cluster": "default"}
2024-01-19T21:29:08+08:00	info	apisix/cluster.go:463	syncing schema	{"cluster": "default"}
2024-01-19T21:29:08+08:00	error	apisix/route.go:90	failed to list routes: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/routes": context canceled
2024-01-19T21:29:08+08:00	error	apisix/cluster.go:298	failed to list routes in APISIX: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/routes": context canceled
2024-01-19T21:29:08+08:00	error	apisix/cluster.go:258	failed to sync cache	{"cost_time": "60.702µs", "cluster": "default"}
2024-01-19T21:29:08+08:00	error	apisix/plugin.go:46	failed to list plugins' names: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
2024-01-19T21:29:08+08:00	error	apisix/cluster.go:483	failed to list plugin names in APISIX: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
2024-01-19T21:29:08+08:00	error	apisix/cluster.go:446	failed to sync schema: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
E0122 23:07:05.784720       1 leaderelection.go:369] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/apisix/leases/ingress-apisix-internal-leader": context deadline exceeded
I0122 23:07:05.784783       1 leaderelection.go:285] failed to renew lease apisix/ingress-apisix-internal-leader: timed out waiting for the condition
2024-01-22T23:07:09+08:00	info	providers/controller.go:141	LeaderElection	{"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l stopped leading", "event_type": "Normal"}
2024-01-22T23:07:09+08:00	info	providers/controller.go:231	controller now is running as a candidate	{"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
I0122 23:07:09.152585       1 leaderelection.go:250] attempting to acquire leader lease apisix/ingress-apisix-internal-leader...
2024-01-22T23:07:09+08:00	info	providers/controller.go:141	LeaderElection	{"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l became leader", "event_type": "Normal"}
I0122 23:07:09.984780       1 leaderelection.go:260] successfully acquired lease apisix/ingress-apisix-internal-leader
2024-01-22T23:07:09+08:00	info	providers/controller.go:212	controller now is running as leader	{"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
E0124 13:07:40.830633       1 leaderelection.go:369] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/apisix/leases/ingress-apisix-internal-leader": context deadline exceeded
I0124 13:07:40.830667       1 leaderelection.go:285] failed to renew lease apisix/ingress-apisix-internal-leader: timed out waiting for the condition
2024-01-24T13:07:40+08:00	info	providers/controller.go:141	LeaderElection	{"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l stopped leading", "event_type": "Normal"}
2024-01-24T13:07:40+08:00	info	providers/controller.go:231	controller now is running as a candidate	{"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
I0124 13:07:40.843705       1 leaderelection.go:250] attempting to acquire leader lease apisix/ingress-apisix-internal-leader...
2024-01-24T13:07:40+08:00	info	providers/controller.go:141	LeaderElection	{"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l became leader", "event_type": "Normal"}
I0124 13:07:40.864098       1 leaderelection.go:260] successfully acquired lease apisix/ingress-apisix-internal-leader
2024-01-24T13:07:40+08:00	info	providers/controller.go:212	controller now is running as leader	{"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}

acuteaura avatar Jan 25 '24 12:01 acuteaura

@acuteaura I will take up this item.

Revolyssup avatar Jan 30 '24 23:01 Revolyssup

I'm still happy to submit a PR next week if you'd prefer. I already dug around the code a bunch and my other open PR should address at least one of these cases (where returning nil to the command would not exit).

acuteaura avatar Jan 31 '24 07:01 acuteaura

@acuteaura I approved that PR. You can create a PR for this issue.

Revolyssup avatar Jan 31 '24 13:01 Revolyssup

For the record, we hit the same incident this week in a production environment because the ingress-controller stopped working without failing or retrying.

I agree with @acuteaura that this is a critical bug.

flassagn avatar Feb 08 '24 08:02 flassagn

A first mitigation could be to update the healthz endpoint to check whether the cluster has a leader, and fail if it doesn't.

flassagn avatar Feb 08 '24 08:02 flassagn

#2139 and #2152 fix a bunch of these bugs, but I'm currently stuck trying to figure out why the E2E tests are failing on the second one.

The problem isn't having no leader; it's that the controller bails out of run when the APISIX API can't be reached, without stepping down or exiting. Before #2139 that meant the controller was stuck waiting for an interrupt while still holding leadership.

A possible hotfix would be an init container that checks whether the API is actually available, plus fixed health/readiness checks on APISIX itself. Right now the probe only checks that a TCP socket is open, but that doesn't suffice when APISIX is still booting up and rejects API requests with 503.

acuteaura avatar Feb 08 '24 08:02 acuteaura

Specifically, the bug we're seeing and that likely you are seeing is here

acuteaura avatar Feb 08 '24 08:02 acuteaura

Hi,

First, I would like to say thank you to the people who are currently working on this. Then, I would kindly like to ask whether there is any update here. :bow:

flassagn avatar Mar 29 '24 08:03 flassagn

Keep an eye on #2152, I'm ready to pick up work on it as soon as someone figures out why the E2E tests don't like that PR.

acuteaura avatar Mar 29 '24 11:03 acuteaura

Even if APISIX is started long before the controller, I'm getting the "context canceled" errors when relaunching the controller.

If I get into the controller's Pod and do the same requests with curl manually, I get the correct responses.

The bug isn't random for me, the controller can't start working at all.

VelorumS avatar Apr 16 '24 09:04 VelorumS

Is there an update on this? We have recently started seeing this issue pop up more and more on our clusters that use spot instances, where pods get recreated pretty often. We are running the latest version (1.8.2). I am happy to help with whatever is necessary to move this forward.

Ben10k avatar May 14 '24 19:05 Ben10k

You could patch in an os.Exit on the TODO line in Controller.run inside pkg/providers/controller.go for now, or even submit that as a PR. That would certainly be better than the status quo of ignoring failed initialization and just not running the controller. I'd much prefer if leader election weren't "both controllers do everything, but one skips writes (hopefully everywhere)", because it's a nice footgun for contributors who might not be aware of the larger picture, but I don't have the resources to reproduce why the somewhat fragile E2E test suite (which unfortunately also somewhat relies on running inside Actions) fails on that one.

acuteaura avatar May 15 '24 07:05 acuteaura

I do agree that the non-leader should not be doing anything at all, instead of what it does now. I'll try to submit your os.Exit change as a new PR here and will try to look into the e2e suite, but I am not very proficient with this codebase, or with Golang in general, so I'm not sure I'll be able to figure that one out. Thanks for the quick response.

Ben10k avatar May 15 '24 08:05 Ben10k

Looking forward to the fix. We constantly run into this issue when our Kubernetes provider (GKE) performs an update on the cluster where APISIX is running. More or less every time, we have to manually restart the ingress-controller to get things working again.

wofr avatar May 16 '24 07:05 wofr

We also ran into a similar issue in prod in 1.7.1.

ssingh3339 avatar Aug 13 '24 04:08 ssingh3339