argo-events
HA Sensor has very frequent Leadership changes when running in GKE
Describe the bug
Running a Sensor with 2 replicas in GKE. Every few minutes I see this in the leader's log:

```
2022-04-14T20:59:31.177Z INFO argo-events.sensor leaderelection/leaderelection.go:153 Becoming a Follower, stand by ... {"sensorName": "webhook"}
```
Note that I am also trying an HA (Calendar) EventSource and it doesn't seem to be happening there.
To Reproduce
Run this Sensor:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: webhook
spec:
  replicas: 2
  template:
    container:
      image: docker.io/julievogelman/argo-events:fmea
      env:
        - name: DEBUG_LOG
          value: "true"
  eventBusName: jetstream-ex
  dependencies:
    - name: test-dep-a
      eventSourceName: webhook
      eventName: example
    - name: test-dep-b
      eventSourceName: webhook
      eventName: example2
    - name: test-dep-c
      eventSourceName: webhook
      eventName: example3
  triggers:
    - template:
        conditions: "test-dep-a && test-dep-b && test-dep-c"
        conditionsReset:
          - byTime:
              cron: "14 16 * * *"
              timezone: America/Los_Angeles
        name: trigger-1
        http:
          url: http://abc.com/hello1
          method: GET
    - template:
        conditions: "test-dep-b"
        name: trigger-2
        http:
          url: http://abc.com/hello1
          method: GET
```
Environment (please complete the following information):
- Kubernetes: v1.21.6-gke.1503
- Argo Events: tried with latest master as well as latest Docker image, which I presume is 1.6.3.
Additional context
Happens both with the STAN and Jetstream bus.
Message from the maintainers:
If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.
Seems to be particular to the image I built locally. I've been running with argo-events:latest and it doesn't seem to be happening with that. The only theory I have is that it's somehow related to my having built the image with `make image`, while argo-events:latest is built with `make image-multi`, but I'm not sure if that really makes sense...
Sorry, I take back what I said. Plenty of leadership changes happen even when running with quay.io/argoproj/argo-events:latest (also tried with an older version, v1.6.0, to confirm it's not a regression) in GKE. Perhaps it just happens more at some times than others. Running the stress test seems to cause it more often than when it's sitting idle.
I see frequent leader changes on EKS v1.21 too (argo-events 1.6.3)
@tooptoop4 Thanks for letting me know. Is this something you just now tried, or have you been running like this for a while?
Glancing at how Raft works, I see that "a leader election is triggered when a follower times out after waiting for a heartbeat from the leader." I wonder if the Sensor is for some reason sometimes not responding to heartbeats.
running for a while
So, I'm running v1.6.0 right now in GKE, and on average in the last 24 hours there have been restarts (leadership changes) about every 5 minutes, while it's pretty much been idle (receiving no messages).
I peeked at the nats-io/graft code that we're using, and it looks like the leader sends a heartbeat once every 100 ms. If a follower doesn't receive a heartbeat within 500 ms, a new leader election can occur. The nats-io/graft code is running the goroutine which does this - is it somehow not getting the cycles it needs? I wanted to see if the CPU was for some reason high when it occurs, so I monitored that, but it's very low: <10 millicores.
I tried modifying the underlying graft code to set MIN_ELECTION_TIMEOUT to 1 sec (from 500 msec) and the leadership changes don't seem to be happening much.
Added an issue in their GitHub repo.
Interestingly, somebody else has recently opened an issue to make the timeout value and other related constants configurable.
is https://github.com/argoproj/argo-events/issues/1680 related?
Hmm, I don't believe so. The AckWait you were referring to is a setting on the publisher side, which I don't believe we're setting (see my comment over there).
The leader election settings are all separate and contained within nats-io/graft. I suppose the heartbeats are going over the NATS bus as well so they may or may not be setting AckWait there.
This issue also occurred with HA EventSources (replicas=2). I'm not sure there is an actual behavior problem. :(
This issue has been automatically marked as stale because it has not had any activity in the last 60 days. It will be closed if no further activity occurs. Thank you for your contributions.
adding "activity" so this isn't closed - hopefully we'll get to it in the next 60 days
This issue has been automatically marked as stale because it has not had any activity in the last 60 days. It will be closed if no further activity occurs. Thank you for your contributions.
We are also seeing this behaviour. I was able to tweak the timeouts in the EventBus config but it didn't help.
I also tried updating to the latest version of argo-events (1.7.2) but the issue persists.
Same here, we're losing argo-events components regularly due to leader election failures on GKE.
Is this issue still not fixed?
HA here is already not really HA but just failover, and if it actually provokes the very outages it is meant to guard against, it's kind of useless and risky for a production setup.
@nicolas-vivot So, Kubernetes Leader Election was implemented (as described here) which was partially for the purpose of addressing this issue.
@juliev0 Thank you. So the root issue is not really fixed, but Kubernetes Leader Election can be used as an alternative. Got it, I will think about using it if I observe the same behavior in production.