
HA Sensor has very frequent Leadership changes when running in GKE


Describe the bug

Running a Sensor with 2 replicas in GKE. Every few minutes I see this in the leader's log:

2022-04-14T20:59:31.177Z INFO argo-events.sensor leaderelection/leaderelection.go:153 Becoming a Follower, stand by ... {"sensorName": "webhook"}

Note that I am also trying an HA (Calendar) EventSource and it doesn't seem to be happening there.

To Reproduce

Run this Sensor:

apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: webhook
spec:
  replicas: 2
  template:
    container:
      image: docker.io/julievogelman/argo-events:fmea
      env:
        - name: DEBUG_LOG
          value: "true"


  eventBusName: jetstream-ex

  dependencies:
    - name: test-dep-a
      eventSourceName: webhook
      eventName: example
    - name: test-dep-b
      eventSourceName: webhook
      eventName: example2
    - name: test-dep-c
      eventSourceName: webhook
      eventName: example3
  triggers:
    - template:
        conditions: "test-dep-a && test-dep-b && test-dep-c"
        conditionsReset: 
          - byTime:
              cron: "14 16 * * *"
              timezone: America/Los_Angeles
        name: trigger-1
        http:
          url: http://abc.com/hello1
          method: GET
    - template:
        conditions: "test-dep-b"
        name: trigger-2
        http:
          url: http://abc.com/hello1
          method: GET

Environment:

  • Kubernetes: v1.21.6-gke.1503
  • Argo Events: tried with the latest master as well as the latest Docker image, which I presume is v1.6.3.

Additional context

Happens with both the STAN and JetStream buses.


Message from the maintainers:

If you wish to see this enhancement implemented, please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

juliev0 avatar Apr 14 '22 23:04 juliev0

Seems to be particular to the image I built locally. I've been running with argo-events:latest and it doesn't seem to happen there. The only theory I have is that it's somehow related to my having built the image with "make image" while argo-events:latest is built with "make image-multi", but I'm not sure that really makes sense...

juliev0 avatar Apr 16 '22 04:04 juliev0

Sorry, I take back what I said. Plenty of leadership changes happen in GKE even when running quay.io/argoproj/argo-events:latest (I also tried an older version, v1.6.0, to confirm it's not a regression). Perhaps it just happens more at some times than others; running the stress test seems to cause it more than leaving it sitting idle.

juliev0 avatar Apr 17 '22 22:04 juliev0

I see frequent leader changes on EKS v1.21 too (argo-events 1.6.3)

tooptoop4 avatar Apr 18 '22 23:04 tooptoop4

@tooptoop4 Thanks for letting me know. Is this something you just now tried, or have you been running like this for a while?

juliev0 avatar Apr 19 '22 01:04 juliev0

Glancing at how RAFT works, I see that "a leader election is triggered when a follower times out after waiting for a heartbeat from the leader." I wonder if the Sensor is sometimes, for some reason, not responding to heartbeats.

juliev0 avatar Apr 19 '22 04:04 juliev0

running for a while

tooptoop4 avatar Apr 19 '22 09:04 tooptoop4

So, I'm running v1.6.0 right now in GKE, and over the last 24 hours there have been restarts (leadership changes) about every 5 minutes on average, while the Sensor has been essentially idle (receiving no messages).

I peeked at the nats-io/graft code we're using: the leader sends a heartbeat once every 100 ms, and if a follower doesn't receive one within 500 ms, a new leader election can occur. The nats-io/graft code runs the goroutine that does this; is it somehow not getting the cycles it needs? I wanted to see whether the CPU was high for some reason when this occurs, so I monitored it, but it's very low: <10 millicores.
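
To make the failure mode concrete, here is a minimal, self-contained sketch of the follower-side pattern (not graft's actual code; the constants mirror the 100 ms / 500 ms values above). Anything that starves either the leader's heartbeat loop or the follower's receive loop for a full election timeout triggers a leadership change:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// Timing mirrors the values above: heartbeats every 100 ms, election
// timeout randomized in [500 ms, 1 s).
const (
    heartbeatInterval  = 100 * time.Millisecond
    minElectionTimeout = 500 * time.Millisecond
)

// randomElectionTimeout spreads follower timeouts so replicas don't all
// start elections at the same instant (per the RAFT paper).
func randomElectionTimeout() time.Duration {
    return minElectionTimeout + time.Duration(rand.Int63n(int64(minElectionTimeout)))
}

// runFollower returns when no heartbeat arrives within one election
// timeout, the point at which a real node would become a candidate.
func runFollower(heartbeats <-chan struct{}) {
    timer := time.NewTimer(randomElectionTimeout())
    defer timer.Stop()
    for {
        select {
        case <-heartbeats:
            // Heartbeat arrived in time: reset the election timer.
            if !timer.Stop() {
                <-timer.C
            }
            timer.Reset(randomElectionTimeout())
        case <-timer.C:
            // Missed heartbeats for a full election timeout.
            return
        }
    }
}

func main() {
    heartbeats := make(chan struct{})
    go func() {
        // Simulate a leader whose heartbeat loop stalls after 2 seconds
        // (GC pause, CPU starvation, a slow NATS round-trip, ...).
        for deadline := time.Now().Add(2 * time.Second); time.Now().Before(deadline); {
            heartbeats <- struct{}{}
            time.Sleep(heartbeatInterval)
        }
    }()
    runFollower(heartbeats)
    fmt.Println("election timeout expired; becoming a candidate")
}

With a 500 ms floor, even a sub-second stall on either side is enough to fire the timer, which would explain elections occurring without any visible CPU pressure.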

juliev0 avatar Apr 19 '22 15:04 juliev0

I tried modifying the underlying graft code to raise MIN_ELECTION_TIMEOUT from 500 msec to 1 sec, and the leadership changes don't seem to be happening much anymore.
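
For reference, these knobs are hard-coded constants in nats-io/graft (names and values as in the copy I looked at; verify against the vendored source for your version). The experiment above amounts to raising the first one:

package graft

import "time"

const (
    // Minimum time a follower waits for a heartbeat before starting an
    // election. Raising this to 1 * time.Second was the experiment above.
    MIN_ELECTION_TIMEOUT = 500 * time.Millisecond
    // The actual timeout is randomized between the MIN and MAX values.
    MAX_ELECTION_TIMEOUT = 2 * MIN_ELECTION_TIMEOUT
    // How often the leader sends heartbeats; well under MIN_ELECTION_TIMEOUT.
    HEARTBEAT_TICK = 100 * time.Millisecond
)

Since these are constants rather than configuration, changing them means patching and rebuilding graft, which is why the follow-up issue mentioned below asks for them to be made configurable.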

juliev0 avatar Apr 19 '22 21:04 juliev0

Filed an issue in their GitHub.

Interestingly, somebody else recently opened an issue asking to make the timeout value and other related constants configurable.

juliev0 avatar Apr 20 '22 16:04 juliev0

is https://github.com/argoproj/argo-events/issues/1680 related?

tooptoop4 avatar Apr 20 '22 22:04 tooptoop4

Regarding #1680: hmm, I don't believe so. The AckWait you were referring to is a setting on the publisher side, which I don't believe we're setting (see my comment over there).

The leader election settings are all separate and contained within nats-io/graft. That said, the heartbeats do go over the NATS bus as well, so AckWait may or may not be set there.

juliev0 avatar Apr 21 '22 05:04 juliev0

This issue also occurred with HA EventSources (replicas=2). I'm not sure there is an actual behavior problem. :(

junjunjunk avatar May 09 '22 04:05 junjunjunk

This issue has been automatically marked as stale because it has not had any activity in the last 60 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jul 09 '22 03:07 github-actions[bot]

adding "activity" so this isn't closed - hopefully we'll get to it in the next 60 days

juliev0 avatar Jul 09 '22 20:07 juliev0

This issue has been automatically marked as stale because it has not had any activity in the last 60 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 08 '22 03:09 github-actions[bot]

We are also seeing this behaviour. I was able to tweak the timeouts in the EventBus config, but it didn't help.

I also tried updating to the latest version of argo-events (1.7.2), but the issue persists.

ekawas-td avatar Sep 15 '22 20:09 ekawas-td

Same here; we're losing argo-events components regularly due to leader election failure on GKE.

iskandertajine avatar Nov 04 '22 08:11 iskandertajine

Is this issue still not fixed?

HA here is already not really HA but just failover, and if it actually provokes the very situation it is meant to fix, it's kind of useless and risky for a production setup.

nicolas-vivot avatar Jun 22 '23 00:06 nicolas-vivot

@nicolas-vivot So, Kubernetes Leader Election was implemented (as described here), partly for the purpose of addressing this issue.
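
For anyone landing here later: "Kubernetes Leader Election" means the replicas race for a coordination.k8s.io Lease via client-go instead of exchanging graft heartbeats over the event bus. A minimal sketch of that mechanism, not argo-events' actual implementation (the lease name, namespace, and timings are placeholder assumptions):

package main

import (
    "context"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // One Lease per Sensor; all replicas try to hold it. Names are placeholders.
    lock := &resourcelock.LeaseLock{
        LeaseMeta: metav1.ObjectMeta{Name: "sensor-webhook", Namespace: "argo-events"},
        Client:    client.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{
            Identity: os.Getenv("POD_NAME"), // unique identity per replica
        },
    }

    leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
        Lock:          lock,
        LeaseDuration: 15 * time.Second, // how long the lease is valid without renewal
        RenewDeadline: 10 * time.Second, // leader must renew within this window
        RetryPeriod:   2 * time.Second,  // how often followers retry acquisition
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                // Leader: start processing dependencies and firing triggers.
            },
            OnStoppedLeading: func() {
                // Lost the lease: become a follower and stand by.
            },
        },
    })
}

Note that the typical client-go timings tolerate multi-second stalls before followers take over, which is far more forgiving than graft's 500 ms election floor.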

juliev0 avatar Jun 22 '23 01:06 juliev0

@juliev0 Thank you. So the root issue is not really fixed, but Kubernetes Leader Election can be used as an alternative. Got it; I'll consider using it if I observe the same behavior in production.

nicolas-vivot avatar Jun 22 '23 01:06 nicolas-vivot