
HA Sensor has very frequent Leadership changes when running in GKE


Describe the bug

Running a Sensor with 2 replicas in GKE. Every few minutes I see this in the leader's log:

2022-04-14T20:59:31.177Z INFO argo-events.sensor leaderelection/leaderelection.go:153 Becoming a Follower, stand by ... {"sensorName": "webhook"}

Note that I am also trying an HA (Calendar) EventSource and it doesn't seem to be happening there.

To Reproduce

Run this Sensor:

apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: webhook
spec:
  replicas: 2
  template:
    container:
      image: docker.io/julievogelman/argo-events:fmea
      env:
        - name: DEBUG_LOG
          value: "true"


  eventBusName: jetstream-ex

  dependencies:
    - name: test-dep-a
      eventSourceName: webhook
      eventName: example
    - name: test-dep-b
      eventSourceName: webhook
      eventName: example2
    - name: test-dep-c
      eventSourceName: webhook
      eventName: example3
  triggers:
    - template:
        conditions: "test-dep-a && test-dep-b && test-dep-c"
        conditionsReset: 
          - byTime:
              cron: "14 16 * * *"
              timezone: America/Los_Angeles
        name: trigger-1
        http:
          url: http://abc.com/hello1
          method: GET
    - template:
        conditions: "test-dep-b"
        name: trigger-2
        http:
          url: http://abc.com/hello1
          method: GET

Environment:

  • Kubernetes: v1.21.6-gke.1503
  • Argo Events: tried with the latest master as well as the latest Docker image, which I presume is v1.6.3.

Additional context

Happens with both the STAN and JetStream buses.


Message from the maintainers:

If you wish to see this enhancement implemented, please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

juliev0 avatar Apr 14 '22 23:04 juliev0

Seems to be particular to the image I built locally. I've been running with argo-events:latest and it doesn't seem to happen there. The only theory I have is that it's somehow related to my having built the image with "make image" while argo-events:latest is built with "make image-multi", but I'm not sure that really makes sense...

juliev0 avatar Apr 16 '22 04:04 juliev0

Sorry, I take back what I said. Plenty of leadership changes happen in GKE even when running quay.io/argoproj/argo-events:latest (I also tried an older version, v1.6.0, to confirm it's not a regression). Perhaps it just happens more at some times than others; running the stress test seems to cause it more than leaving it sitting idle.

juliev0 avatar Apr 17 '22 22:04 juliev0

I see frequent leader changes on EKS v1.21 too (argo-events 1.6.3)

tooptoop4 avatar Apr 18 '22 23:04 tooptoop4

@tooptoop4 Thanks for letting me know. Is this something you just now tried, or have you been running like this for a while?

juliev0 avatar Apr 19 '22 01:04 juliev0

Glancing at how RAFT works, I see that "a leader election is triggered when a follower times out after waiting for a heartbeat from the leader." I wonder if the Sensor is sometimes, for some reason, not responding to heartbeats.

juliev0 avatar Apr 19 '22 04:04 juliev0

running for a while

tooptoop4 avatar Apr 19 '22 09:04 tooptoop4

So, I'm running v1.6.0 right now in GKE, and over the last 24 hours there have been restarts (leadership changes) about every 5 minutes on average, while the Sensor has been essentially idle (receiving no messages).

I peeked at the nats-io/graft code we're using: the leader sends a heartbeat once every 100 ms, and if a follower doesn't receive one within 500 ms, a new leader election can occur. The nats-io/graft code runs the goroutine that does this; is it somehow not getting the cycles it needs? I wanted to see whether the CPU was high for some reason when this occurs, so I monitored it, but it's very low: <10 millicores.
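
To make the failure mode concrete, here is a minimal, self-contained sketch of the follower-side pattern (not graft's actual code; the constants mirror the 100 ms / 500 ms values above). Anything that starves either the leader's heartbeat loop or the follower's receive loop for a full election timeout triggers a leadership change:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// Timing mirrors the values above: heartbeats every 100 ms, election
// timeout randomized in [500 ms, 1 s).
const (
    heartbeatInterval  = 100 * time.Millisecond
    minElectionTimeout = 500 * time.Millisecond
)

// randomElectionTimeout spreads follower timeouts so replicas don't all
// start elections at the same instant (per the RAFT paper).
func randomElectionTimeout() time.Duration {
    return minElectionTimeout + time.Duration(rand.Int63n(int64(minElectionTimeout)))
}

// runFollower returns when no heartbeat arrives within one election
// timeout, the point at which a real node would become a candidate.
func runFollower(heartbeats <-chan struct{}) {
    timer := time.NewTimer(randomElectionTimeout())
    defer timer.Stop()
    for {
        select {
        case <-heartbeats:
            // Heartbeat arrived in time: reset the election timer.
            if !timer.Stop() {
                <-timer.C
            }
            timer.Reset(randomElectionTimeout())
        case <-timer.C:
            // Missed heartbeats for a full election timeout.
            return
        }
    }
}

func main() {
    heartbeats := make(chan struct{})
    go func() {
        // Simulate a leader whose heartbeat loop stalls after 2 seconds
        // (GC pause, CPU starvation, a slow NATS round-trip, ...).
        for deadline := time.Now().Add(2 * time.Second); time.Now().Before(deadline); {
            heartbeats <- struct{}{}
            time.Sleep(heartbeatInterval)
        }
    }()
    runFollower(heartbeats)
    fmt.Println("election timeout expired; becoming a candidate")
}

With a 500 ms floor, even a sub-second stall on either side is enough to fire the timer, which would explain elections occurring without any visible CPU pressure.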

juliev0 avatar Apr 19 '22 15:04 juliev0

I tried modifying the underlying graft code to raise MIN_ELECTION_TIMEOUT from 500 msec to 1 sec, and the leadership changes don't seem to be happening much anymore.
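
For reference, these knobs are hard-coded constants in nats-io/graft (names and values as in the copy I looked at; verify against the vendored source for your version). The experiment above amounts to raising the first one:

package graft

import "time"

const (
    // Minimum time a follower waits for a heartbeat before starting an
    // election. Raising this to 1 * time.Second was the experiment above.
    MIN_ELECTION_TIMEOUT = 500 * time.Millisecond
    // The actual timeout is randomized between the MIN and MAX values.
    MAX_ELECTION_TIMEOUT = 2 * MIN_ELECTION_TIMEOUT
    // How often the leader sends heartbeats; well under MIN_ELECTION_TIMEOUT.
    HEARTBEAT_TICK = 100 * time.Millisecond
)

Since these are constants rather than configuration, changing them means patching and rebuilding graft, which is why the follow-up issue mentioned below asks for them to be made configurable.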

juliev0 avatar Apr 19 '22 21:04 juliev0

Filed an issue in their GitHub.

Interestingly, somebody else recently opened an issue asking to make the timeout value and other related constants configurable.

juliev0 avatar Apr 20 '22 16:04 juliev0

is https://github.com/argoproj/argo-events/issues/1680 related?

tooptoop4 avatar Apr 20 '22 22:04 tooptoop4

Regarding #1680: hmm, I don't believe so. The AckWait you were referring to is a setting on the publisher side, which I don't believe we're setting (see my comment over there).

The leader election settings are all separate and contained within nats-io/graft. That said, the heartbeats do go over the NATS bus as well, so AckWait may or may not be set there.

juliev0 avatar Apr 21 '22 05:04 juliev0

This issue also occurred with HA EventSources (replicas=2). I'm not sure there is an actual behavior problem. :(

junjunjunk avatar May 09 '22 04:05 junjunjunk

This issue has been automatically marked as stale because it has not had any activity in the last 60 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jul 09 '22 03:07 github-actions[bot]

adding "activity" so this isn't closed - hopefully we'll get to it in the next 60 days

juliev0 avatar Jul 09 '22 20:07 juliev0

This issue has been automatically marked as stale because it has not had any activity in the last 60 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 08 '22 03:09 github-actions[bot]

We are also seeing this behaviour. I was able to tweak the timeouts in the EventBus config, but it didn't help.

I also tried updating to the latest version of argo-events (1.7.2), but the issue persists.

ekawas-td avatar Sep 15 '22 20:09 ekawas-td

Same here; we're losing argo-events components regularly due to leader election failure on GKE.

iskandertajine avatar Nov 04 '22 08:11 iskandertajine

Is this issue still not fixed?

HA here is already not really HA but just failover, and if it actually provokes the very situation it is meant to fix, it's kind of useless and risky for a production setup.

nicolas-vivot avatar Jun 22 '23 00:06 nicolas-vivot

@nicolas-vivot So, Kubernetes Leader Election was implemented (as described here), partly for the purpose of addressing this issue.
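
For anyone landing here later: "Kubernetes Leader Election" means the replicas race for a coordination.k8s.io Lease via client-go instead of exchanging graft heartbeats over the event bus. A minimal sketch of that mechanism, not argo-events' actual implementation (the lease name, namespace, and timings are placeholder assumptions):

package main

import (
    "context"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // One Lease per Sensor; all replicas try to hold it. Names are placeholders.
    lock := &resourcelock.LeaseLock{
        LeaseMeta: metav1.ObjectMeta{Name: "sensor-webhook", Namespace: "argo-events"},
        Client:    client.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{
            Identity: os.Getenv("POD_NAME"), // unique identity per replica
        },
    }

    leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
        Lock:          lock,
        LeaseDuration: 15 * time.Second, // how long the lease is valid without renewal
        RenewDeadline: 10 * time.Second, // leader must renew within this window
        RetryPeriod:   2 * time.Second,  // how often followers retry acquisition
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                // Leader: start processing dependencies and firing triggers.
            },
            OnStoppedLeading: func() {
                // Lost the lease: become a follower and stand by.
            },
        },
    })
}

Note that the typical client-go timings tolerate multi-second stalls before followers take over, which is far more forgiving than graft's 500 ms election floor.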

juliev0 avatar Jun 22 '23 01:06 juliev0

@juliev0 Thank you. So the root issue is not really fixed, but Kubernetes Leader Election can be used as an alternative. Got it; I'll consider using it if I observe the same behavior in production.

nicolas-vivot avatar Jun 22 '23 01:06 nicolas-vivot