argo-events

bug: kafka sensors stop working quietly

Open AmitMendl opened this issue 6 months ago • 2 comments

Describe the bug Sensors connected to a Kafka eventbus stop working some time after being deployed.

To Reproduce Steps to reproduce the behavior:

  1. Create a kafka eventbus
  2. Create a basic HTTP webhook
  3. Create a kafka sensor (doesn't matter what it triggers)
  4. Wait a couple of weeks
  5. Sensor will stop receiving events

Expected behavior The sensor should receive the events, or at least the pod

Environment (please complete the following information):

  • Kubernetes: v1.28
  • Argo Events: v1.8.0
  • Strimzi: v0.39.0

Screenshots Unfortunately, I cannot supply any screenshots or logs, since this happened in an offline environment. I will try to describe the logs.

  • eventsource logs indicated that the event was created
  • AKHQ connected to Kafka showed that events were created in the Kafka topic, but from some point in the middle of the topic they are no longer consumed.
  • logs on the sensor pod show the start of a Kafka transaction but not the end of it.

Additional context

  • This does not happen instantly, and usually follows dozens of days during which the sensor works fine.
  • When investigating this, I verified that the event is being created, and I could see the message being added to the eventbus's topic.
  • This happens seemingly at random: some sensors fail after 2 weeks, some after a month, and some have yet to fail.

Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

AmitMendl, Dec 29 '23 18:12

After looking at the source code, I suspect that this is caused by the producer's Errors() channel not being read. I created a pull request that I believe fixes this: https://github.com/argoproj/argo-events/pull/2959

AmitMendl, Dec 29 '23 19:12

We have also experienced this behavior. Sensors stop writing logs and no longer process events from Kafka, despite the events being created properly there (there is an offset lag).

We run on OpenShift, and this happens on both OCP 4.10 and 4.12 (Kubernetes versions 1.23.5 and 1.25.14).

James-Derune, Dec 31 '23 09:12

This issue has been automatically marked as stale because it has not had any activity in the last 60 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot], Mar 01 '24 02:03