Inform user of dropped events
Problem
Whenever events are being dropped by the event mesh, we should inform the end-user about that. Just logging is not enough for such a case. IMHO, the best option is to create a dedicated warning-level Kubernetes event, which is designed for such a case: https://www.cncf.io/blog/2023/03/13/how-to-use-kubernetes-events-for-effective-alerting-and-monitoring/
Background
With the recent addition of the EventTransform API, the chance of misconfiguration by the end-user increases greatly, as they could easily create infinite loops by transforming the source event in a way that matches the original trigger again. The TTL mechanism should eventually break that loop, but in that case, we should inform the end-user about it.
Persona: System Operator, Developer
Exit Criteria
The end-user can easily identify that the Event Mesh configuration is invalid and that some messages are getting dropped. The best would be to allow the use of well-known K8s monitoring tools; using K8s events should be adequate.
Time Estimate (optional): 5d (events are dropped in a number of places across the codebase)
Additional context (optional)
My proposal to solve this is to reconcile a Kubernetes event whenever such a situation occurs. Such an event may look like:
```yaml
apiVersion: events.k8s.io/v1
kind: Event
eventTime: 2025-04-23T09:09:54Z
metadata:
  namespace: user-evening-bits-namespace
  name: knative-eventing-mt-broker.1838e7822e31b835
  labels:
    eventing.knative.dev/event-type: my-event-type
    eventing.knative.dev/event-source: my-event-source
regarding:
  apiVersion: eventing.knative.dev/v1
  kind: Broker
  name: default
  namespace: user-evening-bits-namespace
type: Warning
action: event-dropped
reason: EventLoop
note: Event of type "my-event-type" and source "my-event-source" has reached internal TTL, which most likely signals an event loop. The event was dropped.
series:
  count: 351
  lastObservedTime: 2025-04-23T09:09:54Z
```
Notice the series.count. It should be bumped whenever the "same event" occurs again. In this case, the reconciler should match the K8s events using the metadata.labels eventing.knative.dev/event-type and eventing.knative.dev/event-source, and bump the series.count when the next message is dropped.
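A minimal sketch of how that reconciler step could look, assuming client-go's events.k8s.io/v1 client; the helper name recordDrop and its exact arguments are illustrative, not existing Knative code:

```go
import (
	"context"
	"fmt"
	"time"

	eventsv1 "k8s.io/api/events/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// recordDrop bumps series.count on an existing Event that matches the labels,
// or falls through to creating a new Warning Event.
func recordDrop(ctx context.Context, kc kubernetes.Interface, ns, eventType, eventSource string) error {
	selector := fmt.Sprintf(
		"eventing.knative.dev/event-type=%s,eventing.knative.dev/event-source=%s",
		eventType, eventSource)
	list, err := kc.EventsV1().Events(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	if len(list.Items) > 0 {
		ev := list.Items[0]
		if ev.Series == nil {
			ev.Series = &eventsv1.EventSeries{}
		}
		ev.Series.Count++
		ev.Series.LastObservedTime = metav1.MicroTime{Time: time.Now()}
		_, err = kc.EventsV1().Events(ns).Update(ctx, &ev, metav1.UpdateOptions{})
		return err
	}
	// No matching Event yet: create a new Warning Event shaped like the YAML above.
	return nil
}
```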
Hi, can I take up this issue if the feature request is confirmed?
@matzew confirmed this by adding "good first issue" label, so I'm triaging it too.
/triage accepted
@EraKin575 Feel free to provide a PR. You don't need to reserve the issue to work on it. You could create a WIP (draft) pull request that Fixes ### this issue, so everyone will see you're working on it. Furthermore, you may receive some early comments that way, and that's always helpful!
Thanks @cardil. I will raise a draft PR for this.
@cardil is this feature still required, or has it been settled by now?
@jijo-OO7 it is still relevant, yes.
Thank you for clarifying, I will look into this.
Hi @evankanderson, if you get some time, could you please review the following approach and let me know if it looks good to proceed with? Thanks!
Solution
Emit Kubernetes Warning events whenever events are dropped, making issues discoverable via standard Kubernetes tooling (kubectl describe, monitoring systems).
Implementation Strategy
Phase 1: Instrumentation
- Audit all event drop points in the codebase
- Create a centralized RecordEventDropped(eventType, eventSource, reason, namespace, brokerName) abstraction
Phase 2: Reconciliation with Deduplication
- In-memory buffer aggregates drops by unique (namespace, broker, event-type, event-source, reason) key
- Periodic reconciliation (every ~30s) batches writes to avoid API overload
- Match existing events by labels; increment series.count on match, create a new event otherwise
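A minimal sketch of what Phases 1-2 could look like, assuming an in-memory buffer keyed by the tuple above; all names (RecordEventDropped, dropKey, Recorder) are illustrative, not existing Knative code:

```go
package dropreport

import (
	"sync"
	"time"
)

// dropKey is the deduplication key for aggregating dropped events.
type dropKey struct {
	Namespace, Broker, EventType, EventSource, Reason string
}

// Recorder buffers drop signals between reconciliation ticks.
type Recorder struct {
	mu     sync.Mutex
	counts map[dropKey]int64
}

func NewRecorder() *Recorder {
	return &Recorder{counts: map[dropKey]int64{}}
}

// RecordEventDropped is the centralized entry point called from every drop site.
func (r *Recorder) RecordEventDropped(eventType, eventSource, reason, namespace, brokerName string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[dropKey{namespace, brokerName, eventType, eventSource, reason}]++
}

// Run flushes the buffer every ~30s; flush would create or update the
// corresponding events.k8s.io/v1 Event objects (see the earlier sketch).
func (r *Recorder) Run(stop <-chan struct{}, flush func(map[dropKey]int64)) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			r.mu.Lock()
			batch := r.counts
			r.counts = map[dropKey]int64{}
			r.mu.Unlock()
			if len(batch) > 0 {
				flush(batch)
			}
		case <-stop:
			return
		}
	}
}
```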
Phase 3: Scalability Protection
- Non-blocking signal recording (async to event processing)
- Bounded buffer with overflow handling (circuit breaker at 10k entries)
- Batch reconciliation instead of event-per-drop to minimize API calls
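A rough sketch of the non-blocking, bounded recording path, assuming the 10k-entry cap mentioned above; dropSignal and recordNonBlocking are hypothetical names:

```go
// dropSignal carries just enough context to attribute the drop later.
type dropSignal struct {
	Namespace, Trigger, Reason string
}

// Bounded buffer: the assumed 10k cap acts as a crude circuit breaker.
var dropCh = make(chan dropSignal, 10000)

// recordNonBlocking never blocks the data path; if the buffer is full,
// the signal is discarded (ideally counted via an overflow metric).
func recordNonBlocking(s dropSignal) {
	select {
	case dropCh <- s:
	default:
		// Buffer full: prefer losing the report over back-pressuring delivery.
	}
}
```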
Expected Outcome
Users can run kubectl describe broker default and see warning events like:
```
Type     Reason         Age  Message
Warning  event-dropped  2m   Event of type "my-event" and source "my-source" reached TTL. Series count: 351
```
If you are doing phase 1, you should probably look at ensuring that OpenTelemetry instrumentation is present during "RecordEventDropped", and possibly tie in there. Right now, you don't seem to have the Trigger name included in the RecordEventDropped, but that seems like it's probably the most important part for your recording, as a Broker with 5 Triggers might have events flowing to 4 of the downstream sinks just fine, and then all deliveries for one Trigger failing. You'll want to be thoughtful about the telemetry:
- Metrics support high volume of reporting through summation. This means that it is expensive to record high-cardinality events (those with a wide number of varying labels). I'd probably record a subset of the labels you have mentioned in the EventDropped pseudo-function, particularly if eventSource, eventType, and reason could possibly vary on every single event in some cases.
- Traces support high volumes of reporting through selective filtering. This means that traces can contain high cardinality of labels and (timed) activity records, but that only e.g. 1 in 1000 events might be selected for recording. For traces, I'd record all of the data, and possibly even see if we record a trace on each delivery attempt, along with a separate trace record for the abandoned discard.
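As a rough illustration of the low-cardinality metric side of this advice, here is a sketch using the OpenTelemetry Go API; the instrument name and attribute set are assumptions, not the instrumentation Knative Eventing actually ships:

```go
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var (
	meter = otel.Meter("knative.dev/eventing")

	// Counter with a deliberately small label set: namespace, trigger, reason.
	// eventType/eventSource are left to traces to avoid cardinality blow-up.
	droppedEvents, _ = meter.Int64Counter(
		"eventing.events.dropped",
		metric.WithDescription("Events dropped after exhausting delivery/TTL"),
	)
)

func recordDroppedMetric(ctx context.Context, namespace, trigger, reason string) {
	droppedEvents.Add(ctx, 1, metric.WithAttributes(
		attribute.String("namespace", namespace),
		attribute.String("trigger", trigger),
		attribute.String("reason", reason),
	))
}
```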
In terms of phase 2 reporting to the Kubernetes control plane, I would probably aggregate the events by namespace and trigger, rather than by broker, event-type or reason. Additionally, I'd probably try to limit the periodic reconciliation to a single event per Trigger, plus a possible additional reconciliation for the Broker which includes the names of all the Triggers which had updated failure counts.
Other than those two points, I think your plan looks reasonable. One alternate (not better, just possibly not-considered) plan might look like:
Phase 1: K8s-Event-Reporting sink
Write a pod which receives events from a Knative Trigger as a deadLetterSink. The pod should aggregate events by knativeerrordest (a CloudEvents string attribute) and possibly other headers (you might also need to augment the Eventing code, if you can't get broker name and trigger name from the dead-letter event). When the pod receives an event for a particular Broker or Trigger, it should report the first event over a 30s or 1 minute window, and then aggregate subsequent events for reporting after the 30s window is over (as you describe in phase 2).
Note that the pod receiving the events will need permissions to read and write to the apiserver assigned to its service account. The advantage here is that development should be fairly rapid (green-field), and similarly testing should be easy to iterate on, since the work is in a single pod in a single development language.
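A minimal sketch of such a reporting pod, assuming the CloudEvents Go SDK and the knativeerrordest extension on dead-lettered events; the 30s window and the actual Event write (elided here) are illustrative:

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	cloudevents "github.com/cloudevents/sdk-go/v2"
)

// aggregator counts dead-lettered events per destination within a window.
type aggregator struct {
	mu     sync.Mutex
	counts map[string]int64 // keyed by the knativeerrordest extension
}

func (a *aggregator) record(dest string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.counts[dest]++
}

func (a *aggregator) flush() map[string]int64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := a.counts
	a.counts = map[string]int64{}
	return out
}

func main() {
	agg := &aggregator{counts: map[string]int64{}}

	// Every 30s, report the aggregated counts. Writing the actual
	// events.k8s.io/v1 Event objects via client-go is omitted from this sketch.
	go func() {
		for range time.Tick(30 * time.Second) {
			for dest, n := range agg.flush() {
				log.Printf("would create/update Warning Event: dest=%s dropped=%d", dest, n)
			}
		}
	}()

	c, err := cloudevents.NewClientHTTP()
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(c.StartReceiver(context.Background(), func(e cloudevents.Event) {
		dest, _ := e.Extensions()["knativeerrordest"].(string)
		agg.record(dest)
	}))
}
```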
Phase 2: Cluster Default Delivery Spec
Using delivery spec defaults, users could route failed deliveries to the k8s-event-reporting-sink. This should have a similar effect to having each event-sending pod write to the k8s API, but with (hopefully) less scaling requirements on the reporting pod.
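For illustration, a cluster-wide default delivery spec in the config-br-defaults ConfigMap could look roughly like the following; the sink Service name k8s-event-reporting-sink is hypothetical:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-br-defaults
  namespace: knative-eventing
data:
  default-br-config: |
    clusterDefault:
      brokerClass: MTChannelBasedBroker
      delivery:
        deadLetterSink:
          ref:
            apiVersion: v1
            kind: Service
            name: k8s-event-reporting-sink
            namespace: knative-eventing
        retry: 3
        backoffPolicy: exponential
```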
Phase 3: Ship k8s-event-reporting sink as a default component and configuration
Once phase 2 testing has succeeded, we could change the eventing defaults to ship the k8s-event-reporting Pod/Deployment by default in Knative Eventing.
Every plan has tradeoffs; the ones I see are:
- Learning existing Knative Eventing code: benefit original plan. This plan allows you to mostly write new code using the public interfaces, rather than needing to deeply understand the eventing internals.
- Separating failure domains & permissions: benefit this plan. In the original, the event senders handle multiple retries and also interact with the more-fragile k8s control plane.
- Consistent event metrics: benefit original plan. By requiring "RecordEventDropped" in all the implementations (and data-plane languages), you get a chance to go through and ensure that metrics and traces are updated.
- Example usage of features: benefit this plan. This plan uses the dead-letter and defaults features of Brokers and Channels, which serves as a reference or template for end-user usage of these features.
@Cali0707, @matzew, and @creydr may also want to weigh in here.
@jijo-OO7 so this is a new issue?
Not a new issue, we'll be working on this one. Evan has already shared a roadmap, so I'm reviewing Phase 1 right now and getting aligned before moving ahead. Your ideas are welcome too if you'd like to add anything.
@jijo-OO7 A new sink component that receives dead-lettered CloudEvents from Knative Triggers, aggregates them by failure destination over 30-second windows, and reports them as Kubernetes Warning Events with series.count for deduplication. This lets operators monitor delivery failures via kubectl event tooling. Currently, when the TTL is exhausted, the event is dropped directly with a log message but no notification mechanism. This is where the K8s report approach would be valuable, instead of just logging.
In my consideration, @jijo-OO7 needs a K8s Event Reporter sink that receives dead-letter events from Triggers, detects TTL exhaustion, and creates Kubernetes Warning events visible via standard tools.
@jijo-OO7 @evankanderson @cardil I have implemented my above consideration, fixed the issue, and tested it. Should I open a PR for this?
Sounds good, @namanONcode. If everything is implemented and tested on your side, you can go ahead and open a PR. The maintainers will review it once it's up. Thanks for working on this!
https://github.com/knative/eventing/pull/8817 @cardil @evankanderson @jijo-OO7 @pierDipi I opened this PR for this issue, please review.
I'm beginning work on Phase 1, focusing on instrumentation and introducing a centralized RecordEventDropped entry point based on the earlier discussion. Will share updates as things progress.
@cardil @evankanderson @jijo-OO7 @pierDipi I have addressed this in PR https://github.com/knative/eventing/pull/8817, please review so we can continue.