Emit DisruptionBlocked events on affected pod or pdb resource

Open cnmcavoy opened this issue 9 months ago • 12 comments

Description

What problem are you trying to solve? As a cluster admin, I use Karpenter's events to understand and triage cases where disruption is occurring less frequently than expected. Karpenter emits DisruptionBlocked events when a node cannot be disrupted, and if the cause is a pod (with a do-not-disrupt annotation) or a PDB, the resource and its namespace are in the event message: https://github.com/kubernetes-sigs/karpenter/blob/c0e7299834ad615263172c7593049e64ea521cf1/pkg/controllers/disruption/events/events.go#L95-L115

Because the InvolvedObject is the node + nodeclaim, the DisruptionBlocked events always end up in the default namespace, rather than in the user's namespace alongside the pod or PDB. This means that tools have to parse the message in order to extract the namespace from the event, which is burdensome (and really hurts the ability to index these events in tools like Datadog). Either emitting these events on the affected resource, or emitting a second duplicate event on the affected resource, would satisfy our use case.
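
To make the namespace problem and the "duplicate event" option concrete, here is a minimal sketch in Go. It does not use Karpenter's or client-go's real types: `ObjectRef`, `Event`, `eventNamespace`, and `blockedEvents` are hypothetical stand-ins, and the message format is invented for illustration. The fallback to the `default` namespace for a reference with no namespace is how I understand the client-go recorder to behave, so treat that as an assumption as well.

```go
package main

import "fmt"

// ObjectRef is a simplified, hypothetical stand-in for a Kubernetes object reference.
type ObjectRef struct {
	Kind      string
	Namespace string // empty for cluster-scoped objects such as Node
	Name      string
}

// Event is a simplified, hypothetical stand-in for a Kubernetes event.
type Event struct {
	Namespace      string
	InvolvedObject ObjectRef
	Reason         string
	Message        string
}

// eventNamespace picks the namespace the event is stored in: the involved
// object's namespace, or "default" when the object is cluster-scoped.
func eventNamespace(ref ObjectRef) string {
	if ref.Namespace == "" {
		return "default"
	}
	return ref.Namespace
}

// newEvent builds a DisruptionBlocked event attached to ref.
func newEvent(ref ObjectRef, msg string) Event {
	return Event{
		Namespace:      eventNamespace(ref),
		InvolvedObject: ref,
		Reason:         "DisruptionBlocked",
		Message:        msg,
	}
}

// blockedEvents returns the existing node-scoped event plus a duplicate
// attached to the blocking pod or PDB, so it lands in the user's namespace.
func blockedEvents(node, blocker ObjectRef) []Event {
	msg := fmt.Sprintf("%s %q in namespace %q prevents disruption of node %q",
		blocker.Kind, blocker.Name, blocker.Namespace, node.Name)
	return []Event{newEvent(node, msg), newEvent(blocker, msg)}
}

func main() {
	node := ObjectRef{Kind: "Node", Name: "node-a"}
	pod := ObjectRef{Kind: "Pod", Namespace: "spark", Name: "worker-0"}
	for _, e := range blockedEvents(node, pod) {
		fmt.Printf("%s/%s -> namespace %s\n", e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Namespace)
	}
}
```

With the duplicate in place, a namespace-scoped dashboard could pick up the second event without parsing the message text.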

How important is this feature to you?

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

cnmcavoy avatar Feb 20 '25 22:02 cnmcavoy

Interesting -- what's the problem that you are running into here that causes you to need these events? Are users creating faulty PDBs and then you need to inform them that they are fully blocking the node from being disrupted? Same thing with karpenter.sh/do-not-disrupt?

jonathan-innis avatar Feb 21 '25 14:02 jonathan-innis

Right. We have Spark users that annotate their worker pods with the do-not-disrupt annotation (or, less commonly, misconfigure a PDB) and end up blocking disruption. We use Datadog (which integrates with Kubernetes events) and have dashboards set up for users so they can monitor their workload and namespace. Because the events are outside of the user's namespace, they are not visible to users in this kind of presentation.

cnmcavoy avatar Feb 24 '25 18:02 cnmcavoy

/assign @GnatorX

engedaam avatar Mar 03 '25 23:03 engedaam

Hey, I think this makes sense. However, I think we wouldn't change "Involved Object" but would set "Related" instead. Would that work? Involved object is used to directly call out the object involved, so it wouldn't make sense to send consolidation-blocked events to the pod or PDB.

https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/ https://www.pulumi.com/registry/packages/kubernetes/api-docs/core/v1/event/

GnatorX avatar Mar 03 '25 23:03 GnatorX

Interesting, didn't know this existed @GnatorX! That's pretty cool! Seems completely reasonable to me if we could get that there

jonathan-innis avatar Mar 04 '25 21:03 jonathan-innis

/triage accepted /priority important-longterm

jonathan-innis avatar Mar 04 '25 21:03 jonathan-innis

Ya! Initially I thought this was only supported in events/v1, but it seems it was already supported in core/v1 events, so we wouldn't need to upgrade Karpenter.

GnatorX avatar Mar 04 '25 22:03 GnatorX

How does that appear to the client? Are events with related objects visible in the describe output for a related resource?

I don't see a way to manage the related field on the EventRecorder, so it still might involve rewriting how events are generated in Karpenter.

cnmcavoy avatar Mar 04 '25 23:03 cnmcavoy

> How does that appear to the client? Are events with related objects visible in the describe output for a related resource?

That depends on how your clients consume the Event object, but it should just be a field within the struct: https://github.com/kubernetes/api/blob/v0.32.2/core/v1/types.go#L7088

> I don't see a way to manage the related field on the EventRecorder, so it still might involve rewriting how events are generated in Karpenter.

Not necessarily rewriting, but it will require some work to pipe the related field through.

GnatorX avatar Mar 05 '25 00:03 GnatorX

Just to add, EventRecorder indeed doesn't have related, but it is a pretty basic helper for emitting events. If you compare its interface with the spec of Event, you can see it's missing a lot. But it doesn't really matter, because Karpenter uses the Event object directly and doesn't rely on EventRecorder.

GnatorX avatar Mar 05 '25 03:03 GnatorX

> Just to add, EventRecorder indeed doesn't have related, but it is a pretty basic helper for emitting events. If you compare its interface with the spec of Event, you can see it's missing a lot. But it doesn't really matter, because Karpenter uses the Event object directly and doesn't rely on EventRecorder.

Incorrect...

Karpenter's event object is a custom internal struct, only loosely based on the Kubernetes event object.

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/events/recorder.go#L30-L38

I also took a stab at this, and in my testing the Related field is not visible to users and does not produce any breadcrumbs or any way to tie the event to the user's namespace or pod. So I am not sure it's an answer to this problem.

cnmcavoy avatar Apr 21 '25 23:04 cnmcavoy

Ah yes, you are right. In that case you would have to migrate from the EventRecorder in client-go/tools/record to the EventRecorder in client-go/tools/events: https://github.com/kubernetes/client-go/blob/master/tools/events/event_recorder.go#L45. But this would entail moving the event object from core/v1 to events/v1.

If you don't want to do that, then you might have to stop using EventRecorder
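
For a feel of what piping a related object through a recorder could look like, here is a small sketch. The `Recorder` interface copies the parameter order of `Eventf` in client-go/tools/events (regarding, related, event type, reason, action, note), but everything else is a local stand-in: `Object` replaces `runtime.Object`, `fakeRecorder` only records calls in memory, and `publishBlocked`, the `"Disrupt"` action, and the note text are hypothetical. It is not Karpenter's recorder.

```go
package main

import "fmt"

// Object is a hypothetical stand-in for runtime.Object.
type Object struct {
	Kind      string
	Namespace string
	Name      string
}

// Recorder mirrors the parameter order of the newer recorder's Eventf,
// with local types instead of the real client-go ones.
type Recorder interface {
	Eventf(regarding, related *Object, eventtype, reason, action, note string, args ...interface{})
}

// call captures one recorded event for inspection.
type call struct {
	Regarding *Object
	Related   *Object
	Reason    string
	Note      string
}

// fakeRecorder stores calls in memory instead of talking to an API server.
type fakeRecorder struct {
	calls []call
}

func (f *fakeRecorder) Eventf(regarding, related *Object, eventtype, reason, action, note string, args ...interface{}) {
	f.calls = append(f.calls, call{
		Regarding: regarding,
		Related:   related,
		Reason:    reason,
		Note:      fmt.Sprintf(note, args...),
	})
}

// publishBlocked emits a DisruptionBlocked event about the node and passes
// the blocking pod or PDB as the related object.
func publishBlocked(r Recorder, node, blocker *Object) {
	r.Eventf(node, blocker, "Normal", "DisruptionBlocked", "Disrupt",
		"%s %s/%s blocks disruption", blocker.Kind, blocker.Namespace, blocker.Name)
}

func main() {
	rec := &fakeRecorder{}
	publishBlocked(rec,
		&Object{Kind: "Node", Name: "node-a"},
		&Object{Kind: "Pod", Namespace: "spark", Name: "worker-0"},
	)
	for _, c := range rec.calls {
		fmt.Printf("regarding=%s related=%s/%s note=%q\n",
			c.Regarding.Name, c.Related.Namespace, c.Related.Name, c.Note)
	}
}
```

Whether a related object set this way becomes visible in a user's namespace is a separate question, as the testing reported earlier in the thread suggests.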

GnatorX avatar Apr 22 '25 01:04 GnatorX