Add new kured metrics

Open colinmcintosh opened this issue 4 months ago • 7 comments

Add metrics:

  • kured_lock_annotation
  • kured_lock_held
  • kured_node_draining
  • kured_reboot_blocked
  • kured_reboot_window_active

This also reduces the interval at which metrics are collected from 60s to 15s (a 4x increase in collection frequency). This lowers the chance of missing a state change when the metrics collection interval and the metrics scrape interval do not line up.
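A rough simulation of the interval argument, for illustration only. This is not kured code: `observed_values`, the `draining` timeline, and all timings are invented for the example. It assumes the exporter refreshes a gauge on each collection tick and Prometheus reads whatever value was set last.

```python
def observed_values(state_at, collect_every, scrape_every, horizon):
    """Return the gauge values a scraper would see at each scrape time.

    state_at      -- function mapping a time (seconds) to the true state (0 or 1)
    collect_every -- seconds between gauge refreshes inside the exporter
    scrape_every  -- seconds between scrapes
    horizon       -- simulate scrapes in the range [0, horizon)
    """
    seen = []
    for t in range(0, horizon, scrape_every):
        # The gauge holds the state read at the last collection tick at or before t.
        last_collect = (t // collect_every) * collect_every
        seen.append(state_at(last_collect))
    return seen


def draining(t):
    """Hypothetical timeline: the node drains between t=70s and t=100s."""
    return 1 if 70 <= t < 100 else 0


slow = observed_values(draining, collect_every=60, scrape_every=30, horizon=180)
fast = observed_values(draining, collect_every=15, scrape_every=30, horizon=180)
print(slow)  # no collection tick falls inside the drain, so it is never reported
print(fast)  # the tick at t=90 falls inside the drain and the scrape at t=90 sees it
```

With these particular timings the 60s collector never reports the drain, while the 15s collector does. Other offsets give other results; a shorter interval reduces the chance of a miss but does not remove it.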

colinmcintosh avatar Jul 24 '25 04:07 colinmcintosh

Good idea. Let's talk about it in the next community meeting.

evrardjp avatar Aug 01 '25 05:08 evrardjp

This would solve some of my concerns in #1156. How does this approach handle conflicting values for the lock_annotation? Won't the potentially different update interval of each pod result in lock_annotation.node being reported differently by different pods? E.g. Node 1 takes the lock -> all pods scrape -> Node 2 takes the lock -> first pod scrapes, but all other pods wait 2 secs for the next scrape -> Prometheus metric is inconsistent?
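The scenario can be sketched as a small simulation. This is not kured code: the pod names, the collection offsets, and the `holder_at` timeline are all made up for illustration. It assumes each pod refreshes its view of the lock holder on its own 15s schedule.

```python
def holder_at(t):
    """Hypothetical lock timeline: node1 holds the lock until t=20s, then node2."""
    return "node1" if t < 20 else "node2"


def reported_holder(t, collect_every, offset):
    """Lock holder a pod exposes at time t: whatever it read at its last collection tick."""
    if t < offset:
        return None  # the pod has not collected anything yet
    last_collect = offset + ((t - offset) // collect_every) * collect_every
    return holder_at(last_collect)


# Each pod starts its collection loop at a different offset (seconds).
pods = {"pod-a": 0, "pod-b": 5, "pod-c": 10}

# Scrape shortly after the lock changed hands: only pod-b has refreshed since t=20.
snapshot = {pod: reported_holder(22, 15, off) for pod, off in pods.items()}

# Scrape again once every pod has had a collection tick after the handover.
later = {pod: reported_holder(40, 15, off) for pod, off in pods.items()}

print(snapshot)
print(later)
```

In this sketch the pods disagree only until each has passed one collection tick after the handover, so the window of inconsistency is bounded by the collection interval.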

localleon avatar Aug 25 '25 12:08 localleon

@evrardjp what would be needed to get this merged?

localleon avatar Sep 08 '25 11:09 localleon

@localleon thanks for the reminder. Regarding the potential to have inconsistent metrics reported, I did consider that but I think the current implementation provides visibility for kured pods that aren't correctly tracking the lock holder. That should ideally never happen so I'm not convinced that this is the correct implementation. Absolutely open to alternatives if you have strong feelings about it.

@evrardjp happy to discuss in the next community meeting or an ad hoc call. We can coordinate in the CNCF Slack channel if you'd like.

colinmcintosh avatar Sep 08 '25 11:09 colinmcintosh

@evrardjp hope you're doing well! Is there anything I could do to help get this feature merged?

localleon avatar Oct 16 '25 14:10 localleon

I am in the middle of the big rewrite for v2 (you can see the first steps in #1000). V2 is also documented in our channel, but I guess I should set up a project in GitHub ...

Anyway, long story short, most of those metrics won't make sense anymore. If you're looking into observability, v2 will bring other methods to see what's going on.

Let's go through each metric:

  • kured_lock_annotation --> We won't need locks anymore: There will be a shared queue or a lease system between rebooters
  • kured_lock_held --> Same comment
  • kured_node_draining --> You will be able to observe this through node conditions
  • kured_reboot_blocked --> You will be able to observe this through node conditions
  • kured_reboot_window_active --> This is something I need to figure out with you. Would you rather have something on the node that says "reboot window is active for this node" (condition/metric/..) or would you prefer that maintenances have their own CRD, which report such status? This was not decided in the v2 document.

Any opinion on this @localleon @colinmcintosh ?

evrardjp avatar Oct 16 '25 21:10 evrardjp

Thanks @evrardjp for the detailed reply! I assumed PR #1000 was inactive and did not realize that there was a v2 in the works for this project! A GitHub Project board would probably be a good idea if you are looking for contributions!

The metrics look good! About kured_reboot_window_active:

  • I currently like the simplicity of kured and that it's not using CRDs. If possible, I would keep it this way!
  • A metric on each node that says the reboot window is active for this node would be great! From my current understanding, we can only configure reboot windows for all nodes at the same time. I think there are two options:
    • Indicate per node with a 0/1 metric whether it is currently able to accept a reboot command
    • Make a global 0/1 metric that indicates whether all nodes are currently rebootable (this would make more sense since we do not have individual reboot windows)
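
A minimal sketch of how the two options relate, assuming a single cluster-wide window defined by a start time, an end time, and a set of allowed weekdays. The function `reboot_window_active`, the node names, and the window values are hypothetical and do not reflect kured's actual implementation.

```python
from datetime import datetime, time


def reboot_window_active(now, start, end, days):
    """Return 1 if `now` falls inside the reboot window, else 0.

    days holds allowed weekdays as integers (Monday=0 ... Sunday=6).
    Simplification: the weekday check always uses the calendar day of `now`,
    even for windows that cross midnight.
    """
    if now.weekday() not in days:
        return 0
    current = now.time()
    if start <= end:
        inside = start <= current < end
    else:
        # Window crosses midnight, e.g. 22:00 to 04:00.
        inside = current >= start or current < end
    return 1 if inside else 0


# Hypothetical cluster-wide window: Monday to Friday, 09:00 to 17:00.
window = {"start": time(9, 0), "end": time(17, 0), "days": {0, 1, 2, 3, 4}}
now = datetime(2025, 10, 17, 12, 0)  # a Friday

# Option 2: one global 0/1 value.
global_value = reboot_window_active(now, **window)

# Option 1: one 0/1 value per node. With a single shared window,
# every node ends up reporting the same value as the global metric.
per_node = {node: reboot_window_active(now, **window) for node in ("node1", "node2")}

print(global_value, per_node)
```

As long as the window is shared, the per-node values carry no more information than the global one; they would only diverge if windows became configurable per node.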

localleon avatar Oct 17 '25 12:10 localleon