kured
Add new kured metrics
Add metrics:
- kured_lock_annotation
- kured_lock_held
- kured_node_draining
- kured_reboot_blocked
- kured_reboot_window_active
This also reduces the interval at which metrics are collected from 60s to 15s (a 4x increase in collection frequency). This reduces the chance of losing metrics when the collection frequency and the Prometheus scrape frequency are mismatched.
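As a sketch of how these gauges could be consumed once exported (assuming each is a 0/1 gauge, and that `kured_reboot_blocked` and `kured_node_draining` carry a `node` label; neither shape is spelled out in this PR description), Prometheus alerting rules might look like:

```yaml
# Hypothetical alerting rules built on the proposed metrics.
# Metric shapes (0/1 gauges with a node label) are assumptions,
# not something this PR confirms.
groups:
  - name: kured
    rules:
      - alert: KuredRebootBlockedTooLong
        # A node has wanted to reboot but has been blocked for over an hour.
        expr: kured_reboot_blocked == 1
        for: 1h
      - alert: KuredNodeDrainingStuck
        # Drains should complete quickly; flag any drain lasting over 30 minutes.
        expr: kured_node_draining == 1
        for: 30m
```

With a 15s collection interval and a typical 30s scrape interval, each state change should be visible to at least one scrape, which is the mismatch the description refers to.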
Good idea. Let's talk about it in the next community meeting.
This would solve some of my concerns in #1156. How does this approach handle conflicting values for the lock_annotation? Won't the potentially different update interval of each pod result in lock_annotation.node being reported differently by each pod? e.g. Node1 takes the lock -> all pods scrape -> Node 2 takes the lock -> first pod scrapes, but all other pods wait 2 secs for the next scrape -> Prometheus metric is inconsistent?
@evrardjp what would be needed to get this merged?
@localleon thanks for the reminder. Regarding the potential to have inconsistent metrics reported, I did consider that but I think the current implementation provides visibility for kured pods that aren't correctly tracking the lock holder. That should ideally never happen so I'm not convinced that this is the correct implementation. Absolutely open to alternatives if you have strong feelings about it.
@evrardjp happy to discuss in the next community meeting or an adhoc call. We can coordinate in the CNCF slack channel on it if you'd like.
@evrardjp hope you're doing well! Is there anything I could do so we could get this feature merged?
I am in the middle of the big rewrite for v2 (you can see the first steps in #1000). V2 is also documented in our channel, but I guess I should create a project in GitHub ...
Anyway, long story short, most of those metrics won't make sense anymore. If you're looking into observability, v2 will bring other methods to see what's going on.
Let's go through each metric:
- kured_lock_annotation --> We won't need locks anymore: There will be a shared queue or a lease system between rebooters
- kured_lock_held --> Same comment
- kured_node_draining --> You will be able to observe this through node conditions
- kured_reboot_blocked --> You will be able to observe this through node conditions
- kured_reboot_window_active --> This is something I need to figure out with you. Would you rather have something on the node that says "reboot window is active for this node" (condition/metric/..) or would you prefer that maintenances have their own CRD, which report such status? This was not decided in the v2 document.
Any opinion on this @localleon @colinmcintosh ?
Thanks @evrardjp for the detailed reply! I assumed PR #1000 was inactive and did not realize that there was a v2 in the works for this project! A GitHub project board would probably be a good idea if you are looking for contributions!
The metrics look good! About kured_reboot_window_active:
- I currently like the simplicity of kured and that it's not using CRDs. If possible, I would keep it this way!
- A metric on each node that says a reboot window is active for that node would be great! From my current understanding we can only configure reboot windows for all nodes at the same time. There are, I think, two options:
  - Indicate per node with a 0/1 metric whether it's currently able to accept a reboot command
  - Make a global 0/1 metric that indicates whether all nodes are currently rebootable (this would make more sense, since we do not have individual reboot windows)
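The two options above differ only in whether the gauge carries a node label. In the Prometheus exposition format the choice would look roughly like this (the metric name is taken from the proposal; the `node` label name is an assumption):

```
# Option 1: per-node gauge, one series per node
# HELP kured_reboot_window_active Whether this node can currently accept a reboot command (0/1).
# TYPE kured_reboot_window_active gauge
kured_reboot_window_active{node="node-1"} 1
kured_reboot_window_active{node="node-2"} 1

# Option 2: single global gauge, one series for the whole cluster
kured_reboot_window_active 1
```

The per-node form would keep the door open for per-node reboot windows later without breaking existing queries, while the global form matches today's single shared window.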