machine-controller
re-add the metric `node_join_duration`
Hi,
The metric node_join_duration was added in #49 and removed in #270, together with a number of other metrics described as "broken histograms". While that seems accurate for controller_operation_duration (as discussed in #271), node_join_duration still looks like a useful metric to have.
I'll happily create the PR to re-add it if you find it useful, too. I would add it as a histogram with the same labels as machine_controller_machines (kubelet_version, os and provider). I added a similar, Anexia-specific metric in my local dev version but would rather upstream a more generic feature :)
@LittleFox94 thanks for the suggestion, how are you using that metric?
We currently have the metrics machine_controller_anexia_vm_provisioning_duration_seconds{location,template} and machine_controller_anexia_vm_deprovisioning_duration_seconds{location}. Both are histograms. We use them to track how long it takes from starting to provision a new machine until it is usable, and how long it takes from starting to deprovision a machine until it is actually gone.
Creating (and sometimes destroying) VMs can take a variable amount of time. We want to be able to react if it takes unexpectedly long, as that might point to some kind of problem.
I'm pretty sure it's not only relevant for us: think of EC2 Spot instances configured with a max instance price that's below the current Spot price. I'm not sure whether machine-controller already supports Spot instances, but this would be another use.
We basically want an alert on new machines taking a long time to join as a Node.
Issues go stale after 90d of inactivity.
After a further 30 days, they will turn rotten.
Mark the issue as fresh with /remove-lifecycle stale.
If this issue is safe to close now please do so with /close.
/lifecycle stale
@kron4eg did this answer your question? Do you think this is useful to have in machine-controller?
@LittleFox94 sorry for not replying earlier, I've completely lost track of this issue :shrug:
I don't see any reason against new metrics if they improve observability!
Alright, I'll schedule time in our team to prepare a nice PR :)