re-add the metric `node_join_duration`

Open LittleFox94 opened this issue 2 years ago • 6 comments

Hi,

The metric `node_join_duration` was added in #49 and removed in #270, together with a bunch of other metrics, as "broken histograms". While this looks true for `controller_operation_duration` (as discussed in #271), `node_join_duration` still looks like a useful metric to have.

I'll happily create the PR to re-add it if you find it useful, too. I would add it as a histogram with the same labels as `machine_controller_machines` (so `kubelet_version`, `os` and `provider`). I added a similar, Anexia-specific metric in my local dev version but would rather upstream a more generic feature :)

LittleFox94 avatar Jun 21 '22 13:06 LittleFox94

@LittleFox94 thanks for the suggestion, how are you using that metric?

kron4eg avatar Jun 22 '22 10:06 kron4eg

We currently have the metrics `machine_controller_anexia_vm_provisioning_duration_seconds{location,template}` and `machine_controller_anexia_vm_deprovisioning_duration_seconds{location}`, both histograms. We use them to track how long it takes from starting to provision a new machine until it is usable, and how long it takes from starting to deprovision a machine until it is actually gone.

Creating (and sometimes destroying) VMs can take a variable amount of time. We want to make sure we can react if it takes unexpectedly long, as that might point to some kind of problem.

I'm pretty sure it's not only relevant for us: think of EC2 Spot instances with a configured max instance price that's below the current Spot price. I'm not sure whether Spot instances are already supported by machine-controller, but this would be another use.

We basically want an alert on new machines taking a long time to join as a Node.

LittleFox94 avatar Jun 24 '22 10:06 LittleFox94

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubermatic-bot avatar Sep 22 '22 19:09 kubermatic-bot

@kron4eg did this answer your question? Do you think this is useful to have in machine-controller?

LittleFox94 avatar Sep 23 '22 10:09 LittleFox94

@LittleFox94 sorry for not replying earlier, I've completely lost track of this issue :shrug:

I don't see any reason against new metrics if they help improve observability!

kron4eg avatar Sep 23 '22 12:09 kron4eg

Alright, I'll schedule time in our team to prepare a nice PR :)

LittleFox94 avatar Sep 23 '22 13:09 LittleFox94