autoscaler
Add metrics for tracking live servers
Problem
Occasionally we've noticed that the autoscaler knows about instances that the VM provider (e.g. GCP or AWS) doesn't, or vice versa, probably due to latent bugs in the autoscaler. I'd like to expose the servers the autoscaler knows about so that we can correlate both sides and alert when they get out of sync. We should then be able to look in the logs and hopefully identify something that helps get the bug(s) fixed. Currently we might not notice until some time later, and by then it's hard to know where to drill down.
Proposal
Add a new metric, drone_server_known_instance, with a label for the instance name.
This should allow us to correlate the servers that drone thinks it knows about with those that GCP has.
Technically this has unbounded cardinality, but I think it's okay in practice: the series for a server is deleted when that server goes away, and we'll really only be running instant ("give me the values now") queries against it.
Would be interested in people's thoughts on this.
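To make the cardinality argument concrete, here is a minimal stdlib-only sketch of the lifecycle described above. It is not the code in this PR: `knownInstances`, `Add`, `Remove` and `Expose` are hypothetical names, and a real implementation would presumably use a labelled gauge from a Prometheus client library instead. The point is only that the number of exposed series tracks the number of live servers.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// knownInstances is a hypothetical stand-in for a labelled gauge:
// one series per live server, dropped when the server goes away.
type knownInstances struct {
	mu    sync.Mutex
	names map[string]struct{}
}

func newKnownInstances() *knownInstances {
	return &knownInstances{names: make(map[string]struct{})}
}

// Add records a server, which creates its series.
func (k *knownInstances) Add(name string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.names[name] = struct{}{}
}

// Remove deletes the series for a server that has gone away.
func (k *knownInstances) Remove(name string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	delete(k.names, name)
}

// Expose renders the current series in a Prometheus-style text
// format, sorted by name so the output is deterministic.
func (k *knownInstances) Expose() string {
	k.mu.Lock()
	defer k.mu.Unlock()
	names := make([]string, 0, len(k.names))
	for n := range k.names {
		names = append(names, n)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, n := range names {
		fmt.Fprintf(&b, "drone_server_known_instance{name=%q} 1\n", n)
	}
	return b.String()
}

func main() {
	k := newKnownInstances()
	k.Add("agent-a")
	k.Add("agent-b")
	k.Remove("agent-a") // agent-a's series disappears with the server
	fmt.Print(k.Expose())
}
```

After the `Remove` call only the series for `agent-b` is left, so an instant query sees exactly the servers that are currently alive.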
Alternatives
drone_server_count
There is already drone_server_count, which could gain some labels.
- I'm not sure whether that counts as an "API break": could it make existing queries stop working? I don't actually know the answer to that one.
- Would cardinality be more of a problem there? Is that a relevant concern?
Doing something with logs
For completeness: we could make sure to log nice clear messages when servers turn up and leave.
- It'll be annoying to correlate if done manually
- Unless we use a recording rule somehow
- Which then has the same problems as the others, so why not use a metric directly?
I'll fix up the test failures in a few days when I get a chance. I'd still appreciate feedback on the approach 🙂
tests are fixed!
@eoinmccafee00 I think that is about a different issue, can you re-check please?
Apologies, yeah, I closed the wrong ticket.
Hey @iainlane
Can you wrap a feature flag around this, please? I'd rather not have it enabled by default for now.
Cheers, Eoin
@eoinmcafee00 okay, re-pushed
I checked that this works as expected. With the flag set, we get metrics like:
# HELP drone_server_known_instance Known server instances.
# TYPE drone_server_known_instance gauge
drone_server_known_instance{name="drone-linux-amd64-SeNtfUdr",provider="google",region="us-central1-a",size="e2-standard-2"} 1
I should be able to join that with the instance lists from the CSPs, and we'll then be able to see if one side knows about an instance the other doesn't.
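The correlation itself boils down to a set difference in each direction. A rough sketch, assuming both sides have already been reduced to lists of instance names (`diffInstances` and the example names are hypothetical; in practice the join would more likely happen in the monitoring system than in Go):

```go
package main

import (
	"fmt"
	"sort"
)

// diffInstances compares the instance names the autoscaler knows about
// with those the provider reports. It returns the names present on only
// one side, each list sorted and de-duplicated.
func diffInstances(autoscaler, provider []string) (onlyAutoscaler, onlyProvider []string) {
	inAutoscaler := make(map[string]bool, len(autoscaler))
	for _, n := range autoscaler {
		inAutoscaler[n] = true
	}
	inProvider := make(map[string]bool, len(provider))
	for _, n := range provider {
		inProvider[n] = true
	}
	// Names the autoscaler tracks but the provider does not report.
	for n := range inAutoscaler {
		if !inProvider[n] {
			onlyAutoscaler = append(onlyAutoscaler, n)
		}
	}
	// Names the provider reports but the autoscaler does not track.
	for n := range inProvider {
		if !inAutoscaler[n] {
			onlyProvider = append(onlyProvider, n)
		}
	}
	sort.Strings(onlyAutoscaler)
	sort.Strings(onlyProvider)
	return onlyAutoscaler, onlyProvider
}

func main() {
	// Example names are made up for illustration.
	fromMetric := []string{"agent-a", "agent-b"}
	fromProvider := []string{"agent-b", "agent-c"}
	a, p := diffInstances(fromMetric, fromProvider)
	fmt.Println("only autoscaler:", a)
	fmt.Println("only provider:", p)
}
```

Either list being non-empty is the out-of-sync condition that an alert would fire on.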