
Add metrics for tracking live servers

Open iainlane opened this issue 2 years ago • 6 comments

Problem

Occasionally we've noticed that the autoscalers know about instances that the VM providers (e.g. GCP or AWS) don't, or vice versa, probably due to latent bugs in the autoscaler. I'd like to expose the servers from the autoscaler so that we can correlate both sides and alert when they get out of sync. Then we should be able to look in the logs and hopefully identify something that can help get the bug(s) fixed. Currently, we might not notice until a bit later, and by then it's hard to know where to drill down.

Proposal

Add a new metric drone_server_known_instance with a label for the instance name.

This should allow us to correlate the servers that drone thinks it knows about with those that GCP has.

Technically this has unbounded cardinality but I think it's okay in reality since the metric is deleted when a server goes away and we'll really only be doing instant ("give me the values now") queries on this.
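
To make the lifecycle concrete, here is a minimal sketch of "one series per instance, deleted when the server goes away". It uses only the Go standard library as a stand-in for a real labelled gauge, so it is not how the autoscaler or the Prometheus client implements this. The `knownInstances` type and the instance names are hypothetical, and only the `name` label is modelled.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// knownInstances is a hypothetical stand-in for a labelled gauge: one series
// per instance name, present only while the server exists.
type knownInstances struct {
	mu    sync.Mutex
	names map[string]struct{}
}

func newKnownInstances() *knownInstances {
	return &knownInstances{names: make(map[string]struct{})}
}

// Add registers a series for the instance (its value is always 1).
func (k *knownInstances) Add(name string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.names[name] = struct{}{}
}

// Remove deletes the series, so cardinality tracks live servers only.
func (k *knownInstances) Remove(name string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	delete(k.names, name)
}

// Render returns the current series in a Prometheus-like text form,
// sorted by name so the output is stable.
func (k *knownInstances) Render() string {
	k.mu.Lock()
	defer k.mu.Unlock()
	names := make([]string, 0, len(k.names))
	for n := range k.names {
		names = append(names, n)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, n := range names {
		fmt.Fprintf(&b, "drone_server_known_instance{name=%q} 1\n", n)
	}
	return b.String()
}

func main() {
	k := newKnownInstances()
	k.Add("drone-a") // hypothetical instance names
	k.Add("drone-b")
	k.Remove("drone-a") // server went away, so its series disappears
	fmt.Print(k.Render())
}
```

The number of series at any instant equals the number of live servers, which is why instant queries stay cheap even though the label values are unbounded over time.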

Would be interested in people's thoughts on this.

Alternatives

drone_server_count

There is already drone_server_count which could gain some labels.

  1. I'm not sure if that counts as an "API break": could it make existing queries stop working? I actually don't know the answer to that one.
  2. Would cardinality be more of a problem there? Is that a relevant concern?

Doing something with logs

For completeness:

We could make sure to log nice clear messages when servers turn up and leave.

  1. It'll be annoying to correlate if done manually
  2. Unless we use a recording rule somehow
  3. Which then has the same problems as the others, so why not use a metric directly?

iainlane avatar Mar 18 '22 17:03 iainlane

I'll fix up the test failures in a few days when I get a chance, still would appreciate feedback on the approach 🙂

iainlane avatar Mar 18 '22 17:03 iainlane

tests are fixed!

iainlane avatar Mar 29 '22 17:03 iainlane

@eoinmccafee00 I think that is about a different issue, can you re-check please?

iainlane avatar Jun 15 '22 07:06 iainlane

> @eoinmccafee00 I think that is about a different issue, can you re-check please?

Apologies, yeah, I closed the wrong ticket.

eoinmcafee00 avatar Jun 15 '22 07:06 eoinmcafee00

Hey @iainlane

Can you wrap a feature flag around this, please? I'd rather not have it enabled by default for now.

Cheers, Eoin
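
One way such an opt-in gate could look, as a sketch only: the environment variable name `EXAMPLE_ENABLE_KNOWN_INSTANCE_METRIC` and the helper `knownInstanceMetricEnabled` are hypothetical and are not the flag that the change actually adds. Unset or unparsable values count as disabled, so the default stays off.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envKnownInstanceMetric is a hypothetical variable name used only in this sketch.
const envKnownInstanceMetric = "EXAMPLE_ENABLE_KNOWN_INSTANCE_METRIC"

// knownInstanceMetricEnabled reports whether the opt-in flag is set.
// The lookup function is injected so the check can be exercised without
// touching the real process environment.
func knownInstanceMetricEnabled(lookup func(string) (string, bool)) bool {
	v, ok := lookup(envKnownInstanceMetric)
	if !ok {
		return false // unset: disabled by default
	}
	enabled, err := strconv.ParseBool(v)
	return err == nil && enabled // unparsable values are treated as disabled
}

func main() {
	if knownInstanceMetricEnabled(os.LookupEnv) {
		fmt.Println("per-instance metric enabled")
	} else {
		fmt.Println("per-instance metric disabled (default)")
	}
}
```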

eoinmcafee00 avatar Jun 15 '22 10:06 eoinmcafee00

@eoinmcafee00 okay, re-pushed

I checked that this works as expected. With the flag set, we get metrics like

# HELP drone_server_known_instance Known server instances.
# TYPE drone_server_known_instance gauge
drone_server_known_instance{name="drone-linux-amd64-SeNtfUdr",provider="google",region="us-central1-a",size="e2-standard-2"} 1

I should be able to join that with the lists from the CSPs & we'll be able to see if one side knows about an instance the other doesn't.
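
The join amounts to a set difference in each direction. A rough sketch, assuming both sides have already been reduced to plain lists of instance names; the `diff` helper and the names below are hypothetical, and fetching the lists from the metric or from the provider API is out of scope here.

```go
package main

import (
	"fmt"
	"sort"
)

// diff returns the names present in a but absent from b, sorted.
func diff(a, b []string) []string {
	seen := make(map[string]struct{}, len(b))
	for _, n := range b {
		seen[n] = struct{}{}
	}
	var out []string
	for _, n := range a {
		if _, ok := seen[n]; !ok {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical data: names taken from the metric's name label and
	// from the provider's instance listing.
	autoscaler := []string{"drone-1", "drone-2"}
	provider := []string{"drone-2", "drone-3"}

	fmt.Println("only in autoscaler:", diff(autoscaler, provider))
	fmt.Println("only in provider:", diff(provider, autoscaler))
}
```

A non-empty result in either direction is the out-of-sync condition worth alerting on.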

iainlane avatar Jul 21 '22 13:07 iainlane