Emit queue heartbeat metric for monitoring queue processor liveness

Open yycptt opened this issue 1 year ago • 3 comments

What changed?

  • Emit queue heartbeat duration metric

Why?

  • To detect a queue processor that was accidentally shut down due to a bug.
  • We previously relied on the queue backlog metric for this purpose, but in some expected cases (e.g. a namespace is throttled, or replication is delayed) the queue ack level may not move and the backlog can keep increasing. We still want the lag metric for visibility into the backlog, but we need a new metric for monitoring queue liveness.

How did you test it?

  • WIP (will run server locally and check the metric)

Potential risks

  • Some extra metric load. But the increase should be small, as each shard only emits one data point per queue every 5 minutes.

Documentation

Is hotfix candidate?

  • NO.

yycptt, Jan 17 '25 22:01

But the increase should be small, as each shard only emits one data point per queue every 5 minutes.

Is that true? Regardless of how it's emitted, it still gets scraped regularly, right?

Also, I would have thought a gauge of some kind would make more sense for monitoring liveness. It's fewer time series too. Although if you want the histogram for other reasons, that makes sense.

dnr, Jan 20 '25 22:01

The metric is emitted every queueMetricUpdateInterval, which is 5 minutes. But yeah, you are right, each scrape still gets the data :(. The good news is that we don't have any high-cardinality tag on the metric, so it will just be one number per configured bucket * # of queues, so not that bad. And I can reduce the number of buckets we use for this metric.

I think the issue with a gauge is that liveness is per shard per queue, so I would need to add a shardID tag to the metric as well, which would cause even more data to be scraped every time (# of shards * # of queues).
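
To make the comparison concrete, here is a small sketch of the arithmetic, assuming purely hypothetical values of 10 histogram buckets, 5 queues, and 512 shards (none of these numbers come from the PR). With those values the untagged histogram yields 10 * 5 = 50 bucket series, while a gauge tagged with shardID yields 512 * 5 = 2560 series. The function names are made up for illustration, and a Prometheus-style histogram would also add a few sum and count series per queue, which this sketch ignores.

```go
package main

import "fmt"

// histogramSeries counts bucket series for a histogram without a shardID tag:
// one series per configured bucket per queue.
func histogramSeries(buckets, queues int) int {
	return buckets * queues
}

// gaugeSeries counts series for a gauge tagged with shardID:
// one series per shard per queue.
func gaugeSeries(shards, queues int) int {
	return shards * queues
}

func main() {
	// Hypothetical values chosen only to illustrate the comparison.
	const buckets, queues, shards = 10, 5, 512

	fmt.Println("histogram series:", histogramSeries(buckets, queues))
	fmt.Println("gauge series:", gaugeSeries(shards, queues))
}
```

The gauge only wins on series count when the number of shards being scraped is smaller than the number of buckets.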

One thing I need to add to the histogram approach is to emit a log with the shardID tag.

yycptt, Jan 27 '25 23:01

This PR was marked as stale. Please update or close it.

github-actions[bot], May 28 '25 00:05