rabbitmq-server
Metric for queue availability
@ansd commented on Tue Mar 02 2021
In https://github.com/rabbitmq/rabbitmq-server/pull/2858, we spiked a metric for availability of a quorum queue.
In that PR, every RabbitMQ node that hosts a quorum queue leader periodically calls ra:consistent_query().
This query reaches out to all quorum queue replicas and awaits a response from a majority. We learnt that such periodic polling is too expensive with many (~10k) quorum queues.
Instead of introducing expensive periodic polling, we can still expose a metric for queue availability.
As of today, every queue has a state in an ETS table. If the state is running, we consider the queue to be up (available). This doesn't catch every scenario in which a queue is down: for a quorum queue, for example, only a minority of replicas might be available while the state still shows running. Even so, we can still expose this metric. In the PR, this metric is added by this line.
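To make the caveat concrete, here is a minimal Python sketch (not the actual Erlang code). The `QueueRecord` type, its replica fields, and the `has_majority` helper are hypothetical names used only for illustration; the only thing taken from the proposal is that a queue counts as up when its state is running.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueRecord:
    """Hypothetical stand-in for one queue's entry in the state table."""
    name: str
    state: str                   # "running" means up; other values are illustrative
    replicas_total: int = 1      # assumed field, not part of the proposed metric
    replicas_reachable: int = 1  # assumed field, not part of the proposed metric


def is_up(queue: QueueRecord) -> bool:
    """Cheap check from the proposal: up if and only if the state is running."""
    return queue.state == "running"


def has_majority(queue: QueueRecord) -> bool:
    """Roughly what a consistent query would verify; NOT covered by the metric."""
    return queue.replicas_reachable > queue.replicas_total // 2


# A quorum queue with only 1 of 3 replicas reachable still reports "running".
degraded = QueueRecord("orders", "running", replicas_total=3, replicas_reachable=1)
print(is_up(degraded), has_majority(degraded))  # True False
```

The state-based check is a local lookup with no communication between replicas, which is why it stays cheap with many queues, at the cost of missing the degraded case shown here.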
Create another PR which adds this single metric.
When we aggregate metrics, this new queues_up metric is the number of queues that are up on the given RabbitMQ node.
When we report per-object metrics, this queues_up metric reports whether a particular queue is available.
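The two reporting modes could look roughly like the following Python sketch. The function names and the queue names in the sample data are assumptions made for illustration; only the metric name queues_up and the running state come from this issue.

```python
from typing import Dict, Mapping


def aggregated_queues_up(states: Mapping[str, str]) -> int:
    """Aggregated mode: one number per node, the count of queues that are up."""
    return sum(1 for state in states.values() if state == "running")


def per_object_queues_up(states: Mapping[str, str]) -> Dict[str, int]:
    """Per-object mode: one 0/1 value per queue."""
    return {name: int(state == "running") for name, state in states.items()}


# Hypothetical queue states on a single node.
states = {"orders": "running", "payments": "running", "audit": "down"}
print(aggregated_queues_up(states))  # 2
print(per_object_queues_up(states))  # {'orders': 1, 'payments': 1, 'audit': 0}
```

Summing the per-object values gives the aggregated value, so both modes can be derived from the same state lookup.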
Such a metric can be used for alerting and to define Service Level Indicators (SLIs).
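As one possible illustration of an SLI built on the per-object metric, the sketch below treats each scrape of a queue's queues_up value (0 or 1) as a sample and reports the fraction of samples in which the queue was up. The function names and the 0.99 target are hypothetical and not part of this proposal.

```python
from typing import Sequence


def availability_sli(samples: Sequence[int]) -> float:
    """Fraction of samples in which the queue was up (each sample is 0 or 1)."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(samples) / len(samples)


def should_alert(samples: Sequence[int], target: float = 0.99) -> bool:
    """Alert when measured availability falls below the (assumed) target."""
    return availability_sli(samples) < target


# Four scrapes of one queue, one of which found it down.
window = [1, 1, 0, 1]
print(availability_sli(window))          # 0.75
print(should_alert(window, target=0.9))  # True
```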