guild-operators Feature Request: Lightweight CNCLI Blocks Prometheus Exporter

Open Straightpool opened this issue 2 years ago • 0 comments

Feature description

A lightweight Prometheus exporter for the CNCLI "Blocks" data. I propose the following metrics, all scoped to the currently active epoch, for monitoring purposes:

cntools_cncli_blocks_metrics_next_leader_time_utc
cntools_cncli_blocks_metrics_next_next_leader_time_utc
cntools_cncli_blocks_metrics_ideal
cntools_cncli_blocks_metrics_luck
cntools_cncli_blocks_metrics_adopted_total
cntools_cncli_blocks_metrics_confirmed_total
cntools_cncli_blocks_metrics_missed_total
cntools_cncli_blocks_metrics_ghosted_total
cntools_cncli_blocks_metrics_stolen_total
cntools_cncli_blocks_metrics_invalid_total
cntools_cncli_blocks_metrics_adopted_max_consec
cntools_cncli_blocks_metrics_confirmed_max_consec
cntools_cncli_blocks_metrics_missed_max_consec
cntools_cncli_blocks_metrics_ghosted_max_consec
cntools_cncli_blocks_metrics_stolen_max_consec
cntools_cncli_blocks_metrics_invalid_max_consec

*_next_leader_time_utc returns the UTC time of the next leader slot.
*_next_next_leader_time_utc returns the UTC time of the leader slot after next.

*_total is the number of blocks in that state so far in the current epoch. *_max_consec is the length of the longest run of consecutive blocks in that state.
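A Prometheus sample value is a floating-point number, so the two leader-time metrics would most likely have to be exposed as Unix timestamps rather than formatted strings. The following is a minimal sketch of that conversion, assuming the leader time is available as an ISO-8601 string in UTC; the function name and the input format are hypothetical and not taken from CNCLI or CNTools.

```python
from datetime import datetime, timezone


def leader_time_to_gauge(ts_utc: str) -> float:
    """Convert an ISO-8601 timestamp to Unix seconds for use as a gauge value.

    Assumption: a timestamp without an explicit offset is already in UTC.
    """
    dt = datetime.fromisoformat(ts_utc)
    if dt.tzinfo is None:
        # Treat naive timestamps as UTC instead of local time.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()
```

With the value stored as Unix seconds, a dashboard or alert rule can subtract the current time to get the number of seconds until the next leader slot.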

Example

Example block sequence:

confirmed
confirmed
stolen
confirmed
ghosted
confirmed
confirmed
confirmed
missed
confirmed
confirmed
confirmed
confirmed
missed
missed
missed
invalid

Results in:

cntools_cncli_blocks_metrics_adopted_total: 0
cntools_cncli_blocks_metrics_confirmed_total: 10
cntools_cncli_blocks_metrics_missed_total: 4
cntools_cncli_blocks_metrics_ghosted_total: 1
cntools_cncli_blocks_metrics_stolen_total: 1
cntools_cncli_blocks_metrics_invalid_total: 1
cntools_cncli_blocks_metrics_adopted_max_consec: 0
cntools_cncli_blocks_metrics_confirmed_max_consec: 4
cntools_cncli_blocks_metrics_missed_max_consec: 3
cntools_cncli_blocks_metrics_ghosted_max_consec: 1
cntools_cncli_blocks_metrics_stolen_max_consec: 1
cntools_cncli_blocks_metrics_invalid_max_consec: 1
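The totals and longest runs above can be reproduced with a short Python sketch. It is an illustration only, not an existing CNTools component: the function name `block_metrics` is hypothetical, and the input is assumed to be the block states of the current epoch in slot order.

```python
from itertools import groupby

STATES = ("adopted", "confirmed", "missed", "ghosted", "stolen", "invalid")
PREFIX = "cntools_cncli_blocks_metrics"


def block_metrics(sequence):
    """Return the *_total and *_max_consec values for an ordered list of states."""
    metrics = {}
    for state in STATES:
        metrics[f"{PREFIX}_{state}_total"] = 0
        metrics[f"{PREFIX}_{state}_max_consec"] = 0
    # groupby yields one group per run of identical adjacent states.
    for state, run in groupby(sequence):
        if state not in STATES:
            continue  # ignore states this sketch does not know about
        length = sum(1 for _ in run)
        metrics[f"{PREFIX}_{state}_total"] += length
        key = f"{PREFIX}_{state}_max_consec"
        metrics[key] = max(metrics[key], length)
    return metrics


EXAMPLE = (
    ["confirmed"] * 2 + ["stolen"] + ["confirmed"] + ["ghosted"]
    + ["confirmed"] * 3 + ["missed"] + ["confirmed"] * 4
    + ["missed"] * 3 + ["invalid"]
)

if __name__ == "__main__":
    for name, value in block_metrics(EXAMPLE).items():
        print(f"{name}: {value}")
```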

Rationale

Errors or bugs can appear without warning in the pool infrastructure or in the Cardano node itself and lead to several consecutive lost blocks. At present, a single missed block is usually a misclassification of what is really a ghosted block, and a single ghosted block is usually the result of a race condition. If any of these error states occurs several times in direct succession, however, something is clearly wrong and ought to trigger an alert or action based on Prometheus / Grafana rules. In the example, "cntools_cncli_blocks_metrics_missed_max_consec: 3" would warrant an alert or action, as would cntools_cncli_blocks_metrics_invalid_total != 0. Without a Prometheus exporter it can take a while to notice the issue, especially if it triggers no other alerts, and more blocks are lost than necessary.

Possible implementation approaches

At first glance, an architecture based on SQL scraping looks feasible:

Build upon an open source extensible sql-query Prometheus exporter such as: https://github.com/albertodonato/query-exporter
To determine consecutive runs, a SQL count-over-partition query should deliver the expected results; see e.g. https://stackoverflow.com/questions/36927685/count-number-of-consecutive-occurrence-of-values-in-table
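The consecutive-run query could look roughly like the sketch below, which uses the common "gaps and islands" pattern with two ROW_NUMBER() windows. It runs against an in-memory SQLite database through Python's `sqlite3` module. The table name `blocks` and the columns `epoch`, `slot`, and `status` are hypothetical stand-ins and not the actual CNCLI or CNTools schema; window functions also assume SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical schema: blocks(epoch, slot, status). Adjust to the real database.
RUNS_SQL = """
WITH ordered AS (
    SELECT status,
           ROW_NUMBER() OVER (ORDER BY slot)
         - ROW_NUMBER() OVER (PARTITION BY status ORDER BY slot) AS run_id
    FROM blocks
    WHERE epoch = :epoch
),
runs AS (
    -- One row per run of identical adjacent statuses.
    SELECT status, run_id, COUNT(*) AS run_length
    FROM ordered
    GROUP BY status, run_id
)
SELECT status, SUM(run_length) AS total, MAX(run_length) AS max_consec
FROM runs
GROUP BY status
"""


def runs_by_status(conn, epoch):
    """Return {status: (total, max_consec)} for one epoch."""
    rows = conn.execute(RUNS_SQL, {"epoch": epoch}).fetchall()
    return {status: (total, max_consec) for status, total, max_consec in rows}


def demo_db(sequence, epoch=1):
    """Build an in-memory table holding `sequence` in slot order (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blocks (epoch INTEGER, slot INTEGER, status TEXT)")
    conn.executemany(
        "INSERT INTO blocks VALUES (?, ?, ?)",
        [(epoch, i * 100, status) for i, status in enumerate(sequence)],
    )
    return conn
```

Within one status, the difference between the two row numbers stays constant for the duration of a run and changes as soon as another status interrupts it, which is what makes `run_id` usable as a grouping key. A query of this shape could then be placed in the exporter's configuration, one metric per result column.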

Another approach could be to use Bash pushing:

Run a Prometheus Pushgateway and have Bash scripts push an update whenever a block is updated, see e.g. https://medium.com/avmconsulting-blog/pushing-bash-script-result-to-prometheus-using-pushgateway-a0760cd261e
See https://prometheus.io/docs/practices/pushing/ for the implications of pushing
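The push variant can be sketched as follows. The proposal above names Bash; Python is used here only to keep the examples in one language, and the same request can be sent with curl. The gateway address `http://localhost:9091` and the job name `cncli_blocks` are assumptions, and all values are rendered as gauges in the Prometheus text exposition format.

```python
import urllib.request


def render_exposition(metrics):
    """Render a {name: value} mapping in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The body sent to the Pushgateway must end with a newline.
    return "\n".join(lines) + "\n"


def push(metrics, gateway="http://localhost:9091", job="cncli_blocks"):
    """Send the metrics to a Pushgateway (assumed address and job name)."""
    body = render_exposition(metrics).encode("utf-8")
    # PUT replaces every metric previously pushed for this job.
    request = urllib.request.Request(
        f"{gateway}/metrics/job/{job}", data=body, method="PUT"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Using PUT rather than POST means that stale series from an earlier push do not linger in the group, which suits metrics that are recomputed in full on every block update.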

Considered alternatives

None

Version:

OS: Ubuntu 20.04 LTS
Product version: CNTools 9.1.0
Cardano Node version: cardano-node 1.34.1 - linux-x86_64 - ghc-8.10 git rev 73f9a746362695dc2cb63ba757fbcabb81733d23
Network you're connecting to: Mainnet

Jun 25 '22 18:06 Straightpool
