guild-operators
Feature Request: Lightweight CNCLI Blocks Prometheus Exporter
Feature description
A lightweight Prometheus exporter for the CNCLI "Blocks" data. I propose the following metrics, each scoped to the currently active epoch, for monitoring purposes:
- cntools_cncli_blocks_metrics_next_leader_time_utc
- cntools_cncli_blocks_metrics_next_next_leader_time_utc
- cntools_cncli_blocks_metrics_ideal
- cntools_cncli_blocks_metrics_luck
- cntools_cncli_blocks_metrics_adopted_total
- cntools_cncli_blocks_metrics_confirmed_total
- cntools_cncli_blocks_metrics_missed_total
- cntools_cncli_blocks_metrics_ghosted_total
- cntools_cncli_blocks_metrics_stolen_total
- cntools_cncli_blocks_metrics_invalid_total
- cntools_cncli_blocks_metrics_adopted_max_consec
- cntools_cncli_blocks_metrics_confirmed_max_consec
- cntools_cncli_blocks_metrics_missed_max_consec
- cntools_cncli_blocks_metrics_ghosted_max_consec
- cntools_cncli_blocks_metrics_stolen_max_consec
- cntools_cncli_blocks_metrics_invalid_max_consec
*_next_leader_time_utc returns the UTC time of the next leader slot.
*_next_next_leader_time_utc returns the UTC time of the leader slot after next.
*_total refers to the total number of blocks in that state at the current time.
*_max_consec refers to the longest run of consecutive blocks in that state.
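Prometheus sample values are floating-point numbers, so one way to expose the two leader-time metrics is as Unix timestamps. The following is only a sketch under that assumption: the helper names `leader_time_to_gauge` and `next_two_leader_times` are hypothetical, and the ISO 8601 input format is assumed rather than taken from CNCLI.

```python
from datetime import datetime, timezone


def leader_time_to_gauge(at_utc):
    """Convert an ISO 8601 time string to Unix seconds (assumed gauge value)."""
    parsed = datetime.fromisoformat(at_utc)
    if parsed.tzinfo is None:
        # Assumption: times without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def next_two_leader_times(leader_times, now):
    """Return the next and the next-next leader time after `now`, as Unix seconds.

    Missing entries are reported as None, e.g. when no leader slot is left
    in the epoch.
    """
    upcoming = sorted(t for t in map(leader_time_to_gauge, leader_times) if t > now)
    padded = upcoming[:2] + [None] * (2 - len(upcoming[:2]))
    return padded[0], padded[1]
```

A Grafana panel or alert rule could then compare these values against the current time, for example to show the time remaining until the next leader slot.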
Example
Example block sequence:
- confirmed
- confirmed
- stolen
- confirmed
- ghosted
- confirmed
- confirmed
- confirmed
- missed
- confirmed
- confirmed
- confirmed
- confirmed
- missed
- missed
- missed
- invalid
Results in:
- cntools_cncli_blocks_metrics_adopted_total: 0
- cntools_cncli_blocks_metrics_confirmed_total: 10
- cntools_cncli_blocks_metrics_missed_total: 4
- cntools_cncli_blocks_metrics_ghosted_total: 1
- cntools_cncli_blocks_metrics_stolen_total: 1
- cntools_cncli_blocks_metrics_invalid_total: 1
- cntools_cncli_blocks_metrics_adopted_max_consec: 0
- cntools_cncli_blocks_metrics_confirmed_max_consec: 4
- cntools_cncli_blocks_metrics_missed_max_consec: 3
- cntools_cncli_blocks_metrics_ghosted_max_consec: 1
- cntools_cncli_blocks_metrics_stolen_max_consec: 1
- cntools_cncli_blocks_metrics_invalid_max_consec: 1
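The totals and longest runs above can be reproduced with a short sketch. This is an illustration in Python, not the proposed exporter itself; the function name `block_metrics` is hypothetical, and the block states are assumed to arrive as an ordered list of strings.

```python
from itertools import groupby

STATES = ("adopted", "confirmed", "missed", "ghosted", "stolen", "invalid")
PREFIX = "cntools_cncli_blocks_metrics"


def block_metrics(sequence):
    """Return per-state totals and longest consecutive runs for a block sequence."""
    metrics = {}
    for state in STATES:
        metrics[f"{PREFIX}_{state}_total"] = 0
        metrics[f"{PREFIX}_{state}_max_consec"] = 0
    # groupby collapses each run of identical adjacent states into one group.
    for state, run in groupby(sequence):
        if state not in STATES:
            # Unknown states are not counted, but they still break a run.
            continue
        length = sum(1 for _ in run)
        metrics[f"{PREFIX}_{state}_total"] += length
        key = f"{PREFIX}_{state}_max_consec"
        metrics[key] = max(metrics[key], length)
    return metrics


# The example block sequence from this issue, in order.
example = (
    ["confirmed"] * 2 + ["stolen"] + ["confirmed"] + ["ghosted"]
    + ["confirmed"] * 3 + ["missed"] + ["confirmed"] * 4
    + ["missed"] * 3 + ["invalid"]
)

if __name__ == "__main__":
    for name, value in block_metrics(example).items():
        print(f"{name}: {value}")
```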
Rationale
Errors or bugs can occur out of the blue in the pool infrastructure or in the Cardano node itself, and they can lead to a number of consecutive lost blocks. Currently, a single missed block is usually a misclassification of what is really a ghosted block, and a single ghosted block is usually the result of a race condition. However, if any of these error states occurs several times in direct succession, something is clearly off and ought to trigger an alert / action based on Prometheus / Grafana rules. In the example, "cntools_cncli_blocks_metrics_missed_max_consec: 3" would be such a case, as would cntools_cncli_blocks_metrics_invalid_total != 0. Without a Prometheus exporter it might take a while to notice the issue, especially if it triggers no other alerts, and more blocks are lost than necessary.
Possible implementation approaches
At first glance, SQL scraping seems to be a feasible architecture:
- Build upon an open-source, extensible SQL query exporter for Prometheus, such as: https://github.com/albertodonato/query-exporter
- To determine consecutive runs, a SQL query that counts over partitions should deliver the expected results, see e.g. https://stackoverflow.com/questions/36927685/count-number-of-consecutive-occurrence-of-values-in-table
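The partition-count idea from the second bullet can be sketched as a "gaps and islands" query. This is a sketch only: the `blocklog` table and its `slot`, `epoch` and `status` columns are a simplified, assumed schema and may not match the real CNCLI database, and the window functions need SQLite 3.25 or newer. Python's `sqlite3` module is used here just to make the query runnable against an in-memory database.

```python
import sqlite3

# Assumed, simplified schema; the real CNCLI/CNTools table may differ.
SCHEMA = "CREATE TABLE blocklog (slot INTEGER PRIMARY KEY, epoch INTEGER, status TEXT)"

# Within one run of equal statuses, (overall row number) minus
# (row number within that status) stays constant, so it identifies the run.
MAX_CONSEC_SQL = """
WITH runs AS (
    SELECT status,
           ROW_NUMBER() OVER (ORDER BY slot)
         - ROW_NUMBER() OVER (PARTITION BY status ORDER BY slot) AS run_id
    FROM blocklog
    WHERE epoch = :epoch
),
run_lengths AS (
    SELECT status, COUNT(*) AS run_length
    FROM runs
    GROUP BY status, run_id
)
SELECT status, SUM(run_length) AS total, MAX(run_length) AS max_consec
FROM run_lengths
GROUP BY status
"""


def max_consec_by_status(conn, epoch):
    """Return {status: (total, max_consec)} for the given epoch."""
    rows = conn.execute(MAX_CONSEC_SQL, {"epoch": epoch}).fetchall()
    return {status: (total, max_consec) for status, total, max_consec in rows}


def demo_connection():
    """Build an in-memory database holding the example sequence from this issue."""
    statuses = (
        ["confirmed"] * 2 + ["stolen"] + ["confirmed"] + ["ghosted"]
        + ["confirmed"] * 3 + ["missed"] + ["confirmed"] * 4
        + ["missed"] * 3 + ["invalid"]
    )
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # Slot numbers and the epoch number are made up for the demo.
    rows = [(100 + 10 * i, 330, status) for i, status in enumerate(statuses)]
    conn.executemany("INSERT INTO blocklog VALUES (?, ?, ?)", rows)
    return conn
```

States that never occur in the epoch produce no row, so an exporter built on such a query would have to report them as 0 itself.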
Another approach could be to use Bash pushing:
- Run a Prometheus Pushgateway and have Bash scripts push new values to it whenever a block is updated, see e.g. https://medium.com/avmconsulting-blog/pushing-bash-script-result-to-prometheus-using-pushgateway-a0760cd261e
- See https://prometheus.io/docs/practices/pushing/ for the implications of pushing
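The push approach might look roughly like the following, written in Python rather than Bash to keep it self-contained. The gateway address `http://localhost:9091` and the job name `cncli_blocks` are assumptions, and exposing every metric as a gauge is a choice made for this sketch, since the values reset with each epoch.

```python
import urllib.request


def render_exposition(metrics):
    """Render a {name: value} mapping in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def push_metrics(metrics, gateway="http://localhost:9091", job="cncli_blocks"):
    """PUT the metrics to a Pushgateway, replacing the job's metric group.

    The gateway URL and job name are placeholders for illustration.
    """
    body = render_exposition(metrics).encode("utf-8")
    request = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

With PUT, the whole metric group for the job is replaced on every push, so each push should carry the full set of metrics rather than only the ones that changed.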
Considered alternatives
None
Version:
- OS: Ubuntu 20.04 LTS
- Product version: CNTools 9.1.0
- Cardano Node version: cardano-node 1.34.1 - linux-x86_64 - ghc-8.10 git rev 73f9a746362695dc2cb63ba757fbcabb81733d23
- Network you're connecting to: Mainnet