Feature Request: Lightweight CNCLI Blocks Prometheus Exporter

Open Straightpool opened this issue 2 years ago • 0 comments

Feature description

A lightweight Prometheus exporter for the CNCLI "Blocks" data. I propose the following metrics, all scoped to the currently active epoch for monitoring purposes:

  • cntools_cncli_blocks_metrics_next_leader_time_utc
  • cntools_cncli_blocks_metrics_next_next_leader_time_utc
  • cntools_cncli_blocks_metrics_ideal
  • cntools_cncli_blocks_metrics_luck
  • cntools_cncli_blocks_metrics_adopted_total
  • cntools_cncli_blocks_metrics_confirmed_total
  • cntools_cncli_blocks_metrics_missed_total
  • cntools_cncli_blocks_metrics_ghosted_total
  • cntools_cncli_blocks_metrics_stolen_total
  • cntools_cncli_blocks_metrics_invalid_total
  • cntools_cncli_blocks_metrics_adopted_max_consec
  • cntools_cncli_blocks_metrics_confirmed_max_consec
  • cntools_cncli_blocks_metrics_missed_max_consec
  • cntools_cncli_blocks_metrics_ghosted_max_consec
  • cntools_cncli_blocks_metrics_stolen_max_consec
  • cntools_cncli_blocks_metrics_invalid_max_consec

next_leader_time_utc returns the UTC time of the next leader slot. next_next_leader_time_utc returns the UTC time of the leader slot after next.

*_total refers to the number of blocks in that state at the current time. *_max_consec refers to the longest run of consecutive blocks in that state.
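As a rough sketch of how the two next-leader metrics could be derived, assuming the leader schedule is already available as a list of timezone-aware UTC datetimes: the function names are hypothetical, and exposing the time as Unix seconds (with 0 when no further slot is scheduled) is an assumption, since Prometheus samples are plain numbers.

```python
from datetime import datetime, timezone

def next_leader_times(scheduled, now):
    """Return (next, after_next) leader times strictly after `now`, or None."""
    upcoming = sorted(t for t in scheduled if t > now)[:2]
    # Pad with None when fewer than two slots remain in the schedule.
    upcoming += [None] * (2 - len(upcoming))
    return upcoming[0], upcoming[1]

def to_gauge(ts):
    """Convert a datetime to Unix seconds; 0 means 'nothing scheduled' (assumption)."""
    return 0 if ts is None else int(ts.timestamp())

# Hypothetical schedule, for illustration only.
schedule = [
    datetime(2022, 6, 25, 12, 0, tzinfo=timezone.utc),
    datetime(2022, 6, 26, 3, 30, tzinfo=timezone.utc),
    datetime(2022, 6, 27, 9, 15, tzinfo=timezone.utc),
]
now = datetime(2022, 6, 25, 18, 6, tzinfo=timezone.utc)
next_slot, after_next_slot = next_leader_times(schedule, now)
```

Grafana can render such a Unix timestamp as a date or as a countdown, which is one reason to prefer it over a formatted string.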

Example

Example block sequence:

  1. confirmed
  2. confirmed
  3. stolen
  4. confirmed
  5. ghosted
  6. confirmed
  7. confirmed
  8. confirmed
  9. missed
  10. confirmed
  11. confirmed
  12. confirmed
  13. confirmed
  14. missed
  15. missed
  16. missed
  17. invalid

Results in:

  • cntools_cncli_blocks_metrics_adopted_total: 0
  • cntools_cncli_blocks_metrics_confirmed_total: 10
  • cntools_cncli_blocks_metrics_missed_total: 4
  • cntools_cncli_blocks_metrics_ghosted_total: 1
  • cntools_cncli_blocks_metrics_stolen_total: 1
  • cntools_cncli_blocks_metrics_invalid_total: 1
  • cntools_cncli_blocks_metrics_adopted_max_consec: 0
  • cntools_cncli_blocks_metrics_confirmed_max_consec: 4
  • cntools_cncli_blocks_metrics_missed_max_consec: 3
  • cntools_cncli_blocks_metrics_ghosted_max_consec: 1
  • cntools_cncli_blocks_metrics_stolen_max_consec: 1
  • cntools_cncli_blocks_metrics_invalid_max_consec: 1
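The numbers above can be reproduced with a short Python sketch. It is only an illustration of the intended semantics, not a proposed implementation; `block_metrics` is a hypothetical helper, and the keys it returns are the metric names without the `cntools_cncli_blocks_metrics_` prefix.

```python
from itertools import groupby

STATES = ("adopted", "confirmed", "missed", "ghosted", "stolen", "invalid")

def block_metrics(statuses):
    """Return totals and longest consecutive runs per block state."""
    metrics = {}
    for state in STATES:
        metrics[f"{state}_total"] = 0
        metrics[f"{state}_max_consec"] = 0
    # groupby yields one group per run of identical adjacent statuses.
    for state, run in groupby(statuses):
        if state not in STATES:
            continue  # statuses outside the proposed set are not counted
        length = sum(1 for _ in run)
        metrics[f"{state}_total"] += length
        key = f"{state}_max_consec"
        metrics[key] = max(metrics[key], length)
    return metrics

# The 17-block example sequence from this issue, in order.
example = (
    ["confirmed"] * 2 + ["stolen"] + ["confirmed"] + ["ghosted"]
    + ["confirmed"] * 3 + ["missed"] + ["confirmed"] * 4
    + ["missed"] * 3 + ["invalid"]
)
result = block_metrics(example)
```

Here `result["confirmed_max_consec"]` is 4 (blocks 10 to 13) and `result["missed_max_consec"]` is 3 (blocks 14 to 16), matching the list above.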

Rationale

Errors or bugs can occur out of the blue in the pool infrastructure or in the Cardano node itself, and can lead to a number of consecutive lost blocks. A single missed block is currently usually a misclassification and is really a ghosted block. A single ghosted block is currently usually the result of a race condition. However, if any of these error states occurs several times in direct succession, something is clearly off and ought to trigger an alert / action based on Prometheus / Grafana rules. In the example, "cntools_cncli_blocks_metrics_missed_max_consec: 3" would be such a case, as would cntools_cncli_blocks_metrics_invalid_total != 0. Without a Prometheus exporter it might take a while to notice the issue, especially if it triggers no other alerts, and more blocks would be lost than necessary.

Possible implementation approaches

At first glance, SQL scraping looks like a feasible architecture:

  • Build upon an open source extensible sql-query Prometheus exporter such as: https://github.com/albertodonato/query-exporter
  • To determine consecutive runs, a SQL query that counts over partitions should deliver the expected results, see e.g. https://stackoverflow.com/questions/36927685/count-number-of-consecutive-occurrence-of-values-in-table
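A minimal sketch of such a partition query, run through Python's `sqlite3` module so it can be tried in isolation. The table name `blocklog` and its `slot`, `epoch` and `status` columns are assumptions made for this illustration and should be checked against the actual CNCLI database schema; the epoch number is arbitrary. The query needs SQLite 3.25 or later for window functions.

```python
import sqlite3

# Gaps-and-islands: within one status, the difference between the overall row
# number and the per-status row number stays constant across a consecutive run.
MAX_CONSEC_SQL = """
WITH ordered AS (
    SELECT status,
           ROW_NUMBER() OVER (ORDER BY slot) AS rn,
           ROW_NUMBER() OVER (PARTITION BY status ORDER BY slot) AS rn_status
    FROM blocklog
    WHERE epoch = :epoch
),
runs AS (
    SELECT status, COUNT(*) AS run_length
    FROM ordered
    GROUP BY status, rn - rn_status
)
SELECT status, SUM(run_length) AS total, MAX(run_length) AS max_consec
FROM runs
GROUP BY status
"""

def query_metrics(conn, epoch):
    """Return {status: (total, max_consec)} for one epoch."""
    rows = conn.execute(MAX_CONSEC_SQL, {"epoch": epoch}).fetchall()
    return {status: (total, max_consec) for status, total, max_consec in rows}

# In-memory demo table (hypothetical schema) filled with the example sequence.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blocklog (slot INTEGER PRIMARY KEY, epoch INTEGER, status TEXT)"
)
sequence = (
    ["confirmed"] * 2 + ["stolen"] + ["confirmed"] + ["ghosted"]
    + ["confirmed"] * 3 + ["missed"] + ["confirmed"] * 4
    + ["missed"] * 3 + ["invalid"]
)
conn.executemany(
    "INSERT INTO blocklog VALUES (?, ?, ?)",
    [(1000 + i, 100, status) for i, status in enumerate(sequence)],
)
sql_result = query_metrics(conn, 100)
```

States with no rows in the epoch (here `adopted`) do not appear in the result, so an exporter built on this query would have to report 0 for them explicitly.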

Another approach could be to use Bash pushing:

  • Run a Prometheus Pushgateway and have Bash scripts push an update whenever there is an update on a block, see e.g. https://medium.com/avmconsulting-blog/pushing-bash-script-result-to-prometheus-using-pushgateway-a0760cd261e
  • See https://prometheus.io/docs/practices/pushing/ for the implications of pushing
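The push approach could look roughly like the sketch below, written in Python rather than Bash so that the payload format can be checked on its own. The gateway address, the job name `cncli_blocks` and both function names are assumptions for illustration. The payload uses the Prometheus text exposition format, and the request targets the Pushgateway's `/metrics/job/<job>` endpoint.

```python
import urllib.request

PREFIX = "cntools_cncli_blocks_metrics_"

def render_metrics(metrics):
    """Render {suffix: value} as Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        full_name = PREFIX + name
        lines.append(f"# TYPE {full_name} gauge")
        lines.append(f"{full_name} {value}")
    # The Pushgateway expects the body to end with a newline.
    return "\n".join(lines) + "\n"

def push(body, gateway="http://localhost:9091", job="cncli_blocks"):
    """PUT the payload to the Pushgateway, replacing the job's metric group."""
    request = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="PUT",
    )
    request.add_header("Content-Type", "text/plain; version=0.0.4")
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status

payload = render_metrics({"missed_total": 4, "missed_max_consec": 3})
# push(payload)  # requires a running Pushgateway, so left commented out
```

PUT replaces every metric previously pushed for the job, so each push should carry the full metric set. POST would only replace metrics with the same names.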

Considered alternatives

None

Version:

  • OS: Ubuntu 20.04 LTS
  • Product version: CNTools 9.1.0
  • Cardano Node version: cardano-node 1.34.1 - linux-x86_64 - ghc-8.10 git rev 73f9a746362695dc2cb63ba757fbcabb81733d23
  • Network you're connecting to: Mainnet

Straightpool, Jun 25 '22 18:06