Enhancing Network Monitoring: Including Upgrade-Specific Metrics in Prometheus
Issue Description:
In our continuous effort to improve network management and decision-making processes, we have identified a crucial need for incorporating upgrade-specific metrics into our Prometheus monitoring system. This enhancement aims to provide a comprehensive and authoritative view of network upgrades, addressing current discrepancies observed through our a3 console and pvtop toolsets, and ensuring alignment with ongoing network changes.
Proposed Metrics Integration:
To achieve this, we propose adding the following metrics to the Prometheus metrics suite, particularly focusing on the network running on telemetry port 26660:
- Number of Validators Online: This metric will track the total count of active validators within the network, providing a real-time view of network strength and reliability.
- Number of Validators Offline: Conversely, this will monitor the count of validators that are not currently active, offering insights into network issues or maintenance statuses.
- Percentage of Validators Online: This percentage will give a quick overview of network health by showing the ratio of online validators to the total validators.
- Percentage of Validators Offline: Complementing the above, this metric will display the proportion of offline validators, helping in assessing network vulnerabilities.
- Consensus Percentage Vote on a Block Based on Delegation Power: This metric will measure the percentage of total delegation power held by validators that have voted for a particular block, indicating the level of agreement or disagreement among validators.
- Consensus Percentage Commit on a Block Based on Delegation Power: Similarly, this will track the percentage of total delegation power held by validators that have committed to a block, further elucidating the consensus stability for each block.
These metrics are essential for accurate, real-time monitoring and decision-making regarding network upgrades and maintenance. By integrating them into our Prometheus setup, we aim to resolve current tool discrepancies and enhance our network's operational transparency and efficiency.
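To make the difference between the count-based and the power-based percentages concrete, here is a minimal sketch in Python. The `Validator` type, the `online` flag, and the output key names are hypothetical and only for illustration; how a node decides that a validator is "online" is not defined in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Validator:
    # Hypothetical record; field names are assumptions for this sketch.
    name: str
    power: int      # delegation (voting) power
    online: bool    # however "online" ends up being determined


def _pct(part, whole):
    # Guard against an empty validator set.
    return 100.0 * part / whole if whole else 0.0


def validator_summary(validators):
    """Compute count-based and power-weighted availability figures."""
    total = len(validators)
    online = sum(1 for v in validators if v.online)
    total_power = sum(v.power for v in validators)
    online_power = sum(v.power for v in validators if v.online)
    return {
        "validators_online": online,
        "validators_offline": total - online,
        "validators_online_pct": _pct(online, total),
        "validators_offline_pct": _pct(total - online, total),
        "online_power_pct": _pct(online_power, total_power),
    }


if __name__ == "__main__":
    vals = [
        Validator("a", 60, True),
        Validator("b", 30, True),
        Validator("c", 10, False),
    ]
    print(validator_summary(vals))
```

With one small validator offline, roughly two thirds of validators are online by count while 90% of the power is online, which is why the two consensus metrics are defined on delegation power rather than on validator count.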
These are very desirable results; however, it will be quite challenging to get definitive or reliable answers to these questions.
The key difficulty is that this is a distributed consensus protocol with arbitrary failure modes of its participants.
It may be possible to get suggestive answers similar to what we can guess from pvtop and the consensus page from the Tendermint 26656 port.
My recommendation is to have a more streamlined and contextual presentation of the information from the consensus page 26656 to aid human interpretation.
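As a rough illustration of what a more streamlined presentation could look like, the sketch below parses one vote bit-array summary string into named fields. It assumes the summary has the form `BA{N:x_x...} voted/total = fraction`, as such strings appear in Tendermint/CometBFT consensus state output, and that `voted/total` is voting power rather than validator count. The function name and the returned keys are made up for this example.

```python
import re

# Assumed shape: "BA{4:xx_x} 75/100 = 0.75"
_BIT_ARRAY = re.compile(r"BA\{(\d+):([x_]*)\}\s+(\d+)/(\d+)\s+=\s+([\d.]+)")


def parse_bit_array(summary):
    """Turn a vote bit-array summary string into a small dict."""
    m = _BIT_ARRAY.fullmatch(summary.strip())
    if m is None:
        raise ValueError(f"unrecognized bit array summary: {summary!r}")
    size, bits, voted, total, _fraction = m.groups()
    voted_power, total_power = int(voted), int(total)
    return {
        "validators": int(size),
        # 'x' marks a validator index whose vote was seen, '_' a missing one.
        "voted_indexes": [i for i, b in enumerate(bits) if b == "x"],
        "missing_indexes": [i for i, b in enumerate(bits) if b == "_"],
        "voted_power": voted_power,
        "total_power": total_power,
        "power_pct": 100.0 * voted_power / total_power if total_power else 0.0,
    }


if __name__ == "__main__":
    print(parse_bit_array("BA{4:xx_x} 75/100 = 0.75"))
```

The missing indexes could then be mapped to validator monikers so a human can see at a glance who has not voted in the current round.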
Is this still needed? I believe we have most of the data needed to monitor the status of validators during an upgrade.
@lumtis @CharlieMc0, if there are any specific metrics needed, please update the description and I can work on closing this issue out.
If we could get the voting power participating in each round emitted as a metric on each node, that would be nice to have, but it is not that important. It matters mostly during upgrades.
"Upgrade v99.99.9 -- Node A sees 45% of VP online but Node B only sees 25%. Why is Node B seeing different values?"
I don't think it's worth spending a lot of time on that though if the data isn't easily available.
@morde08 anything specific you want to see?
The voting power can be fetched in two ways:
- LastBlock: fetch who voted in the last block and add metrics for that. I don't think this is useful for us, since during an upgrade we normally get stuck producing the block after the upgrade.
- NextBlock: check the votes in the Proposal, Prevote, and Precommit stages. Since the block is not created yet, this information has to be fetched from the CometBFT consensus reactor, and those values are already being provided to the metrics server. I think we can just try putting the required values on a dashboard, if we don't have it already.
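For the "Node A sees 45% but Node B sees 25%" question, a dashboard would essentially do what the sketch below does: read the same series from each node's metrics output and show the spread. This is only an illustration in Python. The parser handles the common `name{labels} value` lines of the Prometheus text format and nothing more, and the series name `example_prevote_power_percent` is hypothetical, not an actual CometBFT metric name.

```python
import re

# Matches "name value" or "name{labels} value"; any trailing timestamp is ignored.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def parse_metrics(text):
    """Map each 'name{labels}' series in a metrics scrape to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        m = _SAMPLE.match(line)
        if m is None:
            continue
        name, labels, value = m.groups()
        try:
            samples[name + (labels or "")] = float(value)
        except ValueError:
            continue
    return samples


def compare_nodes(scrapes, series):
    """Return each node's value for one series and the max-min spread."""
    values = {node: parse_metrics(text).get(series) for node, text in scrapes.items()}
    seen = [v for v in values.values() if v is not None]
    spread = max(seen) - min(seen) if seen else None
    return values, spread


if __name__ == "__main__":
    # Hypothetical scrape bodies, one per node.
    scrapes = {
        "node-a": "# TYPE example_prevote_power_percent gauge\nexample_prevote_power_percent 45\n",
        "node-b": "# TYPE example_prevote_power_percent gauge\nexample_prevote_power_percent 25\n",
    }
    print(compare_nodes(scrapes, "example_prevote_power_percent"))
```

A large spread between nodes at the same height and round would point to a connectivity or gossip issue on the lagging node rather than to validators actually being offline.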
@CharlieMc0 @morde08, I am marking this as closed. The CometBFT metrics would be the best place to monitor block production, which we are already doing.