Prometheus metrics seem wrong
The RPC call metrics in our Grafana dashboard show hundreds of millions of getLog calls every 5 minutes: https://monitor.harmony.one/d/joo9Q1m7z/s0-explorer-rpc-metrics?orgId=1&from=now-15d&to=now
This number is far out of range: on the cloud service side, the overall number of calls to all RPCs is in the hundreds of millions per HOUR.
We need to look into the Prometheus metrics and the reporting to fix the stats data issues.
It seems there was some sort of aggregation on the metrics. We found that some of the nodes were using the same instance id, and all of the metrics are grouped by instance id first. Soph helped fix the instance id (by removing the key), so each node now has its own unique instance id and there is no instance id duplication in the push gateway. After applying this fix, all the metrics are steady and the values are in the proper range. We monitored the metrics for a few days to make sure no impulses or spikes would appear in the metrics again, and the results show that the metrics now work properly.
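
For anyone hitting something similar, here is a minimal Python sketch of one plausible way a shared instance id can inflate counter-based stats. It is an assumption about the mechanism, not a confirmed diagnosis of this issue: if two nodes push their counters under the same grouping key, their values can alternate in a single series, and a reset-aware increase calculation then counts every drop as a counter reset. The function name `increase_with_resets` and all sample values are hypothetical, and the calculation is a simplified stand-in for Prometheus `increase()` (no extrapolation).

```python
def increase_with_resets(samples):
    """Sum the increase across consecutive samples.

    Any drop in value is treated as a counter reset, so the full new
    value is added. This is a simplified model of reset handling,
    not the exact Prometheus algorithm.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # counter appears to have restarted from zero
    return total


# Hypothetical counters from two nodes (illustrative values only).
node_a = [1000, 1010, 1020]
node_b = [50, 55, 60]

# Unique instance ids: each node is its own series.
separate = increase_with_resets(node_a) + increase_with_resets(node_b)

# Shared instance id: pushes from both nodes interleave in one series.
interleaved = [1000, 50, 1010, 55, 1020, 60]
shared = increase_with_resets(interleaved)

print("unique instance ids:", separate)
print("shared instance id:", shared)
```

In this toy example the real increase is 30 calls, while the interleaved series reports 2090, because each switch between the two nodes' values looks like either a reset or a large jump. Giving every node its own instance id, as was done here, keeps each counter in its own series.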