harmony Prometheus metrics seems wrong

Prometheus metrics seems wrong

Open rlan35 opened this issue 2 years ago • 1 comments

The RPC call metrics in our Grafana dashboard are showing hundreds of millions of getLog calls every 5 minutes: https://monitor.harmony.one/d/joo9Q1m7z/s0-explorer-rpc-metrics?orgId=1&from=now-15d&to=now

This number is way out of range: on the cloud service side, we see that overall calls to all RPCs total around hundreds of millions per HOUR.

We need to look into the Prometheus metrics and the reporting to fix the stats data issues.

May 03 '22 23:05 rlan35

It seems there was some sort of aggregation happening on the metrics. We found that some of the nodes were using the same instance id, and all of the metrics are grouped by instance id first. Soph helped fix the instance id (by removing the key), so now each node has its own unique instance id and there is no instance id duplication in the push gateway. After applying this fix, all the metrics are steady and the values are in the proper range. We monitored the metrics for a few days to make sure no impulse or spike would appear again. The results show that the metrics work properly now.

May 10 '22 08:05 GheisMohammadi
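
For anyone hitting something similar: the thread does not say exactly how the duplicated instance id inflated the numbers, so the following is only a sketch of one plausible mechanism, not a confirmed diagnosis. It assumes the push gateway keeps a single value per instance id, so that two nodes sharing an id keep overwriting each other, and that the dashboard computes increases the way Prometheus does for counters, where any drop in value is treated as a counter reset. The node names and counter values are made up for illustration.

```python
def increase_with_resets(samples):
    """Total increase over a counter series, treating any drop as a reset.

    This mirrors the reset handling Prometheus applies to counters
    (without the extrapolation that rate()/increase() also perform).
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total


def scrape_series(pushes):
    """Simulate a gateway that stores one value per instance id.

    `pushes` is a list of (instance_id, counter_value) in arrival order.
    Returns, per instance id, the sequence of values a scraper would see.
    """
    series = {}
    for instance_id, value in pushes:
        series.setdefault(instance_id, []).append(value)
    return series


# Hypothetical counters: node A is busy, node B is nearly idle.
node_a = [1_000_000, 1_000_005, 1_000_010]
node_b = [10, 12, 14]

# Case 1: both nodes push under the same (duplicated) instance id.
shared = []
for a, b in zip(node_a, node_b):
    shared += [("node", a), ("node", b)]
shared_increase = increase_with_resets(scrape_series(shared)["node"])

# Case 2: each node pushes under its own unique instance id.
unique = []
for a, b in zip(node_a, node_b):
    unique += [("node-a", a), ("node-b", b)]
unique_increase = sum(
    increase_with_resets(s) for s in scrape_series(unique).values()
)

print("increase with duplicated instance id:", shared_increase)
print("increase with unique instance ids:   ", unique_increase)
```

In this toy setup the real number of new calls is 14, but the shared series flips between a large and a small counter, and every flip down is read as a reset followed by a huge jump. That would be consistent with the out-of-range values reported above and with the metrics settling once each node had a unique instance id.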
