amazon-vpc-cni-k8s
Counters reported as Gauges in Prometheus metrics
What happened: Some of the Prometheus metrics exported by the VPC CNI plugin are defined with inaccurate metric types. For example:
https://github.com/aws/amazon-vpc-cni-k8s/blob/27ce1362636567592f006b987f3820c6b0fef55e/utils/prometheusmetrics/prometheusmetrics.go#L64
This metric (awscni_add_ip_req_count) is exported as a gauge, but its value is cumulative and only ever incremented. In fact, it appears to be used as a counter in:
https://github.com/aws/amazon-vpc-cni-k8s/blob/27ce1362636567592f006b987f3820c6b0fef55e/pkg/ipamd/rpc_handler.go#L70
It seems that awscni_del_ip_req_count is correctly exported as a counter.
I probably don't have enough context on this to make a judgement call. However, I think there are probably more Gauges that are operating as Counters.
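One rough way to look for other candidates is to scan the plugin's metrics output for gauges whose names suggest a cumulative count. The sketch below uses only the Go standard library and assumes the standard Prometheus text exposition format (`# TYPE <name> <type>` lines). The function name `suspectGauges`, the suffix list, and the sample values are hypothetical, and a name match is only a hint; each hit still needs a look at the code that updates the metric.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// suspectGauges returns the names of metrics declared as gauges in the
// Prometheus text exposition format whose names end in a suffix that
// conventionally signals a cumulative count. This is only a naming
// heuristic (an assumption of this sketch), not a definitive check.
func suspectGauges(exposition string) []string {
	var suspects []string
	scanner := bufio.NewScanner(strings.NewReader(exposition))
	for scanner.Scan() {
		// Type declarations look like: "# TYPE <name> <type>"
		fields := strings.Fields(scanner.Text())
		if len(fields) != 4 || fields[0] != "#" || fields[1] != "TYPE" {
			continue
		}
		name, kind := fields[2], fields[3]
		if kind != "gauge" {
			continue
		}
		if strings.HasSuffix(name, "_count") || strings.HasSuffix(name, "_total") {
			suspects = append(suspects, name)
		}
	}
	return suspects
}

func main() {
	// Hypothetical scrape output; the sample values are made up, and the
	// TYPE lines mirror what this issue describes.
	sample := `# TYPE awscni_add_ip_req_count gauge
awscni_add_ip_req_count 42
# TYPE awscni_del_ip_req_count counter
awscni_del_ip_req_count 40
`
	fmt.Println(suspectGauges(sample))
}
```

Feeding it the real scrape output of the plugin's metrics endpoint would give a short list of names to review by hand.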
Attach logs: N/A
What you expected to happen: I'd expect metrics to follow the semantic conventions defined in https://prometheus.io/docs/concepts/metric_types/
How to reproduce it (as minimally and precisely as possible): Using Prometheus exporters.
Anything else we need to know?: This may not be a critical issue if systems use Prometheus as the backend. However, it becomes a problem when Prometheus metrics are transformed into other representations. For example, OpenTelemetry Collectors will read this metric as a Gauge, which gives its aggregation a different meaning (e.g. one can change the temporality of counters from cumulative to delta or vice versa, whereas a gauge has no temporality to convert).
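To illustrate why the type matters downstream, here is a minimal sketch of a cumulative-to-delta conversion, the kind of transformation that only makes sense for counters. It is not the OpenTelemetry Collector's implementation; the function name `cumulativeToDelta` and the sample values are hypothetical. As far as I understand, a pipeline applies this kind of step only to series typed as counters, so a counter mislabeled as a gauge would pass through unconverted.

```go
package main

import "fmt"

// cumulativeToDelta converts successive samples of a monotonically
// increasing counter into per-interval increases. A drop in value is
// treated as a counter reset (for example, a process restart), in which
// case the new sample itself is taken as the increase.
func cumulativeToDelta(samples []float64) []float64 {
	if len(samples) < 2 {
		return nil
	}
	deltas := make([]float64, 0, len(samples)-1)
	for i := 1; i < len(samples); i++ {
		d := samples[i] - samples[i-1]
		if d < 0 {
			// Counter reset: count everything since the restart.
			d = samples[i]
		}
		deltas = append(deltas, d)
	}
	return deltas
}

func main() {
	// Hypothetical scraped values of awscni_add_ip_req_count, with a
	// restart before the last sample.
	samples := []float64{10, 15, 15, 22, 3}
	fmt.Println(cumulativeToDelta(samples))
}
```

For a true gauge the same subtraction would be meaningless, since a drop is an ordinary change rather than a reset, which is why the declared type decides whether the conversion is applied at all.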
Environment:
- Kubernetes version (use kubectl version): 1.28.12
- CNI Version: 1.16.3
- OS (e.g: cat /etc/os-release): Bottlerocket 1.21.0
- Kernel (e.g. uname -a): x86_64 GNU/Linux