NVIDIA GPU count metrics and bugfix

Description of changes

This PR includes one bug fix and an enhancement that adds NVIDIA GPU count metrics (_limit, _request, and _total) at the pod, node, and cluster levels.

  • GPU count metrics
    • Add NVIDIA GPU count metric replication logic to the metricstransformprocessor translator (see the config sketch after this list)
    • Update AWS EMF exporter metric declarations to include GPU count metrics with dimensions
  • Bug fix
    • Fix the bug where the agent emits pod GPU metrics when there is no active workload
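
For illustration only, the two GPU count changes above map roughly onto collector configuration like the following. The metric names, dimension sets, and the exact transforms the translator generates are assumptions for this sketch, not the agent's actual output:

```yaml
processors:
  metricstransform:
    transforms:
      # Hypothetical example: replicate a pod-level GPU count metric into a _limit variant.
      - include: pod_gpu_total
        match_type: strict
        action: insert
        new_name: pod_gpu_limit

exporters:
  awsemf:
    metric_declarations:
      # Hypothetical dimension sets for the pod-level GPU count metrics.
      - dimensions: [[ClusterName], [ClusterName, Namespace, PodName]]
        metric_name_selectors:
          - pod_gpu_limit
          - pod_gpu_request
          - pod_gpu_total
```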

Related PR in contrib: https://github.com/amazon-contributing/opentelemetry-collector-contrib/pull/214

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Tested on a cluster with two GPU instances (g4dn.12xlarge) running a workload that requests 2 of the 8 available GPU devices, as shown in the screenshots below.

  • Bug fix

    • Before the fix, metrics show an incorrect average pod_gpu_utilization of 25% when the only active workload uses 2 of the 8 GPU devices, because the 6 unused devices still emit 0% utilization: (2 × 100% + 6 × 0%) / 8 = 25%. (screenshot, 2024-05-21 11:12 AM)
    • With the fix, the metric shows 100%. (screenshot, 2024-05-21 11:29 AM)
  • GPU count metrics (screenshot, 2024-05-20 12:11 PM)
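
As a rough sketch of the kind of test workload described above (a pod that requests 2 NVIDIA GPU devices), the manifest below is illustrative only; the pod name and image are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-count-test            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-workload
      image: nvidia/cuda:12.3.1-base-ubuntu22.04   # hypothetical image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 2       # request 2 of the 8 GPU devices in the cluster
```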

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint
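
From the repository root, that amounts to (assuming GNU make and the Go toolchain are installed):

```sh
make fmt fmt-sh   # format Go and shell sources
make lint         # run the linters
```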

movence · May 21 '24