amazon-cloudwatch-agent
NVIDIA GPU count metrics and bug fix
Description of changes
This PR includes 1 bug fix and an enhancement that adds NVIDIA GPU count metrics, including `_limit`, `_request`, and `_total`, at the pod, node, and cluster levels.
- GPU count metrics
  - Add NVIDIA GPU count metric replication logic to the `metricstransformprocessor` translator (see the sketch after this list)
  - Update AWS EMF exporter metric declarations to include GPU count metrics with dimensions
- Bug fix
  - Fix the bug where the agent emits pod GPU metrics when there is no active workload
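For readers unfamiliar with the replication step, below is a minimal conceptual sketch, not the PR's actual implementation (which is expressed through the `metricstransformprocessor` translator configuration): a GPU count metric is copied and renamed into separate variants so they can be aggregated and exported individually. The metric names `container_gpu_count`, `pod_gpu_limit`, `pod_gpu_request`, and `pod_gpu_total` are hypothetical placeholders, and the sketch assumes a recent OpenTelemetry `pdata` API.

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// replicateGPUCountMetrics copies a GPU count metric into renamed variants
// (_limit, _request, _total) so they can be aggregated at pod, node, and
// cluster level downstream.
// NOTE: "container_gpu_count" and the target names are hypothetical
// placeholders, not the agent's real metric names.
func replicateGPUCountMetrics(metrics pmetric.MetricSlice) {
	targets := []string{"pod_gpu_limit", "pod_gpu_request", "pod_gpu_total"}

	// Remember the original length so freshly appended copies are not revisited.
	n := metrics.Len()
	for i := 0; i < n; i++ {
		src := metrics.At(i)
		if src.Name() != "container_gpu_count" { // hypothetical source metric
			continue
		}
		for _, name := range targets {
			dst := metrics.AppendEmpty()
			src.CopyTo(dst) // copies type, data points, and attributes
			dst.SetName(name)
		}
	}
}

func main() {
	sm := pmetric.NewMetrics().ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty()
	m := sm.Metrics().AppendEmpty()
	m.SetName("container_gpu_count")
	m.SetEmptyGauge().DataPoints().AppendEmpty().SetIntValue(2)

	replicateGPUCountMetrics(sm.Metrics())
	fmt.Println("metrics after replication:", sm.Metrics().Len()) // prints 4
}
```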
Related PR in contrib: https://github.com/amazon-contributing/opentelemetry-collector-contrib/pull/214
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Tested on a cluster with 2 GPU instances (g4dn.12xlarge) running a workload that requires 2 GPU devices out of the 8 total, as shown in the graphs below.
- Bug fix
  - Before the fix: metrics show a wrong average `pod_gpu_utilization` of 25% when there is only 1 active workload using 2 GPU devices out of 8 total. This is because unused GPU devices still emit 0 utilization data, so the average works out to (2 × 100%) / 8 = 25%.
  - With the fix: metrics show 100%. (See the filtering sketch after this list.)
- GPU count metrics
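To make the bug fix concrete, here is a rough sketch of the filtering idea, assuming a recent OpenTelemetry `pdata` API: pod-level GPU utilization data points with no associated pod are dropped, so idle devices no longer contribute 0-valued samples to the average. The `PodName` attribute key and the demo values are hypothetical; the actual fix lives in the linked contrib PR.

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// dropIdleGPUDataPoints removes pod-level GPU utilization data points that
// have no pod attached, so GPUs without an active workload stop emitting
// 0-valued samples that drag the average down.
// NOTE: the "PodName" attribute key is a hypothetical placeholder.
func dropIdleGPUDataPoints(metrics pmetric.MetricSlice) {
	for i := 0; i < metrics.Len(); i++ {
		m := metrics.At(i)
		if m.Name() != "pod_gpu_utilization" || m.Type() != pmetric.MetricTypeGauge {
			continue
		}
		m.Gauge().DataPoints().RemoveIf(func(dp pmetric.NumberDataPoint) bool {
			pod, ok := dp.Attributes().Get("PodName")
			return !ok || pod.Str() == "" // no workload is using this GPU device
		})
	}
}

func main() {
	sm := pmetric.NewMetrics().ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty()
	m := sm.Metrics().AppendEmpty()
	m.SetName("pod_gpu_utilization")
	dps := m.SetEmptyGauge().DataPoints()

	busy := dps.AppendEmpty() // GPU device used by the active workload
	busy.SetDoubleValue(100)
	busy.Attributes().PutStr("PodName", "example-gpu-workload")

	idle := dps.AppendEmpty() // unused GPU device: no pod, 0% utilization
	idle.SetDoubleValue(0)

	dropIdleGPUDataPoints(sm.Metrics())
	fmt.Println("data points after filtering:", dps.Len()) // prints 1
}
```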
Requirements
Before committing the code, please complete the following steps.
- Run `make fmt` and `make fmt-sh`
- Run `make lint`