NVIDIA GPU count metrics and bugfix

Description of changes

This PR includes one bug fix and an enhancement that adds NVIDIA GPU count metrics (_limit, _request, and _total) at the pod, node, and cluster levels.

  • GPU count metrics
    • Add NVIDIA GPU count metric replication logic to the metricstransformprocessor translator (see the config sketch after this list)
    • Update AWS EMF exporter metric declarations to include GPU count metrics with dimensions
  • Bug fix
    • Fix the bug where the agent emits pod GPU metrics when there is no active workload
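
For illustration only, the two GPU count changes above map roughly onto collector configuration like the following. The metric names, dimension sets, and the exact transforms the translator generates are assumptions for this sketch, not the agent's actual output:

```yaml
processors:
  metricstransform:
    transforms:
      # Hypothetical example: replicate a pod-level GPU count metric into a _limit variant.
      - include: pod_gpu_total
        match_type: strict
        action: insert
        new_name: pod_gpu_limit

exporters:
  awsemf:
    metric_declarations:
      # Hypothetical dimension sets for the pod-level GPU count metrics.
      - dimensions: [[ClusterName], [ClusterName, Namespace, PodName]]
        metric_name_selectors:
          - pod_gpu_limit
          - pod_gpu_request
          - pod_gpu_total
```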

Related PR in contrib: https://github.com/amazon-contributing/opentelemetry-collector-contrib/pull/214

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Tested on a cluster with two GPU instances (g4dn.12xlarge) running a workload that requests 2 of the 8 available GPU devices, as shown in the screenshots below.

  • Bug fix

    • Before the fix, metrics show an incorrect average pod_gpu_utilization of 25% when the only active workload uses 2 of the 8 GPU devices, because the 6 unused devices still emit 0% utilization: (2 × 100% + 6 × 0%) / 8 = 25%. (screenshot, 2024-05-21 11:12 AM)
    • With the fix, the metric shows 100%. (screenshot, 2024-05-21 11:29 AM)
  • GPU count metrics (screenshot, 2024-05-20 12:11 PM)
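
As a rough sketch of the kind of test workload described above (a pod that requests 2 NVIDIA GPU devices), the manifest below is illustrative only; the pod name and image are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-count-test            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-workload
      image: nvidia/cuda:12.3.1-base-ubuntu22.04   # hypothetical image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 2       # request 2 of the 8 GPU devices in the cluster
```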

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint
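
From the repository root, that amounts to (assuming GNU make and the Go toolchain are installed):

```sh
make fmt fmt-sh   # format Go and shell sources
make lint         # run the linters
```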

movence · May 21 '24