Add cURL, wget or something similar for basic localhost URL checks that metrics are being produced.

Open hassanbabaie opened this issue 11 months ago • 1 comments

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

Please provide a clear description of the problem this feature solves

We've seen cases where dcgm-exporter stops exporting metrics while continuing to run without error. In a number of cases, the fix has been simply to restart the pod.

We plan to use a livenessProbe that runs cURL (or something similar) and looks for DCGM_FI_DEV_GPU_UTIL. However, as far as I can tell, the minimal image does not include cURL or wget.

We could rebuild the image ourselves, but I feel this would be really useful for other users of dcgm-exporter.

Could a binary be added to the image by default to support checking the metrics exported over HTTP?

For example, users could then do something like:

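livenessProbe:
  exec:
    command:
    - sh
    - -c
    # Count metric samples; curl -sf stays quiet and fails on HTTP errors.
    # POSIX [ ] is used because sh may not support [[ ]].
    - >-
      RESPONSE=$(curl -sf localhost:9400/metrics | grep -c 'DCGM_FI_DEV_GPU_UTIL{');
      if [ "$RESPONSE" -ge 1 ]; then exit 0; else exit 1; fi
  initialDelaySeconds: 5
  periodSeconds: 5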
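
The check in the probe above can also be factored into a small function that is easy to try outside the cluster. This is only a sketch: `has_gpu_util` is a hypothetical name, not something shipped with dcgm-exporter, and it assumes the endpoint returns Prometheus text format, where comment lines such as `# HELP` start with `#` and sample lines start with the metric name.

```shell
# Hypothetical helper (not part of dcgm-exporter): reads Prometheus text
# from stdin and succeeds only if at least one DCGM_FI_DEV_GPU_UTIL sample
# line is present. HELP/TYPE comment lines start with '#', so the '^'
# anchor skips them.
has_gpu_util() {
  grep -q '^DCGM_FI_DEV_GPU_UTIL{'
}

# Intended use inside the probe (still requires curl in the image):
#   curl -sf localhost:9400/metrics | has_gpu_util
```

Because `grep -q` exits 0 on a match and non-zero otherwise, the exit status can serve directly as the probe result, with no need to count lines or compare numbers.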

Feature Description

See above. This would enable a livenessProbe that checks the metrics URL without requiring a rebuild of the Docker image.

Describe your ideal solution

Add cURL or wget to the published Docker images.

hassanbabaie, Feb 11 '25 21:02

I've never seen this. In what situations are you finding that it suddenly stops publishing metrics? That seems like the issue which needs to be investigated.

chipzoller, Feb 19 '25 21:02