INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
Hello! I have built dcgm-exporter from source with
git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
make binary
Then I created a custom metrics file with
cat << EOT > dcp-metrics-custom.csv
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active.
DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active.
EOT
Finally, I started dcgm-exporter with the custom metrics file:
sudo cmd/dcgm-exporter/dcgm-exporter -c 500 -f dcp-metrics-custom.csv
This gives me
2024/10/09 11:10:23 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] Falling back to metric file 'dcp-metrics-custom.csv'
WARN[0000] Skipping line 0 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled
WARN[0000] Skipping line 1 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
WARN[0000] Skipping line 2 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled
WARN[0000] Skipping line 3 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled
INFO[0000] Not collecting GPU metrics; no fields to watch for device type: 1
INFO[0000] Not collecting NvSwitch metrics; no fields to watch for device type: 3
INFO[0000] Not collecting NvLink metrics; no fields to watch for device type: 6
INFO[0000] Not collecting CPU metrics; no fields to watch for device type: 7
INFO[0000] Not collecting CPU Core metrics; no fields to watch for device type: 8
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
INFO[0000] Listening on address="[::]:9400"
INFO[0000] TLS is disabled. address="[::]:9400" http2=false
Checking http://localhost:9400/metrics shows no metrics at all, so I assume they are not being collected (and/or not enabled), which is also what the dcgm-exporter logs say.
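A quick way to double-check this from the shell (assuming the default port 9400):
# Scrape the exporter endpoint directly; no DCGM_ lines in the output
# means no fields are being collected.
curl -s http://localhost:9400/metrics | grep '^DCGM_'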
I have also tried the latest dcgm-exporter Docker images (nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04, the latest, and nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04, which matches my driver that ships with CUDA 12.2) with
docker run --gpus all -v ./custom_metrics/dcp-metrics-custom.csv:/etc/dcgm-exporter/custom_metrics/dcp-metrics-custom.csv --net host --cap-add SYS_ADMIN --privileged nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04 -f /etc/dcgm-exporter/custom_metrics/dcp-metrics-custom.csv
But it gives me the same output
2024/10/09 11:51:23 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
time="2024-10-09T11:51:23Z" level=info msg="Starting dcgm-exporter"
time="2024-10-09T11:51:23Z" level=info msg="DCGM successfully initialized!"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-10-09T11:51:24Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/custom_metrics/dcp-metrics-custom.csv'"
time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 0 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled"
time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 1 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 2 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled"
time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 3 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting GPU metrics; no fields to watch for device type: 1"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-10-09T11:51:24Z" level=info msg="Pipeline starting"
time="2024-10-09T11:51:24Z" level=info msg="Starting webserver"
time="2024-10-09T11:51:24Z" level=info msg="Listening on" address="[::]:9400"
time="2024-10-09T11:51:24Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
How should I deal with this issue? And how do I enable these metrics?
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:01:00.0 On | N/A |
|100% 91C P2 141W / 170W | 4675MiB / 12288MiB | 89% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2436 G /usr/lib/xorg/Xorg 1216MiB |
| 0 N/A N/A 2729 G /usr/bin/gnome-shell 169MiB |
| 0 N/A N/A 4459 G ...Telegram/Telegram 2MiB |
| 0 N/A N/A 5432 G ...ures=SpareRendererForSitePerProcess 131MiB |
| 0 N/A N/A 8298 G ...seed-version=20241008-180117.502000 523MiB |
| 0 N/A N/A 55717 G ...ures=SpareRendererForSitePerProcess 68MiB |
| 0 N/A N/A 71215 G ...erProcess --variations-seed-version 60MiB |
| 0 N/A N/A 591055 C /prog 2478MiB |
+---------------------------------------------------------------------------------------+
$ dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules |
| Status: Success |
+===========+====================+==================================================+
| Module ID | Name | State |
+-----------+--------------------+--------------------------------------------------+
| 0 | Core | Loaded |
| 1 | NvSwitch | Loaded |
| 2 | VGPU | Not loaded |
| 3 | Introspection | Not loaded |
| 4 | Health | Not loaded |
| 5 | Policy | Not loaded |
| 6 | Config | Not loaded |
| 7 | Diag | Not loaded |
| 8 | Profiling | Not loaded |
| 9 | SysMon | Not loaded |
+-----------+--------------------+--------------------------------------------------+
$ sudo nv-hostengine -f host.log --log-level debug
Err: Failed to start DCGM Server: -7
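As an extra check, one can try to watch a profiling field directly with dcgmi; a sketch, where 1003 is the field ID of DCGM_FI_PROF_SM_OCCUPANCY:
# Sample SM occupancy five times; on GPUs without DCP support this
# fails with a "not supported" error instead of printing values.
dcgmi dmon -e 1003 -c 5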
I just found out that DCGM does not support profiling (DCP) metrics on GTX/RTX GPUs, unfortunately, as pointed out by this comment. It would be really useful to add this to the documentation, as one can easily build a cloud with GTX/RTX GPUs.
Is there a similar tool that does the same thing for GTX/RTX, apart of course from profiling with nsys/ncu?
I just want to monitor the SM occupancy rate at every point in time without interfering with the running programs.
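In the meantime, the closest substitute I am aware of is polling coarse utilization via nvidia-smi/NVML; a sketch (note this reports overall GPU utilization, not true SM occupancy):
# Print timestamp, GPU utilization, and memory use once per second;
# the query goes through NVML and does not perturb running programs.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1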
I also encountered the same problem; how can it be solved? Driver Version: 525.85.12, exporter image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
nvidia-smi
Sat Oct 12 16:06:30 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
logs:
time="2024-10-12T07:02:49Z" level=info msg="Starting dcgm-exporter"
time="2024-10-12T07:02:49Z" level=info msg="DCGM successfully initialized!"
time="2024-10-12T07:02:49Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-10-12T07:02:49Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_NVLINK_RX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_NVLINK_TX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 28 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 29 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 30 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 31 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 32 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 33 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 34 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 35 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 36 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=info msg="Initializing system entities of type: GPU"
time="2024-10-12T07:02:55Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2024-10-12T07:02:55Z" level=info msg="Not collecting switch metrics: no switches to monitor"
time="2024-10-12T07:02:55Z" level=info msg="Initializing system entities of type: NvLink"
time="2024-10-12T07:02:55Z" level=info msg="Not collecting link metrics: no switches to monitor"
time="2024-10-12T07:02:55Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-10-12T07:02:55Z" level=info msg="Pipeline starting"
time="2024-10-12T07:02:55Z" level=info msg="Starting webserver"
Hi, have you fixed it? Thanks!
Hello! No, there is no way to fix it for consumer-grade GPUs. This tool is built specifically for datacenter GPUs, unfortunately. Hopefully, NVIDIA will add a similar tool for consumer-grade GPUs in the future.
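For basic monitoring, the non-profiling DCGM_FI_DEV_* fields (temperature, power, memory, coarse utilization) should still be exported on GeForce cards; a sketch of a counters file using only such fields, with field names taken from the stock default counters:
cat << EOT > dev-metrics-custom.csv
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
EOT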
I see, thank you!