No NVLINK activity on DGX-A100 320GB
We use dcgm-exporter 3.3.3-3.3.0, nv-hostengine & dcgmi 3.3.3, NVIDIA driver 535.154.05, and DGXOS 6 on a DGX-A100 320GB. The metrics CSV contains:
DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, The rate of data not including protocol headers transmitted over NVLink (in B/s).
DCGM_FI_PROF_NVLINK_RX_BYTES, gauge, The rate of data not including protocol headers received over NVLink (in B/s).
However, the exporter always returns 0:
# curl -s localhost:9400/metrics | grep NVLINK
# HELP DCGM_FI_PROF_NVLINK_TX_BYTES The rate of data not including protocol headers transmitted over NVLink (in B/s).
# TYPE DCGM_FI_PROF_NVLINK_TX_BYTES gauge
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="0",UUID="GPU-715daa1d-db6f-9e69-ab48-190158bd5360",device="nvidia0",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="1",UUID="GPU-02348a17-a825-300c-0336-48e33d0dadb2",device="nvidia1",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="2",UUID="GPU-fbd9a227-e473-b993-215f-8f39b3574fd0",device="nvidia2",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="3",UUID="GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e",device="nvidia3",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="4",UUID="GPU-2a15688f-4b5f-999c-48dc-e9ec78b78531",device="nvidia4",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="5",UUID="GPU-995a8ef3-32b6-2e07-be4f-ac9d0371a7f1",device="nvidia5",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="6",UUID="GPU-88981248-fa05-f000-d761-05c8de30c8c6",device="nvidia6",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="7",UUID="GPU-f7bbbbcd-f23c-ad4f-f27b-043995ee3fb8",device="nvidia7",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
# HELP DCGM_FI_PROF_NVLINK_RX_BYTES The rate of data not including protocol headers received over NVLink (in B/s).
# TYPE DCGM_FI_PROF_NVLINK_RX_BYTES gauge
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="0",UUID="GPU-715daa1d-db6f-9e69-ab48-190158bd5360",device="nvidia0",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="1",UUID="GPU-02348a17-a825-300c-0336-48e33d0dadb2",device="nvidia1",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="2",UUID="GPU-fbd9a227-e473-b993-215f-8f39b3574fd0",device="nvidia2",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="3",UUID="GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e",device="nvidia3",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="4",UUID="GPU-2a15688f-4b5f-999c-48dc-e9ec78b78531",device="nvidia4",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="5",UUID="GPU-995a8ef3-32b6-2e07-be4f-ac9d0371a7f1",device="nvidia5",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="6",UUID="GPU-88981248-fa05-f000-d761-05c8de30c8c6",device="nvidia6",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="7",UUID="GPU-f7bbbbcd-f23c-ad4f-f27b-043995ee3fb8",device="nvidia7",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
dcgmi shows the same:
# dcgmi dmon -e 1011,1012
#Entity   NVLTX     NVLRX
ID
GPU 7     0         0
GPU 6     0         0
GPU 5     0         0
GPU 4     0         0
GPU 3     0         0
GPU 2     0         0
GPU 1     0         0
GPU 0     0         0
GPU 7     0         0
GPU 6     0         0
GPU 5     0         0
GPU 4     0         0
GPU 3     0         0
GPU 2     0         0
GPU 1     0         0
GPU 0     0         0
GPU 7     0         0
GPU 6     0         0
GPU 5     0         0
GPU 4     0         0
GPU 3     0         0
GPU 2     0         0
GPU 1     0         0
GPU 0     0         0
GPU 7     0         0
GPU 6     0         0
GPU 5     0         0
GPU 4     0         0
GPU 3     0         0
GPU 2     0         0
GPU 1     0         0
GPU 0     0         0
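For completeness, the physical NVLink state can be cross-checked directly against the driver, independent of DCGM (a quick sanity check using nvidia-smi's nvlink subcommand):

# nvidia-smi nvlink -s        # per-GPU NVLink link state and speed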
At least as of 18 January 2024 the data was still there. Since then there have been a couple of updates to packages, drivers, and DCGM.
The system has the latest DGXOS 6.1, the latest firmware, and all Ubuntu updates applied.
In nv-hostengine.log I see the following errors:
2024-02-01 14:12:01.205 ERROR [9596:9598] [[SysMon]] Couldn't open CPU Vendor info file file '/sys/devices/soc0/soc_id' for reading: '@^V���^?' [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/sysmon/DcgmSystemMonitor.cpp:261] [DcgmSystemMonitor::ReadCpuVendorAndModel]
2024-02-01 14:12:01.207 ERROR [9596:9598] [[SysMon]] A runtime exception occured when creating module. Ex: Incompatible hardware vendor for sysmon. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-01 14:12:01.207 ERROR [9596:9598] Failed to load module 9 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1831] [DcgmHostEngineHandler::LoadModule]
2024-02-01 14:26:13.984 ERROR [9694:9696] [[SysMon]] Couldn't open CPU Vendor info file file '/sys/devices/soc0/soc_id' for reading: '@F��H^?' [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/sysmon/DcgmSystemMonitor.cpp:261] [DcgmSystemMonitor::ReadCpuVendorAndModel]
2024-02-01 14:26:13.986 ERROR [9694:9696] [[SysMon]] A runtime exception occured when creating module. Ex: Incompatible hardware vendor for sysmon. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-01 14:26:13.986 ERROR [9694:9696] Failed to load module 9 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1831] [DcgmHostEngineHandler::LoadModule]
2024-02-01 14:26:27.640 ERROR [9694:14170] [[NvSwitch]] NSCQ field Id 0 passed error -5 for device 0x9c9a90 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:1102] [DcgmNs::DcgmNvSwitchManager::ReadLinkStatesAllSwitches]
2024-02-01 14:26:27.640 ERROR [9694:14170] [[NvSwitch]] NSCQ field Id 0 passed error -5 for device 0x9c9a90 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:1102] [DcgmNs::DcgmNvSwitchManager::ReadLinkStatesAllSwitches]
2024-02-01 14:26:27.641 ERROR [9694:14170] [[NvSwitch]] NSCQ field Id 0 passed error -5 for device 0x9c9a90 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:1102] [DcgmNs::DcgmNvSwitchManager::ReadLinkStatesAllSwitches]
...
2024-02-01 14:26:27.736 ERROR [9694:14170] [[NvSwitch]] NSCQ field Id 0 passed error -5 for device 0x9c9b60 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:1102] [DcgmNs::DcgmNvSwitchManager::ReadLinkStatesAllSwitches]
2024-02-01 15:06:56.596 ERROR [9694:9695] Received This request is serviced by a module of DCGM that is not currently loaded [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:227] [DcgmHostEngineHandler::GetAllEntitiesOfEntityGroup]
2024-02-01 15:15:17.731 ERROR [9694:14440] [[Profiling]] FieldId {1040} is not supported for GPU 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2709] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ConvertFieldIdsToMetricIds]
2024-02-01 15:15:17.731 ERROR [9694:14440] [[Profiling]] Unable to reconfigure LOP metric watches for GpuId {0} [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2740] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ChangeWatchSt>
2024-02-01 15:15:17.819 ERROR [9694:9696] DCGM_PROFILING_SR_WATCH_FIELDS failed with -6 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3710] [DcgmHostEngineHandler::WatchFieldGroup]
From the logs I see that the DCGM_FI_PROF_NVLINK_L0_TX_BYTES (1040) field was used instead of DCGM_FI_PROF_NVLINK_TX_BYTES (1011):

[[Profiling]] FieldId {1040} is not supported for GPU 0

DCGM_FI_PROF_NVLINK_L0_TX_BYTES is only supported on Hopper+ GPUs.
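The set of profiling fields a given GPU actually supports can be enumerated with dcgmi (shown here for GPU 0; the exact list depends on the chip generation):

# dcgmi profile --list -i 0

On A100 the aggregate NVLink fields 1011/1012 should appear in the list, while the per-link 1040+ fields will not.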
@nikkon-dev, I apologise, that was my bad; while preparing this issue I ran (based on https://github.com/NVIDIA/DCGM/issues/119) the command

dcgmi dmon -d 100 -e 1040

and received the output above. Using -e 1011 and/or -e 1012 I do receive data, but it is always 0, which shouldn't be the case: I am running a dummy LLM training job, and the same training on an 8x A100 80GB PCIe + NVLink machine and on a DGX-H100, all with the same dcgm-exporter setup, shows the NVLink massively in use.
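To take the training job out of the equation entirely, DCGM's own load generator can drive known NVLink traffic while the counters are watched from a second shell (a sketch; dcgmproftester11 ships alongside DCGM, and the numeric suffix depends on the bundled CUDA version):

# shell 1: generate the workload behind field 1011 for 30 s
dcgmproftester11 --no-dcgm-validation -t 1011 -d 30

# shell 2: watch the aggregate NVLink counters meanwhile
dcgmi dmon -e 1011,1012

If the counters stay at 0 while the tester runs, the problem is in DCGM's profiling path rather than in the workload.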
This is what I see in the dcgm-exporter container logs:
# docker logs docker.dcgm-exporter.service
time="2024-02-02T07:56:31Z" level=info msg="Starting dcgm-exporter"
time="2024-02-02T07:56:31Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
time="2024-02-02T07:56:31Z" level=info msg="DCGM successfully initialized!"
time="2024-02-02T07:56:31Z" level=info msg="Collecting DCP Metrics"
time="2024-02-02T07:56:31Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2024-02-02T07:56:31Z" level=info msg="Initializing system entities of type: GPU"
time="2024-02-02T07:56:33Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2024-02-02T07:56:33Z" level=info msg="Not collecting switch metrics: No fields to watch for device type: 3"
time="2024-02-02T07:56:33Z" level=info msg="Initializing system entities of type: NvLink"
time="2024-02-02T07:56:33Z" level=info msg="Not collecting link metrics: No fields to watch for device type: 6"
time="2024-02-02T07:56:33Z" level=info msg="Initializing system entities of type: CPU"
time="2024-02-02T07:56:33Z" level=info msg="Not collecting cpu metrics: Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-02-02T07:56:33Z" level=info msg="Initializing system entities of type: CPU Core"
time="2024-02-02T07:56:33Z" level=info msg="Not collecting cpu core metrics: Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-02-02T07:56:33Z" level=info msg="Pipeline starting"
time="2024-02-02T07:56:33Z" level=info msg="Starting webserver"
level=info ts=2024-02-02T07:56:33.674Z caller=tls_config.go:313 msg="Listening on" address=[::]:9400
level=info ts=2024-02-02T07:56:33.674Z caller=tls_config.go:316 msg="TLS is disabled." http2=false address=[::]:9400
The command used to run it is:
/usr/bin/docker run --rm --gpus all --net host --cap-add=SYS_ADMIN --cpus=0.5 --name docker.dcgm-exporter.service -p 9400:9400 -v "/opt/deepops/nvidia-dcgm-exporter/dcgm-custom-metrics.csv:/etc/dcgm-exporter/default-counters.csv" nvcr.io/nvidia/k8s/dcgm-exporter:3.3.3-3.3.0-ubuntu22.04 -r localhost:5555 -f /etc/dcgm-exporter/default-counters.csv
The file /etc/dcgm-exporter/default-counters.csv contains the DCGM_FI_PROF_NVLINK_TX_BYTES and DCGM_FI_PROF_NVLINK_RX_BYTES fields.
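That the mounted file is what the container actually sees can be confirmed in place (the container name matches the docker run command above):

# docker exec docker.dcgm-exporter.service grep NVLINK /etc/dcgm-exporter/default-counters.csv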
Let me know if you need me to collect more data.
I see you are running nv-hostengine on port 5555. Could you rerun it with the -f host.debug.log --log-level debug arguments and provide host.debug.log, either after dcgm-exporter starts reporting metrics or after running the dcgmi dmon -e 1011 command?
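If nv-hostengine is managed by systemd (the nvidia-dcgm service on DGXOS), the flags can be added via a drop-in override; a sketch, where the base command should be copied from whatever systemctl cat nvidia-dcgm shows on your system:

# systemctl edit nvidia-dcgm     # opens a drop-in override
[Service]
ExecStart=
ExecStart=/usr/bin/nv-hostengine -f /var/log/host.debug.log --log-level debug

# systemctl daemon-reload && systemctl restart nvidia-dcgm

(The empty ExecStart= line is required to clear the unit's original command before replacing it.)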
Could you also provide topology output from nvidia-smi (nvidia-smi topo -m)?
# nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  NIC9  CPU Affinity      NUMA Affinity  GPU NUMA ID
GPU0    X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   48-63,176-191     3              N/A
GPU1    NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   48-63,176-191     3              N/A
GPU2    NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   16-31,144-159     1              N/A
GPU3    NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   16-31,144-159     1              N/A
GPU4    NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   112-127,240-255   7              N/A
GPU5    NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   112-127,240-255   7              N/A
GPU6    NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   80-95,208-223     5              N/A
GPU7    NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   80-95,208-223     5              N/A
NIC0    PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   X     PXB   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS
NIC1    PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   PXB   X     SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS
NIC2    SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   X     PXB   SYS   SYS   SYS   SYS   SYS   SYS
NIC3    SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   PXB   X     SYS   SYS   SYS   SYS   SYS   SYS
NIC4    SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   X     PXB   SYS   SYS   SYS   SYS
NIC5    SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   PXB   X     SYS   SYS   SYS   SYS
NIC6    SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   X     PXB   SYS   SYS
NIC7    SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   PXB   X     SYS   SYS
NIC8    SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     PIX
NIC9    SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   PIX   X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
The host.debug.log is attached. I obtained it by stopping the dcgm-exporter service, stopping the nvidia-dcgm service, adding the arguments to nv-hostengine, restarting the nvidia-dcgm service, restarting the dcgm-exporter service, and then also running the command from the CLI.
Weird. After the second restart of both services I noticed that it started working again. So I did the following: modified the nv-hostengine service to include the logging arguments, rebooted the system, and started the job; the NVLink data from dcgm-exporter is all 0, and dcgmi run from the CLI shows all 0. This is in the attached boot_host.debug.log.zip.
Then I stopped both services and restarted them again; NVLink data started being collected correctly, and both dcgm-exporter and dcgmi run from the CLI returned values other than 0. This is in the attached restart_host.debug.log.zip.
What could be the cause (start-up order?), and how can it be resolved?
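If it really is start-up ordering, one thing I could try is pinning the order with a systemd drop-in (an untested sketch, assuming the DGXOS unit names nvidia-dcgm.service and nvidia-fabricmanager.service):

# /etc/systemd/system/nvidia-dcgm.service.d/order.conf
[Unit]
After=nvidia-fabricmanager.service
Wants=nvidia-fabricmanager.service

# systemctl daemon-reload && systemctl restart nvidia-dcgm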