DCGM
dcgm-exporter crashes hostengine.
Running a 3.3.5-3.4.0 exporter on a 3.3.5 host engine, as shipped via the NVIDIA Ubuntu repos, SEGFAULTs the host engine.
Is there something I can do? Should this be reported against the exporter instead?
Logs:
dmesg crash info
[Feb28 16:22] nvidia-nvswitch5: open (major=510)
[ +0,042810] nvidia-nvswitch4: open (major=510)
[ +0,042606] nvidia-nvswitch0: open (major=510)
[ +0,042409] nvidia-nvswitch2: open (major=510)
[ +0,042448] nvidia-nvswitch1: open (major=510)
[ +0,042372] nvidia-nvswitch3: open (major=510)
[Feb28 16:29] nv-hostengine[1280071]: segfault at 28 ip 00007f09f65c74b2 sp 00007f09f61e2ba0 error 6 in libdcgmmodulenvswitch.so.3.3.5[7f09f658c000+f8000]
[ +0,000008] Code: 7d b8 44 88 6d b0 e8 7d 0a ff ff 48 8b 45 a8 48 8b 73 18 48 89 45 c0 48 3b 73 20 0f 84 df 00 00 00 66 0f 6f 45 b0 48 83 c6 18 <0f> 11 46 e8 48 8b 45 c0 48 89 46 f8 48 89 73 18 48 8d 65 d8 5b 41
[ +0,155916] nvidia-nvswitch3: release (major=510)
[ +0,000005] nvidia-nvswitch1: release (major=510)
[ +0,000002] nvidia-nvswitch2: release (major=510)
[ +0,000003] nvidia-nvswitch0: release (major=510)
[ +0,000002] nvidia-nvswitch4: release (major=510)
[ +0,000002] nvidia-nvswitch5: release (major=510)
journal for exporter and hostengine
Feb 28 16:21:57 gx01 systemd[1]: Started NVIDIA DCGM service.
Feb 28 16:21:58 gx01 nv-hostengine[1280055]: DCGM initialized
Feb 28 16:21:58 gx01 nv-hostengine[1280055]: Started host engine version 3.3.5 using port number: 5555
Feb 28 16:29:08 gx01 systemd[1]: Started DCGM Exporter.
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Starting dcgm-exporter"
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="DCGM successfully initialized!"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Collecting DCP Metrics"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Falling back to metric file '/net/mgmtdelab/pool/html/dcgm/current/counters.csv'"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Initializing system entities of type: GPU"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvSwitch"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvLink"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU Core"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU Core metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="can not destroy group" error="Error destroying group: Host engine connection invalid/disconnected" groupID="{21}"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="Cannot destroy field group." error="Host engine connection invalid/disconnected"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=fatal msg="Failed to watch metrics: Error watching fields: Host engine connection invalid/disconnected"
Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'.
Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Main process exited, code=killed, status=11/SEGV
Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Failed with result 'signal'.
Versions
# dcgm-exporter -v --debug
DCGM Exporter version 3.3.5-3.4.0
# dcgmi -v
Version : 3.3.5
Build ID : 14
Build Date : 2024-02-24
Build Type : Release
Commit ID : 93088b0e1286c6e7723af1930251298870e26c19
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 08a0d9624b562a1342bf5f8828939294
apt-cache policy datacenter-gpu-manager
# apt-cache policy datacenter-gpu-manager
datacenter-gpu-manager:
Installed: 1:3.3.5
Candidate: 1:3.3.5
Version table:
*** 1:3.3.5 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
100 /var/lib/dpkg/status
1:3.3.3 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.3.1 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.3.0 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.2.6 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.2.5 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.2.3 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.1.8 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.1.7 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.1.6 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.1.3 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.0.4 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.4.8 600
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.4.7 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.4.6 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.4.5 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.3.6 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.3.5 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.3.4 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.3.2 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.3.1 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.2.9 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.2.8 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.2.3 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.1.8 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.1.7 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.1.4 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.0.15 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.0.14 600
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.0.13 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal/common amd64 Packages
OS info
# cat /etc/dgx-release
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2020-10-26-11-53-11"
DGX_SWBUILD_VERSION="5.0.0"
DGX_COMMIT_ID="7501dff"
DGX_PLATFORM="DGX Server for DGX A100"
DGX_SERIAL_NUMBER="XXXXXXXXXXXX"
DGX_OTA_VERSION="5.0.5"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"
DGX_OTA_VERSION="5.1.1"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"
DGX_OTA_VERSION="5.2.0"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"
DGX_OTA_VERSION="5.3.1"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"
DGX_OTA_VERSION="5.5.1"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"
Hi @krono, thank you for the report. Is the issue easily reproducible? Would it be possible to get an nv-hostengine core dump?
EDIT: follow-up question: do you get any syslog kernel error messages for NVLink in the 16:21-16:29 timeframe?
Hi @superg (somehow I don't get gh mails anymore, sorry)
kernel syslog messages in timeframe
root@gx01:/var/log# grep '^Feb 28 16:[23]' syslog.1
Feb 28 16:20:50 gx01 kernel: [103278.185814] audit: type=1400 audit(1709133650.187:1080): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1278914/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:21:20 gx01 kernel: [103308.432689] audit: type=1400 audit(1709133680.436:1081): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1279373/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:21:30 gx01 kernel: [103318.719235] audit: type=1400 audit(1709133690.720:1082): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1279512/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:21:31 gx01 kernel: [103319.282115] audit: type=1400 audit(1709133691.284:1083): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1279555/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:21:31 gx01 kernel: [103319.491742] audit: type=1400 audit(1709133691.492:1084): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1279656/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:21:57 gx01 systemd[1]: Started NVIDIA DCGM service.
Feb 28 16:21:57 gx01 kernel: [103345.680201] nvidia-nvswitch5: open (major=510)
Feb 28 16:21:57 gx01 kernel: [103345.723011] nvidia-nvswitch4: open (major=510)
Feb 28 16:21:57 gx01 kernel: [103345.765617] nvidia-nvswitch0: open (major=510)
Feb 28 16:21:57 gx01 kernel: [103345.808026] nvidia-nvswitch2: open (major=510)
Feb 28 16:21:57 gx01 kernel: [103345.850474] nvidia-nvswitch1: open (major=510)
Feb 28 16:21:57 gx01 kernel: [103345.892846] nvidia-nvswitch3: open (major=510)
Feb 28 16:21:58 gx01 nv-hostengine: DCGM initialized
Feb 28 16:21:58 gx01 nv-hostengine[1280055]: Started host engine version 3.3.5 using port number: 5555
Feb 28 16:22:03 gx01 kernel: [103351.124025] audit: type=1400 audit(1709133723.125:1085): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1280110/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:22:32 gx01 systemd[1]: Started DCGM Exporter.
Feb 28 16:22:32 gx01 kernel: [103380.216261] audit: type=1400 audit(1709133752.217:1086): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/28026/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:22:32 gx01 dcgm-exporter[1280577]: /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter)
Feb 28 16:22:32 gx01 dcgm-exporter[1280577]: /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter)
Feb 28 16:22:32 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 16:22:32 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'.
Feb 28 16:23:02 gx01 systemd[1]: dcgm-exporter.service: Scheduled restart job, restart counter is at 3.
Feb 28 16:23:02 gx01 systemd[1]: Stopped DCGM Exporter.
Feb 28 16:23:02 gx01 systemd[1]: Started DCGM Exporter.
Feb 28 16:23:02 gx01 dcgm-exporter[1280963]: /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter)
Feb 28 16:23:02 gx01 dcgm-exporter[1280963]: /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter)
Feb 28 16:23:02 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 16:23:02 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'.
Feb 28 16:23:07 gx01 systemd[1]: Stopped DCGM Exporter.
Feb 28 16:24:11 gx01 kernel: [103479.881565] audit: type=1400 audit(1709133851.887:1087): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281838/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:24:12 gx01 kernel: [103480.388663] audit: type=1400 audit(1709133852.395:1088): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281885/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:24:12 gx01 kernel: [103480.539563] audit: type=1400 audit(1709133852.543:1089): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281908/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:24:13 gx01 kernel: [103481.137739] audit: type=1400 audit(1709133853.143:1090): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281946/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:24:13 gx01 kernel: [103481.651807] audit: type=1400 audit(1709133853.655:1091): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281992/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:24:13 gx01 kernel: [103481.804767] audit: type=1400 audit(1709133853.811:1092): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1282016/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:24:54 gx01 systemd[1]: tmp-.esp_tmp-nvme1n1p1.mount: Succeeded.
Feb 28 16:24:54 gx01 systemd[1]: tmp-.esp_tmp-nvme2n1p1.mount: Succeeded.
Feb 28 16:25:01 gx01 kernel: [103529.974717] audit: type=1400 audit(1709133901.980:1093): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1282647/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:25:01 gx01 CRON[1282648]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 28 16:25:01 gx01 kernel: [103529.976017] audit: type=1400 audit(1709133901.984:1094): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1282648/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:25:52 gx01 kernel: [103580.650881] audit: type=1400 audit(1709133952.657:1095): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1284754/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:26:22 gx01 kernel: [103610.898676] audit: type=1400 audit(1709133982.906:1096): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1285172/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:28:11 gx01 kernel: [103719.026823] audit: type=1400 audit(1709134091.036:1097): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1297267/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:28:11 gx01 kernel: [103719.579399] audit: type=1400 audit(1709134091.588:1098): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1297309/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:28:11 gx01 kernel: [103719.755666] audit: type=1400 audit(1709134091.764:1099): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1297333/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:29:08 gx01 systemd[1]: Started DCGM Exporter.
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Starting dcgm-exporter"
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="DCGM successfully initialized!"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Collecting DCP Metrics"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Falling back to metric file '/net/mgmtdelab/pool/html/dcgm/current/counters.csv'"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Initializing system entities of type: GPU"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvSwitch"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvLink"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU Core"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU Core metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
Feb 28 16:29:13 gx01 kernel: [103781.951727] nv-hostengine[1280071]: segfault at 28 ip 00007f09f65c74b2 sp 00007f09f61e2ba0 error 6 in libdcgmmodulenvswitch.so.3.3.5[7f09f658c000+f8000]
Feb 28 16:29:13 gx01 kernel: [103781.951735] Code: 7d b8 44 88 6d b0 e8 7d 0a ff ff 48 8b 45 a8 48 8b 73 18 48 89 45 c0 48 3b 73 20 0f 84 df 00 00 00 66 0f 6f 45 b0 48 83 c6 18 <0f> 11 46 e8 48 8b 45 c0 48 89 46 f8 48 89 73 18 48 8d 65 d8 5b 41
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="can not destroy group" error="Error destroying group: Host engine connection invalid/disconnected" groupID="{21}"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="Cannot destroy field group." error="Host engine connection invalid/disconnected"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=fatal msg="Failed to watch metrics: Error watching fields: Host engine connection invalid/disconnected"
Feb 28 16:29:14 gx01 kernel: [103782.107651] nvidia-nvswitch3: release (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.107656] nvidia-nvswitch1: release (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.107658] nvidia-nvswitch2: release (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.107661] nvidia-nvswitch0: release (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.107663] nvidia-nvswitch4: release (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.107665] nvidia-nvswitch5: release (major=510)
Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'.
Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Main process exited, code=killed, status=11/SEGV
Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Failed with result 'signal'.
Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Scheduled restart job, restart counter is at 1.
Feb 28 16:29:14 gx01 systemd[1]: Stopped NVIDIA DCGM service.
Feb 28 16:29:14 gx01 systemd[1]: Started NVIDIA DCGM service.
Feb 28 16:29:14 gx01 kernel: [103782.832440] nvidia-nvswitch5: open (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.875110] nvidia-nvswitch4: open (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.918412] nvidia-nvswitch0: open (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.961611] nvidia-nvswitch2: open (major=510)
Feb 28 16:29:15 gx01 kernel: [103783.004035] nvidia-nvswitch1: open (major=510)
Feb 28 16:29:15 gx01 kernel: [103783.046633] nvidia-nvswitch3: open (major=510)
Feb 28 16:29:15 gx01 nv-hostengine: DCGM initialized
Feb 28 16:29:15 gx01 nv-hostengine[1298239]: Started host engine version 3.3.5 using port number: 5555
Feb 28 16:29:24 gx01 systemd[1]: Stopped DCGM Exporter.
Feb 28 16:29:24 gx01 systemd[1]: Stopping NVIDIA DCGM service...
Feb 28 16:29:24 gx01 kernel: [103792.219790] nvidia-nvswitch3: release (major=510)
Feb 28 16:29:24 gx01 kernel: [103792.219989] nvidia-nvswitch1: release (major=510)
Feb 28 16:29:24 gx01 kernel: [103792.220182] nvidia-nvswitch2: release (major=510)
Feb 28 16:29:24 gx01 kernel: [103792.220374] nvidia-nvswitch0: release (major=510)
Feb 28 16:29:24 gx01 kernel: [103792.220575] nvidia-nvswitch4: release (major=510)
Feb 28 16:29:24 gx01 kernel: [103792.220761] nvidia-nvswitch5: release (major=510)
Feb 28 16:29:24 gx01 systemd[1]: nvidia-dcgm.service: Succeeded.
Feb 28 16:29:24 gx01 systemd[1]: Stopped NVIDIA DCGM service.
Feb 28 16:29:32 gx01 slurmd[27067]: slurmd: launch task StepId=838896.12 request from UID:12211 GID:5101 HOST:172.20.26.64 PORT:39002
Feb 28 16:29:32 gx01 kernel: [103800.729913] audit: type=1400 audit(1709134172.738:1100): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/27067/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:29:32 gx01 slurmd[27067]: slurmd: task/affinity: lllp_distribution: JobId=838896 implicit auto binding: cores, dist 1
Feb 28 16:29:32 gx01 slurmd[27067]: slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
Feb 28 16:29:32 gx01 slurmd[27067]: slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [838896]: mask_cpu, 0x0000000000000001000000000000000000000000000000010000000000000000
Feb 28 16:29:54 gx01 systemd[1]: tmp-.esp_tmp-nvme1n1p1.mount: Succeeded.
Feb 28 16:29:54 gx01 systemd[1]: tmp-.esp_tmp-nvme2n1p1.mount: Succeeded.
Feb 28 16:30:55 gx01 kernel: [103883.096874] audit: type=1400 audit(1709134255.111:1101): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1299484/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:31:25 gx01 kernel: [103913.358790] audit: type=1400 audit(1709134285.372:1102): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1299958/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:32:28 gx01 kernel: [103976.699653] audit: type=1400 audit(1709134348.713:1103): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/28026/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:34:54 gx01 systemd[1]: tmp-.esp_tmp-nvme1n1p1.mount: Succeeded.
Feb 28 16:34:54 gx01 systemd[1]: tmp-.esp_tmp-nvme2n1p1.mount: Succeeded.
Feb 28 16:35:01 gx01 CRON[1302542]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 28 16:35:01 gx01 kernel: [104129.968910] audit: type=1400 audit(1709134501.985:1104): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1302541/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:35:01 gx01 kernel: [104129.970106] audit: type=1400 audit(1709134501.985:1105): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1302542/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:35:57 gx01 kernel: [104185.572325] audit: type=1400 audit(1709134557.590:1106): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1303272/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:36:27 gx01 kernel: [104215.823679] audit: type=1400 audit(1709134587.842:1107): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1303523/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:39:54 gx01 systemd[1]: tmp-.esp_tmp-nvme1n1p1.mount: Succeeded.
Feb 28 16:39:54 gx01 systemd[1]: tmp-.esp_tmp-nvme2n1p1.mount: Succeeded.
root@gx01:/var/log#
Reproducible? For me, every time. Core dump:
coredumpctl
# coredumpctl dump nv-hostengine >nv-hostengine.core
PID: 3124092 (nv-hostengine)
UID: 0 (root)
GID: 0 (root)
Signal: 11 (SEGV)
Timestamp: Fri 2024-03-01 10:07:47 CET (59s ago)
Command Line: /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
Executable: /usr/bin/nv-hostengine
Control Group: /system.slice/nvidia-dcgm.service
Unit: nvidia-dcgm.service
Slice: system.slice
Boot ID: 35a3b73c95c04716880c91638ed46a93
Machine ID: 5b05d12040d24d8e9c8d38117ab12eba
Hostname: gx01
Storage: /var/lib/systemd/coredump/core.nv-hostengine.0.35a3b73c95c04716880c91638ed46a93.3124092.1709284067000000000000.lz4
Message: Process 3124092 (nv-hostengine) of user 0 dumped core.
Stack trace of thread 3124104:
#0 0x00007fc97150a4b2 n/a (libdcgmmodulenvswitch.so.3 + 0x594b2)
#1 0x00007fc9711c5456 n/a (libnvidia-nscq.so.2 + 0x9d456)
#2 0x00007fc9711760a3 n/a (libnvidia-nscq.so.2 + 0x4e0a3)
#3 0x00007fc9711756bf n/a (libnvidia-nscq.so.2 + 0x4d6bf)
#4 0x00007fc97117eb80 nscq_session_path_observe (libnvidia-nscq.so.2 + 0x56b80)
#5 0x00007fc9715530e7 n/a (libdcgmmodulenvswitch.so.3 + 0xa20e7)
#6 0x00007fc97152038f n/a (libdcgmmodulenvswitch.so.3 + 0x6f38f)
#7 0x00007fc9714eb19d n/a (libdcgmmodulenvswitch.so.3 + 0x3a19d)
#8 0x00007fc9714d6e9f n/a (libdcgmmodulenvswitch.so.3 + 0x25e9f)
#9 0x00007fc9714d7834 n/a (libdcgmmodulenvswitch.so.3 + 0x26834)
#10 0x00007fc9714dafd8 n/a (libdcgmmodulenvswitch.so.3 + 0x29fd8)
#11 0x00007fc9714dc4a4 n/a (libdcgmmodulenvswitch.so.3 + 0x2b4a4)
#12 0x00007fc9714e43b6 n/a (libdcgmmodulenvswitch.so.3 + 0x333b6)
#13 0x00007fc9714dabb1 n/a (libdcgmmodulenvswitch.so.3 + 0x29bb1)
#14 0x00007fc971561e5b n/a (libdcgmmodulenvswitch.so.3 + 0xb0e5b)
#15 0x00007fc9715623a9 n/a (libdcgmmodulenvswitch.so.3 + 0xb13a9)
#16 0x00007fc97417a609 start_thread (libpthread.so.0 + 0x8609)
#17 0x00007fc973f2f353 __clone (libc.so.6 + 0x11f353)
Stack trace of thread 3124101:
#0 0x00007fc973f2895d syscall (libc.so.6 + 0x11895d)
#1 0x00007fc971598791 n/a (libdcgmmodulenvswitch.so.3 + 0xe7791)
#2 0x00007fc9714d8a74 n/a (libdcgmmodulenvswitch.so.3 + 0x27a74)
#3 0x00007fc97152a408 n/a (libdcgmmodulenvswitch.so.3 + 0x79408)
#4 0x00007fc974201343 n/a (libdcgm.so.3 + 0x6c343)
#5 0x00007fc9742c742e n/a (libdcgm.so.3 + 0x13242e)
#6 0x00007fc974326025 n/a (libdcgm.so.3 + 0x191025)
#7 0x00007fc974334e9c n/a (libdcgm.so.3 + 0x19fe9c)
#8 0x00007fc974321438 n/a (libdcgm.so.3 + 0x18c438)
#9 0x00007fc9742c9aa7 n/a (libdcgm.so.3 + 0x134aa7)
#10 0x00007fc9742c9d2d n/a (libdcgm.so.3 + 0x134d2d)
#11 0x00007fc9742c9f1f n/a (libdcgm.so.3 + 0x134f1f)
#12 0x00007fc9742a71fe n/a (libdcgm.so.3 + 0x1121fe)
#13 0x00007fc974348871 n/a (libdcgm.so.3 + 0x1b3871)
#14 0x00007fc974352ad8 n/a (libdcgm.so.3 + 0x1bdad8)
#15 0x00007fc9743530a4 n/a (libdcgm.so.3 + 0x1be0a4)
#16 0x00007fc974231de6 n/a (libdcgm.so.3 + 0x9cde6)
#17 0x00007fc974356192 n/a (libdcgm.so.3 + 0x1c1192)
#18 0x00007fc9744111c8 n/a (libdcgm.so.3 + 0x27c1c8)
#19 0x00007fc97417a609 start_thread (libpthread.so.0 + 0x8609)
#20 0x00007fc973f2f353 __clone (libc.so.6 + 0x11f353)
Stack trace of thread 3124092:
#0 0x00007fc973eed23f clock_nanosleep (libc.so.6 + 0xdd23f)
#1 0x00007fc973ef2ec7 __nanosleep (libc.so.6 + 0xe2ec7)
#2 0x000000000040736b n/a (nv-hostengine + 0x736b)
#3 0x00007fc973e34083 __libc_start_main (libc.so.6 + 0x24083)
#4 0x00000000004079bc n/a (nv-hostengine + 0x79bc)
core.nv-hostengine.0.35a3b73c95c04716880c91638ed46a93.3124092.1709284067000000000000.lz4.zip
Thank you for the dumps, I am currently looking into it. Will share my findings here.
I narrowed the search down to one of the NSCQ observe callbacks in: https://github.com/NVIDIA/DCGM/blob/master/modules/nvswitch/DcgmNvSwitchManager.cpp#L851C45-L851C53
To understand more, I would like to request debug-level logs from when the crash happens. Here's how to do it:
Make sure the nvidia-dcgm service (nv-hostengine) is running, then execute `dcgmi set --logging-severity DEBUG`; that will set the nv-hostengine logging level to DEBUG.
Next, reproduce the crash and share /var/nv-hostengine.log (feel free to clear it beforehand if needed).
Hi, here's the log nv-hostengine.log
I do not see the log point `log_debug("Attaching to NvSwitches");` being hit…
Thank you for the logs. Indeed, it crashed in another NSCQ observe callback: https://github.com/NVIDIA/DCGM/blob/master/modules/nvswitch/FieldDefinitions.cpp#L164
I am currently looking into the chain of events that led to this; I will reply once I have more information.
Unfortunately, we aren't able to reproduce this issue internally. However, we've added better debugging to help diagnose such issues in the future, and at some point it will be merged to GitHub.
Is there any way I can debug that? Like stepping through with a debugger?
Yes, basically you will have to build a debug DCGM with symbols. Then you will be able to use GDB, step through the code, inspect variables, etc. Put a breakpoint here: https://github.com/NVIDIA/DCGM/blob/master/modules/nvswitch/FieldDefinitions.cpp#L164 Just want to mention that this is pretty advanced: it involves using the ./build.sh script and the dcgmbuild Docker container (our build is containerized), and running nv-hostengine locally.
So here we are.
Debugging around this:
auto cb = [](const indexTypes... indicies,
nscq_rc_t rc,
TempData<nscqFieldType, storageType, is_vector, indexTypes...>::cbType in,
NscqDataCollector<TempData<nscqFieldType, storageType, is_vector, indexTypes...>> *dest) {
if (dest == nullptr)
{
log_error("NSCQ passed dest = nullptr");
return;
}
dest->callCounter++;
if (NSCQ_ERROR(rc))
{
log_error("NSCQ {} passed error {}", dest->nscqPath, (int)rc);
TempData<nscqFieldType, storageType, is_vector, indexTypes...> item;
item.CollectFunc(dest, indicies...);
return;
}
TempData<nscqFieldType, storageType, is_vector, indexTypes...> item; /* BREAKPOINT HERE */
item.CollectFunc(dest, in, indicies...);
};
shows:
Normal behavior for stuff like temperatures or throughput:
gdb debug output for `*dest`: normal stuff
Thread 5 "nv-hostengine" hit Breakpoint 2, DcgmNs::DcgmNvSwitchManager::UpdateFields<nscq_link_throughput_t, DcgmNs::FieldIdStorageType<(unsigned short)862>, false, nscq_uuid_t*>(unsigned short, DcgmFvBuffer&, std::vector<dcgm_field_update_info_t, std::allocator<dcgm_field_update_info_t> > const&, long)::{lambda(nscq_uuid_t*, signed char, nscq_link_throughput_t, DcgmNs::NscqDataCollector<DcgmNs::TempData<nscq_link_throughput_t, DcgmNs::FieldIdStorageType<(unsigned short)862>, false, nscq_uuid_t*> >*)#1}::operator()(nscq_uuid_t*, signed char, nscq_link_throughput_t, DcgmNs::NscqDataCollector<DcgmNs::TempData<nscq_link_throughput_t, DcgmNs::FieldIdStorageType<(unsigned short)862>, false, nscq_uuid_t*> >*) const (__closure=0x0, indicies#0=0x564fc0, rc=0 '\000', in=..., dest=0x7ffff4afb290) at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:162
162 TempData<nscqFieldType, storageType, is_vector, indexTypes...> item;
$64 = {
callCounter = 6,
fieldId = 862,
nscqPath = 0x7ffff4fedcc0 <nscq_nvswitch_nvlink_throughput_counters> "/{nvswitch}/nvlink/throughput_counters",
data = std::vector of length 5, capacity 8 = {{
index = std::tuple containing = {
[1] = 0x564ef0
},
data = {
<DcgmNs::NvSwitch::Data::Uint64Data> = {
value = 0
},
members of DcgmNs::FieldIdStorageType<862>:
static fieldId = 862
}
}, {
index = std::tuple containing = {
[1] = 0x564d50
},
data = {
<DcgmNs::NvSwitch::Data::Uint64Data> = {
value = 0
},
members of DcgmNs::FieldIdStorageType<862>:
static fieldId = 862
}
}, {
index = std::tuple containing = {
[1] = 0x564e20
},
data = {
<DcgmNs::NvSwitch::Data::Uint64Data> = {
value = 0
},
members of DcgmNs::FieldIdStorageType<862>:
static fieldId = 862
}
}, {
index = std::tuple containing = {
[1] = 0x53df00
},
data = {
<DcgmNs::NvSwitch::Data::Uint64Data> = {
value = 0
},
members of DcgmNs::FieldIdStorageType<862>:
static fieldId = 862
}
}, {
index = std::tuple containing = {
[1] = 0x565090
},
data = {
<DcgmNs::NvSwitch::Data::Uint64Data> = {
value = 0
},
members of DcgmNs::FieldIdStorageType<862>:
static fieldId = 862
}
}}
}
This is more or less expected.
It seems something breaks for "physical id":
- We see in the backtrace that "/{nvswitch}/id/phys_id" is requested
gdb bt at that point for `phys_id`
(gdb) bt
#0 DcgmNs::DcgmNvSwitchManager::UpdateFields<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char>(unsigned short, DcgmFvBuffer&, std::vector<dcgm_field_update_info_t, std::allocator<dcgm_field_update_info_t> > const&, long)::{lambda(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*)#1}::operator()(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*) const (__closure=0x0, indicies#0=0x564ef0, indicies#1=0 '\000', rc=11 '\v', in=140737298543232, dest=0x716a20) at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:162
#1 0x00007ffff4eeb858 in DcgmNs::DcgmNvSwitchManager::UpdateFields<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char>(unsigned short, DcgmFvBuffer&, std::vector<dcgm_field_update_info_t, std::allocator<dcgm_field_update_info_t> > const&, long)::{lambda(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*)#1}::_FUN(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*) () at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:138
#2 0x00007ffff4b9b456 in ?? () from /lib/x86_64-linux-gnu/libnvidia-nscq.so.2
#3 0x00007ffff4b4c0a3 in ?? () from /lib/x86_64-linux-gnu/libnvidia-nscq.so.2
#4 0x00007ffff4b4b6bf in ?? () from /lib/x86_64-linux-gnu/libnvidia-nscq.so.2
#5 0x00007ffff4b54b80 in nscq_session_path_observe () from /lib/x86_64-linux-gnu/libnvidia-nscq.so.2
#6 0x00007ffff4f636ca in nscq_session_path_observe (session=0x7681b0, path=0x7ffff4fed8d0 <nscq_nvswitch_phys_id> "/{nvswitch}/id/phys_id", callback=0x7ffff4eeb813 <DcgmNs::DcgmNvSwitchManager::UpdateFields<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char>(unsigned short, DcgmFvBuffer&, std::vector<dcgm_field_update_info_t, std::allocator<dcgm_field_update_info_t> > const&, long)::{lambda(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*)#1}::_FUN(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*)>, data=0x7ffff4afb280, flags=0) at /srv/DCGM/sdk/nvidia/nscq/dlwrap/dlwrap.c:131
#7 0x00007ffff4eeb98d in DcgmNs::DcgmNvSwitchManager::UpdateFields<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> (this=0x53cb10, fieldId=863, buf=..., entities=std::vector of length 1, capacity 1 = {...}, now=1712069504430580) at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:167
#8 0x00007ffff4ec28a5 in DcgmNs::DcgmNvSwitchManager::UpdateFields (this=0x53cb10, nextUpdateTime=@0x7ffff4afb528: 1712069518487312) at /srv/DCGM/modules/nvswitch/DcgmNvSwitchManager.cpp:592
#9 0x00007ffff4ea9b6a in DcgmNs::DcgmModuleNvSwitch::RunOnce (this=0x53c970) at /srv/DCGM/modules/nvswitch/DcgmModuleNvSwitch.cpp:400
#10 0x00007ffff4ea9d6d in DcgmNs::DcgmModuleNvSwitch::TryRunOnce (this=0x53c970, forceRun=true) at /srv/DCGM/modules/nvswitch/DcgmModuleNvSwitch.cpp:419
#11 0x00007ffff4ea8428 in operator() (__closure=0x7fffd4036cf0) at /srv/DCGM/modules/nvswitch/DcgmModuleNvSwitch.cpp:273
#12 0x00007ffff4eaadae in std::__invoke_impl<void, DcgmNs::DcgmModuleNvSwitch::ProcessMessageFromTaskRunner(dcgm_module_command_header_t*)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:61
#13 0x00007ffff4eaabdb in std::__invoke_r<void, DcgmNs::DcgmModuleNvSwitch::ProcessMessageFromTaskRunner(dcgm_module_command_header_t*)::<lambda()>&>(struct {...} &) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:111
#14 0x00007ffff4eaa937 in std::_Function_handler<void(), DcgmNs::DcgmModuleNvSwitch::ProcessMessageFromTaskRunner(dcgm_module_command_header_t*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/std_function.h:291
#15 0x00007ffff4ebb3f4 in std::function<void ()>::operator()() const (this=0x7fffd4036cf0) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/std_function.h:560
#16 0x00007ffff4eb92ad in std::__invoke_impl<void, std::function<void ()> const&>(std::__invoke_other, std::function<void ()> const&) (__f=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:61
#17 0x00007ffff4eb646b in std::__invoke<std::function<void ()> const&>(std::function<void ()> const&) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:96
#18 0x00007ffff4eb1267 in std::invoke<std::function<void ()> const&>(std::function<void ()> const&) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/functional:97
#19 0x00007ffff4eada36 in DcgmNs::Task<void>::Task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void ()>)::{lambda()#1}::operator()() const (__closure=0x7fffd4036cf0) at /srv/DCGM/common/Task.hpp:215
#20 0x00007ffff4ebb46a in std::__invoke_impl<int, DcgmNs::Task<void>::Task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void ()>)::{lambda()#1}&>(std::__invoke_other, DcgmNs::Task<void>::Task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void ()>)::{lambda()#1}&) (__f=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:61
#21 0x00007ffff4eb9474 in std::__invoke_r<std::optional<int>, DcgmNs::Task<void>::Task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void ()>)::{lambda()#1}&>(std::optional<int>&&, (DcgmNs::Task<void>::Task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void ()>)::{lambda()#1}&)...) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:114
#22 0x00007ffff4eb6538 in std::_Function_handler<std::optional<int> (), DcgmNs::Task<void>::Task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/std_function.h:291
#23 0x00007ffff4ec08da in std::function<std::optional<int> ()>::operator()() const (this=0x76bdf0) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/std_function.h:560
#24 0x00007ffff4ec0522 in std::__invoke_impl<std::optional<int>, std::function<std::optional<int> ()>&>(std::__invoke_other, std::function<std::optional<int> ()>&) (__f=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:61
#25 0x00007ffff4ec028e in std::__invoke<std::function<std::optional<int> ()>&>(std::function<std::optional<int> ()>&) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:96
#26 0x00007ffff4ebffd0 in std::invoke<std::function<std::optional<int> ()>&>(std::function<std::optional<int> ()>&) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/functional:97
#27 0x00007ffff4ebfc4f in DcgmNs::NamedBasicTask<int, void>::Run (this=0x76bde0) at /srv/DCGM/common/Task.hpp:155
#28 0x00007ffff4eaeca9 in DcgmNs::TaskRunner::Run (this=0x53ca58, oneIteration=true) at /srv/DCGM/common/TaskRunner.hpp:432
#29 0x00007ffff4ea9e2c in DcgmNs::DcgmModuleNvSwitch::run (this=0x53c970) at /srv/DCGM/modules/nvswitch/DcgmModuleNvSwitch.cpp:433
#30 0x00007ffff4f6bba4 in DcgmThread::RunInternal (this=0x53c9b8) at /srv/DCGM/common/DcgmThread/DcgmThread.cpp:308
#31 0x00007ffff4f6a7c5 in dcgmthread_starter (parm=0x53c9b8) at /srv/DCGM/common/DcgmThread/DcgmThread.cpp:34
#32 0x00007ffff7bfa609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#33 0x00007ffff79af353 in clone () from /lib/x86_64-linux-gnu/libc.so.6
- From that we would expect `fieldId` to be 863, but it is probably garbage: "32767"
gdb debug output for `*dest`: strange stuff
Thread 5 "nv-hostengine" hit Breakpoint 2, DcgmNs::DcgmNvSwitchManager::UpdateFields<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char>(unsigned short, DcgmFvBuffer&, std::vector<dcgm_field_update_info_t, std::allocator<dcgm_field_update_info_t> > const&, long)::{lambda(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*)#1}::operator()(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*) const (__closure=0x0, indicies#0=0x564ef0, indicies#1=0 '\000', rc=11 '\v', in=140737298543232, dest=0x716a20) at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:162
162 TempData<nscqFieldType, storageType, is_vector, indexTypes...> item;
$65 = {
callCounter = 4108821041,
fieldId = 32767,
nscqPath = 0x712960 "SWX-F8F7054E-5993-EB8D-786D-B59D5303DB16",
data = std::vector of length 0, capacity -1
}
The `callCounter` looks goofy, too. Most importantly, the `nscqPath` is not the expected "/{nvswitch}/id/phys_id" but rather the value?
-=-=-=-
It seems that there's something wrong in my /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.2, because it looks like the library is just calling this with broken info.
Lib info:
apt policy libnvidia-nscq-535
libnvidia-nscq-535:
Installed: 535.154.05-0ubuntu0.20.04.1
Candidate: 535.161.07-0ubuntu0.20.04.1
Version table:
535.161.08-1 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
535.161.07-1 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
535.161.07-0ubuntu0.20.04.1 600
500 http://de.archive.ubuntu.com/ubuntu focal-updates/multiverse amd64 Packages
500 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 Packages
535.154.05-1 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
*** 535.154.05-0ubuntu0.20.04.1 100
100 /var/lib/dpkg/status
535.129.03-1 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
535.104.12-1 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
535.104.05-1 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
535.86.10-1 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
535.54.03-1 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
So the problem is probably not DCGM but rather the NSCQ library?
Hi, thank you for the details! Let me process this information and get back to you.
@krono, I apologize for the long wait. We've managed to reproduce the issue on our side. While our call stack is different, the source of the problem is very likely the same, and your observation of a std::vector<> containing garbage supports it. I believe that the fix we're working on will resolve it.
thanks :)
Hi @superg , any news or any place I can read up on the issue here?
Hi @krono, we have an internal tracking ticket for this issue and an assigned developer; this is still a work in progress.
The issue is with the callback signature (after all the template instantiations). We use:
void callback(const nscq_uuid_t* device, nscq_rc_t rc, std::vector<nscq_error_t>, void *data)
whereas NSCQ expects:
void callback(const nscq_uuid_t* device, nscq_rc_t rc, const nscq_error_t error, void* data)
for a given path type.
Callback code has to be rewritten for the second signature.
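To make the mismatch concrete, here is a minimal, self-contained C++ sketch of what "rewritten for the second signature" could look like. This is not DCGM or NSCQ code: FakeError, Collector, ObserveCb and fake_observe are hypothetical stand-ins. The callback takes the error as a single value, matching the second signature above, and accumulates results in the user-data collector, much like the lambda quoted earlier does.
#include <cstdint>
#include <cstdio>
#include <vector>

struct FakeError { std::int8_t code; };   // stand-in for nscq_error_t (layout assumed for illustration)

struct Collector                           // stand-in for NscqDataCollector<TempData<...>>
{
    unsigned callCounter = 0;
    const char *nscqPath = "/{nvswitch}/id/phys_id";
    std::vector<FakeError> data;
};

// Signature the library invokes for this path type: one error value per call.
using ObserveCb = void (*)(const char *device, signed char rc, FakeError error, void *user);

// Rewritten callback matching that signature; it appends into the collector itself.
void observe_cb(const char *device, signed char rc, FakeError error, void *user)
{
    auto *dest = static_cast<Collector *>(user);
    if (dest == nullptr)
    {
        std::fprintf(stderr, "null dest\n");   // mirrors the nullptr check in the quoted lambda
        return;
    }
    dest->callCounter++;
    if (rc < 0)                                // the real code uses the NSCQ_ERROR(rc) macro here
    {
        std::fprintf(stderr, "%s: error %d\n", dest->nscqPath, rc);
        return;
    }
    (void)device;                              // a real callback would record the device index too
    dest->data.push_back(error);
}

// Stand-in for nscq_session_path_observe(): invokes the callback once per device.
void fake_observe(ObserveCb cb, void *user)
{
    cb("SWX-AAAA", 0, FakeError{7}, user);
    cb("SWX-BBBB", 0, FakeError{9}, user);
}

int main()
{
    Collector dest;
    fake_observe(&observe_cb, &dest);
    std::printf("calls=%u items=%zu\n", dest.callCounter, dest.data.size());
    return 0;
}
Compiled with any recent C++ compiler this prints calls=2 items=2; the point is only that the callback's parameter list has to match what the library actually passes, otherwise the arguments arrive scrambled, which fits the garbage *dest seen in the GDB dump above.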
oh my.
Which component will need the update? DCGM or NSCQ?
That's in DCGM.
Thanks! I'll keep watching this space
I now have a second machine that fell victim to this problem: an HGX-based system, similarly configured.
Hey, any news?
@krono, the issue is identified and we are working on a fix. The current ETA is August.
@superg Is this included in #189 or #180?
To answer my own question: NO.
@superg any news?
Hi @krono, I'm sorry, I moved to another project some time ago; I will enquire.