dcgm-exporter crashes hostengine.

krono opened this issue 11 months ago • 25 comments

Running a 3.3.5-3.4.0 exporter against a 3.3.5 host engine, as shipped via the NVIDIA Ubuntu repos, SEGFAULTs the host engine.

Is there something I can do? Should this be reported to the exporter instead?

Logs:

dmesg crash info
[Feb28 16:22] nvidia-nvswitch5: open (major=510)
[  +0,042810] nvidia-nvswitch4: open (major=510)
[  +0,042606] nvidia-nvswitch0: open (major=510)
[  +0,042409] nvidia-nvswitch2: open (major=510)
[  +0,042448] nvidia-nvswitch1: open (major=510)
[  +0,042372] nvidia-nvswitch3: open (major=510)
[Feb28 16:29] nv-hostengine[1280071]: segfault at 28 ip 00007f09f65c74b2 sp 00007f09f61e2ba0 error 6 in libdcgmmodulenvswitch.so.3.3.5[7f09f658c000+f8000]
[  +0,000008] Code: 7d b8 44 88 6d b0 e8 7d 0a ff ff 48 8b 45 a8 48 8b 73 18 48 89 45 c0 48 3b 73 20 0f 84 df 00 00 00 66 0f 6f 45 b0 48 83 c6 18 <0f> 11 46 e8 48 8b 45 c0 48 89 46 f8 48 89 73 18 48 8d 65 d8 5b 41
[  +0,155916] nvidia-nvswitch3: release (major=510)
[  +0,000005] nvidia-nvswitch1: release (major=510)
[  +0,000002] nvidia-nvswitch2: release (major=510)
[  +0,000003] nvidia-nvswitch0: release (major=510)
[  +0,000002] nvidia-nvswitch4: release (major=510)
[  +0,000002] nvidia-nvswitch5: release (major=510)
journal for exporter and hostengine
Feb 28 16:21:57 gx01 systemd[1]: Started NVIDIA DCGM service.
Feb 28 16:21:58 gx01 nv-hostengine[1280055]: DCGM initialized
Feb 28 16:21:58 gx01 nv-hostengine[1280055]: Started host engine version 3.3.5 using port number: 5555
Feb 28 16:29:08 gx01 systemd[1]: Started DCGM Exporter.
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Starting dcgm-exporter"
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="DCGM successfully initialized!"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Collecting DCP Metrics"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Falling back to metric file '/net/mgmtdelab/pool/html/dcgm/current/counters.csv'"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Initializing system entities of type: GPU"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvSwitch"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvLink"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU Core"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU Core metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="can not destroy group" error="Error destroying group: Host engine connection invalid/disconnected" groupID="{21}"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="Cannot destroy field group." error="Host engine connection invalid/disconnected"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=fatal msg="Failed to watch metrics: Error watching fields: Host engine connection invalid/disconnected"
Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'.
Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Main process exited, code=killed, status=11/SEGV
Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Failed with result 'signal'.
Versions
# dcgm-exporter -v --debug
DCGM Exporter version 3.3.5-3.4.0
# dcgmi -v
Version : 3.3.5
Build ID : 14
Build Date : 2024-02-24
Build Type : Release
Commit ID : 93088b0e1286c6e7723af1930251298870e26c19
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 08a0d9624b562a1342bf5f8828939294
apt-cache policy datacenter-gpu-manager
# apt-cache policy datacenter-gpu-manager
datacenter-gpu-manager:
  Installed: 1:3.3.5
  Candidate: 1:3.3.5
  Version table:
 *** 1:3.3.5 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        100 /var/lib/dpkg/status
     1:3.3.3 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:3.3.1 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:3.3.0 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:3.2.6 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:3.2.5 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:3.2.3 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:3.1.8 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:3.1.7 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:3.1.6 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:3.1.3 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:3.0.4 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:2.4.8 600
        600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
     1:2.4.7 600
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
     1:2.4.6 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:2.4.5 600
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
     1:2.3.6 600
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
     1:2.3.5 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:2.3.4 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:2.3.2 600
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
     1:2.3.1 600
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
     1:2.2.9 600
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
     1:2.2.8 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:2.2.3 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:2.1.8 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:2.1.7 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:2.1.4 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:2.0.15 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:2.0.14 600
        600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
     1:2.0.13 600
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal/common amd64 Packages
OS info
# cat /etc/dgx-release
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2020-10-26-11-53-11"
DGX_SWBUILD_VERSION="5.0.0"
DGX_COMMIT_ID="7501dff"
DGX_PLATFORM="DGX Server for DGX A100"
DGX_SERIAL_NUMBER="XXXXXXXXXXXX"

DGX_OTA_VERSION="5.0.5"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"

DGX_OTA_VERSION="5.1.1"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"

DGX_OTA_VERSION="5.2.0"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"

DGX_OTA_VERSION="5.3.1"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"

DGX_OTA_VERSION="5.5.1"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"

krono avatar Feb 28 '24 15:02 krono

Hi @krono, thank you for the report. Is the issue easily reproducible? Would it be possible to get an nv-hostengine core dump?

EDIT: follow-up question: do you get any syslog kernel error messages for NVLink in the 16:21-16:29 timeframe?

superg avatar Feb 28 '24 16:02 superg

Hi @superg (somehow I don't get gh mails anymore, sorry)

kernel syslog messages in timeframe
root@gx01:/var/log# grep '^Feb 28 16:[23]' syslog.1
Feb 28 16:20:50 gx01 kernel: [103278.185814] audit: type=1400 audit(1709133650.187:1080): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1278914/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:21:20 gx01 kernel: [103308.432689] audit: type=1400 audit(1709133680.436:1081): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1279373/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:21:30 gx01 kernel: [103318.719235] audit: type=1400 audit(1709133690.720:1082): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1279512/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:21:31 gx01 kernel: [103319.282115] audit: type=1400 audit(1709133691.284:1083): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1279555/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:21:31 gx01 kernel: [103319.491742] audit: type=1400 audit(1709133691.492:1084): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1279656/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:21:57 gx01 systemd[1]: Started NVIDIA DCGM service.
Feb 28 16:21:57 gx01 kernel: [103345.680201] nvidia-nvswitch5: open (major=510)
Feb 28 16:21:57 gx01 kernel: [103345.723011] nvidia-nvswitch4: open (major=510)
Feb 28 16:21:57 gx01 kernel: [103345.765617] nvidia-nvswitch0: open (major=510)
Feb 28 16:21:57 gx01 kernel: [103345.808026] nvidia-nvswitch2: open (major=510)
Feb 28 16:21:57 gx01 kernel: [103345.850474] nvidia-nvswitch1: open (major=510)
Feb 28 16:21:57 gx01 kernel: [103345.892846] nvidia-nvswitch3: open (major=510)
Feb 28 16:21:58 gx01 nv-hostengine: DCGM initialized
Feb 28 16:21:58 gx01 nv-hostengine[1280055]: Started host engine version 3.3.5 using port number: 5555
Feb 28 16:22:03 gx01 kernel: [103351.124025] audit: type=1400 audit(1709133723.125:1085): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1280110/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:22:32 gx01 systemd[1]: Started DCGM Exporter.
Feb 28 16:22:32 gx01 kernel: [103380.216261] audit: type=1400 audit(1709133752.217:1086): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/28026/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:22:32 gx01 dcgm-exporter[1280577]: /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter)
Feb 28 16:22:32 gx01 dcgm-exporter[1280577]: /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter)
Feb 28 16:22:32 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 16:22:32 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'.
Feb 28 16:23:02 gx01 systemd[1]: dcgm-exporter.service: Scheduled restart job, restart counter is at 3.
Feb 28 16:23:02 gx01 systemd[1]: Stopped DCGM Exporter.
Feb 28 16:23:02 gx01 systemd[1]: Started DCGM Exporter.
Feb 28 16:23:02 gx01 dcgm-exporter[1280963]: /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter)
Feb 28 16:23:02 gx01 dcgm-exporter[1280963]: /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter)
Feb 28 16:23:02 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 16:23:02 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'.
Feb 28 16:23:07 gx01 systemd[1]: Stopped DCGM Exporter.
Feb 28 16:24:11 gx01 kernel: [103479.881565] audit: type=1400 audit(1709133851.887:1087): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281838/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:24:12 gx01 kernel: [103480.388663] audit: type=1400 audit(1709133852.395:1088): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281885/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:24:12 gx01 kernel: [103480.539563] audit: type=1400 audit(1709133852.543:1089): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281908/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:24:13 gx01 kernel: [103481.137739] audit: type=1400 audit(1709133853.143:1090): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281946/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:24:13 gx01 kernel: [103481.651807] audit: type=1400 audit(1709133853.655:1091): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281992/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:24:13 gx01 kernel: [103481.804767] audit: type=1400 audit(1709133853.811:1092): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1282016/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:24:54 gx01 systemd[1]: tmp-.esp_tmp-nvme1n1p1.mount: Succeeded.
Feb 28 16:24:54 gx01 systemd[1]: tmp-.esp_tmp-nvme2n1p1.mount: Succeeded.
Feb 28 16:25:01 gx01 kernel: [103529.974717] audit: type=1400 audit(1709133901.980:1093): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1282647/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:25:01 gx01 CRON[1282648]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 28 16:25:01 gx01 kernel: [103529.976017] audit: type=1400 audit(1709133901.984:1094): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1282648/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:25:52 gx01 kernel: [103580.650881] audit: type=1400 audit(1709133952.657:1095): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1284754/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:26:22 gx01 kernel: [103610.898676] audit: type=1400 audit(1709133982.906:1096): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1285172/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:28:11 gx01 kernel: [103719.026823] audit: type=1400 audit(1709134091.036:1097): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1297267/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:28:11 gx01 kernel: [103719.579399] audit: type=1400 audit(1709134091.588:1098): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1297309/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:28:11 gx01 kernel: [103719.755666] audit: type=1400 audit(1709134091.764:1099): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1297333/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:29:08 gx01 systemd[1]: Started DCGM Exporter.
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Starting dcgm-exporter"
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="DCGM successfully initialized!"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Collecting DCP Metrics"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Falling back to metric file '/net/mgmtdelab/pool/html/dcgm/current/counters.csv'"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Initializing system entities of type: GPU"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvSwitch"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvLink"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU Core"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU Core metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
Feb 28 16:29:13 gx01 kernel: [103781.951727] nv-hostengine[1280071]: segfault at 28 ip 00007f09f65c74b2 sp 00007f09f61e2ba0 error 6 in libdcgmmodulenvswitch.so.3.3.5[7f09f658c000+f8000]
Feb 28 16:29:13 gx01 kernel: [103781.951735] Code: 7d b8 44 88 6d b0 e8 7d 0a ff ff 48 8b 45 a8 48 8b 73 18 48 89 45 c0 48 3b 73 20 0f 84 df 00 00 00 66 0f 6f 45 b0 48 83 c6 18 <0f> 11 46 e8 48 8b 45 c0 48 89 46 f8 48 89 73 18 48 8d 65 d8 5b 41
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="can not destroy group" error="Error destroying group: Host engine connection invalid/disconnected" groupID="{21}"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="Cannot destroy field group." error="Host engine connection invalid/disconnected"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=fatal msg="Failed to watch metrics: Error watching fields: Host engine connection invalid/disconnected"
Feb 28 16:29:14 gx01 kernel: [103782.107651] nvidia-nvswitch3: release (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.107656] nvidia-nvswitch1: release (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.107658] nvidia-nvswitch2: release (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.107661] nvidia-nvswitch0: release (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.107663] nvidia-nvswitch4: release (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.107665] nvidia-nvswitch5: release (major=510)
Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'.
Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Main process exited, code=killed, status=11/SEGV
Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Failed with result 'signal'.
Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Scheduled restart job, restart counter is at 1.
Feb 28 16:29:14 gx01 systemd[1]: Stopped NVIDIA DCGM service.
Feb 28 16:29:14 gx01 systemd[1]: Started NVIDIA DCGM service.
Feb 28 16:29:14 gx01 kernel: [103782.832440] nvidia-nvswitch5: open (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.875110] nvidia-nvswitch4: open (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.918412] nvidia-nvswitch0: open (major=510)
Feb 28 16:29:14 gx01 kernel: [103782.961611] nvidia-nvswitch2: open (major=510)
Feb 28 16:29:15 gx01 kernel: [103783.004035] nvidia-nvswitch1: open (major=510)
Feb 28 16:29:15 gx01 kernel: [103783.046633] nvidia-nvswitch3: open (major=510)
Feb 28 16:29:15 gx01 nv-hostengine: DCGM initialized
Feb 28 16:29:15 gx01 nv-hostengine[1298239]: Started host engine version 3.3.5 using port number: 5555
Feb 28 16:29:24 gx01 systemd[1]: Stopped DCGM Exporter.
Feb 28 16:29:24 gx01 systemd[1]: Stopping NVIDIA DCGM service...
Feb 28 16:29:24 gx01 kernel: [103792.219790] nvidia-nvswitch3: release (major=510)
Feb 28 16:29:24 gx01 kernel: [103792.219989] nvidia-nvswitch1: release (major=510)
Feb 28 16:29:24 gx01 kernel: [103792.220182] nvidia-nvswitch2: release (major=510)
Feb 28 16:29:24 gx01 kernel: [103792.220374] nvidia-nvswitch0: release (major=510)
Feb 28 16:29:24 gx01 kernel: [103792.220575] nvidia-nvswitch4: release (major=510)
Feb 28 16:29:24 gx01 kernel: [103792.220761] nvidia-nvswitch5: release (major=510)
Feb 28 16:29:24 gx01 systemd[1]: nvidia-dcgm.service: Succeeded.
Feb 28 16:29:24 gx01 systemd[1]: Stopped NVIDIA DCGM service.
Feb 28 16:29:32 gx01 slurmd[27067]: slurmd: launch task StepId=838896.12 request from UID:12211 GID:5101 HOST:172.20.26.64 PORT:39002
Feb 28 16:29:32 gx01 kernel: [103800.729913] audit: type=1400 audit(1709134172.738:1100): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/27067/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:29:32 gx01 slurmd[27067]: slurmd: task/affinity: lllp_distribution: JobId=838896 implicit auto binding: cores, dist 1
Feb 28 16:29:32 gx01 slurmd[27067]: slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
Feb 28 16:29:32 gx01 slurmd[27067]: slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [838896]: mask_cpu, 0x0000000000000001000000000000000000000000000000010000000000000000
Feb 28 16:29:54 gx01 systemd[1]: tmp-.esp_tmp-nvme1n1p1.mount: Succeeded.
Feb 28 16:29:54 gx01 systemd[1]: tmp-.esp_tmp-nvme2n1p1.mount: Succeeded.
Feb 28 16:30:55 gx01 kernel: [103883.096874] audit: type=1400 audit(1709134255.111:1101): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1299484/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:31:25 gx01 kernel: [103913.358790] audit: type=1400 audit(1709134285.372:1102): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1299958/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:32:28 gx01 kernel: [103976.699653] audit: type=1400 audit(1709134348.713:1103): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/28026/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:34:54 gx01 systemd[1]: tmp-.esp_tmp-nvme1n1p1.mount: Succeeded.
Feb 28 16:34:54 gx01 systemd[1]: tmp-.esp_tmp-nvme2n1p1.mount: Succeeded.
Feb 28 16:35:01 gx01 CRON[1302542]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 28 16:35:01 gx01 kernel: [104129.968910] audit: type=1400 audit(1709134501.985:1104): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1302541/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:35:01 gx01 kernel: [104129.970106] audit: type=1400 audit(1709134501.985:1105): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1302542/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:35:57 gx01 kernel: [104185.572325] audit: type=1400 audit(1709134557.590:1106): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1303272/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:36:27 gx01 kernel: [104215.823679] audit: type=1400 audit(1709134587.842:1107): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1303523/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Feb 28 16:39:54 gx01 systemd[1]: tmp-.esp_tmp-nvme1n1p1.mount: Succeeded.
Feb 28 16:39:54 gx01 systemd[1]: tmp-.esp_tmp-nvme2n1p1.mount: Succeeded.
root@gx01:/var/log#

Reproducible? For me, every time. Coredump:

coredumpctl
# coredumpctl dump nv-hostengine >nv-hostengine.core
           PID: 3124092 (nv-hostengine)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Fri 2024-03-01 10:07:47 CET (59s ago)
  Command Line: /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
    Executable: /usr/bin/nv-hostengine
 Control Group: /system.slice/nvidia-dcgm.service
          Unit: nvidia-dcgm.service
         Slice: system.slice
       Boot ID: 35a3b73c95c04716880c91638ed46a93
    Machine ID: 5b05d12040d24d8e9c8d38117ab12eba
      Hostname: gx01
       Storage: /var/lib/systemd/coredump/core.nv-hostengine.0.35a3b73c95c04716880c91638ed46a93.3124092.1709284067000000000000.lz4
       Message: Process 3124092 (nv-hostengine) of user 0 dumped core.

                Stack trace of thread 3124104:
                #0  0x00007fc97150a4b2 n/a (libdcgmmodulenvswitch.so.3 + 0x594b2)
                #1  0x00007fc9711c5456 n/a (libnvidia-nscq.so.2 + 0x9d456)
                #2  0x00007fc9711760a3 n/a (libnvidia-nscq.so.2 + 0x4e0a3)
                #3  0x00007fc9711756bf n/a (libnvidia-nscq.so.2 + 0x4d6bf)
                #4  0x00007fc97117eb80 nscq_session_path_observe (libnvidia-nscq.so.2 + 0x56b80)
                #5  0x00007fc9715530e7 n/a (libdcgmmodulenvswitch.so.3 + 0xa20e7)
                #6  0x00007fc97152038f n/a (libdcgmmodulenvswitch.so.3 + 0x6f38f)
                #7  0x00007fc9714eb19d n/a (libdcgmmodulenvswitch.so.3 + 0x3a19d)
                #8  0x00007fc9714d6e9f n/a (libdcgmmodulenvswitch.so.3 + 0x25e9f)
                #9  0x00007fc9714d7834 n/a (libdcgmmodulenvswitch.so.3 + 0x26834)
                #10 0x00007fc9714dafd8 n/a (libdcgmmodulenvswitch.so.3 + 0x29fd8)
                #11 0x00007fc9714dc4a4 n/a (libdcgmmodulenvswitch.so.3 + 0x2b4a4)
                #12 0x00007fc9714e43b6 n/a (libdcgmmodulenvswitch.so.3 + 0x333b6)
                #13 0x00007fc9714dabb1 n/a (libdcgmmodulenvswitch.so.3 + 0x29bb1)
                #14 0x00007fc971561e5b n/a (libdcgmmodulenvswitch.so.3 + 0xb0e5b)
                #15 0x00007fc9715623a9 n/a (libdcgmmodulenvswitch.so.3 + 0xb13a9)
                #16 0x00007fc97417a609 start_thread (libpthread.so.0 + 0x8609)
                #17 0x00007fc973f2f353 __clone (libc.so.6 + 0x11f353)

                Stack trace of thread 3124101:
                #0  0x00007fc973f2895d syscall (libc.so.6 + 0x11895d)
                #1  0x00007fc971598791 n/a (libdcgmmodulenvswitch.so.3 + 0xe7791)
                #2  0x00007fc9714d8a74 n/a (libdcgmmodulenvswitch.so.3 + 0x27a74)
                #3  0x00007fc97152a408 n/a (libdcgmmodulenvswitch.so.3 + 0x79408)
                #4  0x00007fc974201343 n/a (libdcgm.so.3 + 0x6c343)
                #5  0x00007fc9742c742e n/a (libdcgm.so.3 + 0x13242e)
                #6  0x00007fc974326025 n/a (libdcgm.so.3 + 0x191025)
                #7  0x00007fc974334e9c n/a (libdcgm.so.3 + 0x19fe9c)
                #8  0x00007fc974321438 n/a (libdcgm.so.3 + 0x18c438)
                #9  0x00007fc9742c9aa7 n/a (libdcgm.so.3 + 0x134aa7)
                #10 0x00007fc9742c9d2d n/a (libdcgm.so.3 + 0x134d2d)
                #11 0x00007fc9742c9f1f n/a (libdcgm.so.3 + 0x134f1f)
                #12 0x00007fc9742a71fe n/a (libdcgm.so.3 + 0x1121fe)
                #13 0x00007fc974348871 n/a (libdcgm.so.3 + 0x1b3871)
                #14 0x00007fc974352ad8 n/a (libdcgm.so.3 + 0x1bdad8)
                #15 0x00007fc9743530a4 n/a (libdcgm.so.3 + 0x1be0a4)
                #16 0x00007fc974231de6 n/a (libdcgm.so.3 + 0x9cde6)
                #17 0x00007fc974356192 n/a (libdcgm.so.3 + 0x1c1192)
                #18 0x00007fc9744111c8 n/a (libdcgm.so.3 + 0x27c1c8)
                #19 0x00007fc97417a609 start_thread (libpthread.so.0 + 0x8609)
                #20 0x00007fc973f2f353 __clone (libc.so.6 + 0x11f353)

                Stack trace of thread 3124092:
                #0  0x00007fc973eed23f clock_nanosleep (libc.so.6 + 0xdd23f)
                #1  0x00007fc973ef2ec7 __nanosleep (libc.so.6 + 0xe2ec7)
                #2  0x000000000040736b n/a (nv-hostengine + 0x736b)
                #3  0x00007fc973e34083 __libc_start_main (libc.so.6 + 0x24083)
                #4  0x00000000004079bc n/a (nv-hostengine + 0x79bc)

core.nv-hostengine.0.35a3b73c95c04716880c91638ed46a93.3124092.1709284067000000000000.lz4.zip

krono avatar Mar 01 '24 09:03 krono

Thank you for the dumps; I am currently looking into it. I will share my findings here.

superg avatar Mar 01 '24 17:03 superg

I narrowed the search down to one of the NSCQ observe callbacks in: https://github.com/NVIDIA/DCGM/blob/master/modules/nvswitch/DcgmNvSwitchManager.cpp#L851C45-L851C53
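
For readers following along, here is a minimal sketch of what such an observe callback looks like. The parameter shapes are taken from the GDB backtraces later in this thread; the Collector type and the error-code check are reduced stand-ins, so treat this purely as illustration, not as the real DCGM code:

    // Illustration only: parameter types follow the backtraces below
    // ((nscq_uuid_t*, unsigned char, signed char, unsigned long, ...) for
    // "/{nvswitch}/id/phys_id"); Collector is an invented stand-in.
    #include <cstdint>

    using nscq_rc_t = signed char; // matches "rc=11 '\v'" in the backtraces
    struct nscq_uuid_t;            // opaque NvSwitch identity

    struct Collector
    {
        int callCounter = 0;
        unsigned long lastPhysId = 0;
    };

    // Callback NSCQ invokes once per device observed under the path.
    static void onPhysId(const nscq_uuid_t* /*device*/,
                         unsigned char /*index*/,
                         nscq_rc_t rc,
                         unsigned long physId,
                         Collector* dest)
    {
        if (dest == nullptr || rc < 0) // assumption: negative codes are errors
            return;
        dest->callCounter++;
        dest->lastPhysId = physId;
    }

    // DCGM's UpdateFields() then, in effect, registers it:
    //   nscq_session_path_observe(session, "/{nvswitch}/id/phys_id",
    //                             callback, &collector, 0);
    // and drains `collector` afterwards (cf. frames #6/#7 in the bt below).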

To understand more, I would like to request debug-level logs from when the crash happens. Here's how to do it: make sure the nvidia-dcgm service (nv-hostengine) is running, then execute dcgmi set --logging-severity DEBUG; that will set the nv-hostengine logging level to DEBUG. Next, reproduce the crash and share /var/nv-hostengine.log (feel free to clear it beforehand if needed).

superg avatar Mar 04 '24 15:03 superg

Hi, here's the log: nv-hostengine.log

krono avatar Mar 05 '24 08:03 krono

I do not see the log point log_debug("Attaching to NvSwitches"); being hit…

krono avatar Mar 05 '24 08:03 krono

Thank you for the logs. Indeed, it crashed in another NSCQ observe callback: https://github.com/NVIDIA/DCGM/blob/master/modules/nvswitch/FieldDefinitions.cpp#L164

I am currently looking into the chain of events that led to this; I will reply once I have more information.

superg avatar Mar 05 '24 15:03 superg

Unfortunately we aren't able to reproduce this issue internally. However, we've added better debugging to help diagnose such issues in the future, and at some point it will be merged to GitHub.

superg avatar Apr 01 '24 14:04 superg

Is there any way I can debug this? Like with a step-through debugger?

krono avatar Apr 01 '24 15:04 krono

Yes, basically you will have to build a debug DCGM with symbols. Then you will be able to use GDB to step through the code, inspect variables, etc. Put a breakpoint here: https://github.com/NVIDIA/DCGM/blob/master/modules/nvswitch/FieldDefinitions.cpp#L164 Just want to mention that this is pretty advanced: it involves using the ./build.sh script and the dcgmbuild Docker container (our build is containerized) and running nv-hostengine locally.

superg avatar Apr 01 '24 19:04 superg

So here we are.

Debugging around this:

    auto cb = [](const indexTypes... indicies,
                 nscq_rc_t rc,
                 TempData<nscqFieldType, storageType, is_vector, indexTypes...>::cbType in,
                 NscqDataCollector<TempData<nscqFieldType, storageType, is_vector, indexTypes...>> *dest) {
        if (dest == nullptr)
        {
            log_error("NSCQ passed dest = nullptr");

            return;
        }

        dest->callCounter++;

        if (NSCQ_ERROR(rc))
        {
            log_error("NSCQ {} passed error {}", dest->nscqPath, (int)rc);

            TempData<nscqFieldType, storageType, is_vector, indexTypes...> item;

            item.CollectFunc(dest, indicies...);

            return;
        }

        TempData<nscqFieldType, storageType, is_vector, indexTypes...> item; /* BREAKPOINT HERE */

        item.CollectFunc(dest, in, indicies...);
    };

shows:

Normal behavior for things like temperatures or throughput:

gdb debug output for `*dest`: normal stuff
Thread 5 "nv-hostengine" hit Breakpoint 2, DcgmNs::DcgmNvSwitchManager::UpdateFields<nscq_link_throughput_t, DcgmNs::FieldIdStorageType<(unsigned short)862>, false, nscq_uuid_t*>(unsigned short, DcgmFvBuffer&, std::vector<dcgm_field_update_info_t, std::allocator<dcgm_field_update_info_t> > const&, long)::{lambda(nscq_uuid_t*, signed char, nscq_link_throughput_t, DcgmNs::NscqDataCollector<DcgmNs::TempData<nscq_link_throughput_t, DcgmNs::FieldIdStorageType<(unsigned short)862>, false, nscq_uuid_t*> >*)#1}::operator()(nscq_uuid_t*, signed char, nscq_link_throughput_t, DcgmNs::NscqDataCollector<DcgmNs::TempData<nscq_link_throughput_t, DcgmNs::FieldIdStorageType<(unsigned short)862>, false, nscq_uuid_t*> >*) const (__closure=0x0, indicies#0=0x564fc0, rc=0 '\000', in=..., dest=0x7ffff4afb290) at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:162
162	        TempData<nscqFieldType, storageType, is_vector, indexTypes...> item;
$64 = {
  callCounter = 6,
  fieldId = 862,
  nscqPath = 0x7ffff4fedcc0 <nscq_nvswitch_nvlink_throughput_counters> "/{nvswitch}/nvlink/throughput_counters",
  data = std::vector of length 5, capacity 8 = {{
      index = std::tuple containing = {
        [1] = 0x564ef0
      },
      data = {
        <DcgmNs::NvSwitch::Data::Uint64Data> = {
          value = 0
        },
        members of DcgmNs::FieldIdStorageType<862>:
        static fieldId = 862
      }
    }, {
      index = std::tuple containing = {
        [1] = 0x564d50
      },
      data = {
        <DcgmNs::NvSwitch::Data::Uint64Data> = {
          value = 0
        },
        members of DcgmNs::FieldIdStorageType<862>:
        static fieldId = 862
      }
    }, {
      index = std::tuple containing = {
        [1] = 0x564e20
      },
      data = {
        <DcgmNs::NvSwitch::Data::Uint64Data> = {
          value = 0
        },
        members of DcgmNs::FieldIdStorageType<862>:
        static fieldId = 862
      }
    }, {
      index = std::tuple containing = {
        [1] = 0x53df00
      },
      data = {
        <DcgmNs::NvSwitch::Data::Uint64Data> = {
          value = 0
        },
        members of DcgmNs::FieldIdStorageType<862>:
        static fieldId = 862
      }
    }, {
      index = std::tuple containing = {
        [1] = 0x565090
      },
      data = {
        <DcgmNs::NvSwitch::Data::Uint64Data> = {
          value = 0
        },
        members of DcgmNs::FieldIdStorageType<862>:
        static fieldId = 862
      }
    }}
}

This is more or less expected.

It seems something breaks for "physical id":

  1. We see that the backtrace requests "/{nvswitch}/id/phys_id":
gdb bt at that point for `phys_id`
(gdb) bt
#0  DcgmNs::DcgmNvSwitchManager::UpdateFields<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char>(unsigned short, DcgmFvBuffer&, std::vector<dcgm_field_update_info_t, std::allocator<dcgm_field_update_info_t> > const&, long)::{lambda(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*)#1}::operator()(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*) const (__closure=0x0, indicies#0=0x564ef0, indicies#1=0 '\000', rc=11 '\v', in=140737298543232, dest=0x716a20) at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:162
#1  0x00007ffff4eeb858 in DcgmNs::DcgmNvSwitchManager::UpdateFields<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char>(unsigned short, DcgmFvBuffer&, std::vector<dcgm_field_update_info_t, std::allocator<dcgm_field_update_info_t> > const&, long)::{lambda(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*)#1}::_FUN(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*) () at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:138
#2  0x00007ffff4b9b456 in ?? () from /lib/x86_64-linux-gnu/libnvidia-nscq.so.2
#3  0x00007ffff4b4c0a3 in ?? () from /lib/x86_64-linux-gnu/libnvidia-nscq.so.2
#4  0x00007ffff4b4b6bf in ?? () from /lib/x86_64-linux-gnu/libnvidia-nscq.so.2
#5  0x00007ffff4b54b80 in nscq_session_path_observe () from /lib/x86_64-linux-gnu/libnvidia-nscq.so.2
#6  0x00007ffff4f636ca in nscq_session_path_observe (session=0x7681b0, path=0x7ffff4fed8d0 <nscq_nvswitch_phys_id> "/{nvswitch}/id/phys_id", callback=0x7ffff4eeb813 <DcgmNs::DcgmNvSwitchManager::UpdateFields<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char>(unsigned short, DcgmFvBuffer&, std::vector<dcgm_field_update_info_t, std::allocator<dcgm_field_update_info_t> > const&, long)::{lambda(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*)#1}::_FUN(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*)>, data=0x7ffff4afb280, flags=0) at /srv/DCGM/sdk/nvidia/nscq/dlwrap/dlwrap.c:131
#7  0x00007ffff4eeb98d in DcgmNs::DcgmNvSwitchManager::UpdateFields<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> (this=0x53cb10, fieldId=863, buf=..., entities=std::vector of length 1, capacity 1 = {...}, now=1712069504430580) at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:167
#8  0x00007ffff4ec28a5 in DcgmNs::DcgmNvSwitchManager::UpdateFields (this=0x53cb10, nextUpdateTime=@0x7ffff4afb528: 1712069518487312) at /srv/DCGM/modules/nvswitch/DcgmNvSwitchManager.cpp:592
#9  0x00007ffff4ea9b6a in DcgmNs::DcgmModuleNvSwitch::RunOnce (this=0x53c970) at /srv/DCGM/modules/nvswitch/DcgmModuleNvSwitch.cpp:400
#10 0x00007ffff4ea9d6d in DcgmNs::DcgmModuleNvSwitch::TryRunOnce (this=0x53c970, forceRun=true) at /srv/DCGM/modules/nvswitch/DcgmModuleNvSwitch.cpp:419
#11 0x00007ffff4ea8428 in operator() (__closure=0x7fffd4036cf0) at /srv/DCGM/modules/nvswitch/DcgmModuleNvSwitch.cpp:273
#12 0x00007ffff4eaadae in std::__invoke_impl<void, DcgmNs::DcgmModuleNvSwitch::ProcessMessageFromTaskRunner(dcgm_module_command_header_t*)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:61
#13 0x00007ffff4eaabdb in std::__invoke_r<void, DcgmNs::DcgmModuleNvSwitch::ProcessMessageFromTaskRunner(dcgm_module_command_header_t*)::<lambda()>&>(struct {...} &) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:111
#14 0x00007ffff4eaa937 in std::_Function_handler<void(), DcgmNs::DcgmModuleNvSwitch::ProcessMessageFromTaskRunner(dcgm_module_command_header_t*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/std_function.h:291
#15 0x00007ffff4ebb3f4 in std::function<void ()>::operator()() const (this=0x7fffd4036cf0) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/std_function.h:560
#16 0x00007ffff4eb92ad in std::__invoke_impl<void, std::function<void ()> const&>(std::__invoke_other, std::function<void ()> const&) (__f=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:61
#17 0x00007ffff4eb646b in std::__invoke<std::function<void ()> const&>(std::function<void ()> const&) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:96
#18 0x00007ffff4eb1267 in std::invoke<std::function<void ()> const&>(std::function<void ()> const&) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/functional:97
#19 0x00007ffff4eada36 in DcgmNs::Task<void>::Task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void ()>)::{lambda()#1}::operator()() const (__closure=0x7fffd4036cf0) at /srv/DCGM/common/Task.hpp:215
#20 0x00007ffff4ebb46a in std::__invoke_impl<int, DcgmNs::Task<void>::Task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void ()>)::{lambda()#1}&>(std::__invoke_other, DcgmNs::Task<void>::Task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void ()>)::{lambda()#1}&) (__f=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:61
#21 0x00007ffff4eb9474 in std::__invoke_r<std::optional<int>, DcgmNs::Task<void>::Task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void ()>)::{lambda()#1}&>(std::optional<int>&&, (DcgmNs::Task<void>::Task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void ()>)::{lambda()#1}&)...) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:114
#22 0x00007ffff4eb6538 in std::_Function_handler<std::optional<int> (), DcgmNs::Task<void>::Task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/std_function.h:291
#23 0x00007ffff4ec08da in std::function<std::optional<int> ()>::operator()() const (this=0x76bdf0) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/std_function.h:560
#24 0x00007ffff4ec0522 in std::__invoke_impl<std::optional<int>, std::function<std::optional<int> ()>&>(std::__invoke_other, std::function<std::optional<int> ()>&) (__f=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:61
#25 0x00007ffff4ec028e in std::__invoke<std::function<std::optional<int> ()>&>(std::function<std::optional<int> ()>&) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:96
#26 0x00007ffff4ebffd0 in std::invoke<std::function<std::optional<int> ()>&>(std::function<std::optional<int> ()>&) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/functional:97
#27 0x00007ffff4ebfc4f in DcgmNs::NamedBasicTask<int, void>::Run (this=0x76bde0) at /srv/DCGM/common/Task.hpp:155
#28 0x00007ffff4eaeca9 in DcgmNs::TaskRunner::Run (this=0x53ca58, oneIteration=true) at /srv/DCGM/common/TaskRunner.hpp:432
#29 0x00007ffff4ea9e2c in DcgmNs::DcgmModuleNvSwitch::run (this=0x53c970) at /srv/DCGM/modules/nvswitch/DcgmModuleNvSwitch.cpp:433
#30 0x00007ffff4f6bba4 in DcgmThread::RunInternal (this=0x53c9b8) at /srv/DCGM/common/DcgmThread/DcgmThread.cpp:308
#31 0x00007ffff4f6a7c5 in dcgmthread_starter (parm=0x53c9b8) at /srv/DCGM/common/DcgmThread/DcgmThread.cpp:34
#32 0x00007ffff7bfa609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#33 0x00007ffff79af353 in clone () from /lib/x86_64-linux-gnu/libc.so.6
  2. From that we would expect fieldId to be 863, but it is probably garbage: "32767"
gdb debug output for `*dest`: strange stuff
Thread 5 "nv-hostengine" hit Breakpoint 2, DcgmNs::DcgmNvSwitchManager::UpdateFields<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char>(unsigned short, DcgmFvBuffer&, std::vector<dcgm_field_update_info_t, std::allocator<dcgm_field_update_info_t> > const&, long)::{lambda(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*)#1}::operator()(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector<DcgmNs::TempData<unsigned long, DcgmNs::FieldIdStorageType<(unsigned short)863>, false, nscq_uuid_t*, unsigned char> >*) const (__closure=0x0, indicies#0=0x564ef0, indicies#1=0 '\000', rc=11 '\v', in=140737298543232, dest=0x716a20) at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:162
162	        TempData<nscqFieldType, storageType, is_vector, indexTypes...> item;
$65 = {
  callCounter = 4108821041,
  fieldId = 32767,
  nscqPath = 0x712960 "SWX-F8F7054E-5993-EB8D-786D-B59D5303DB16",
  data = std::vector of length 0, capacity -1
}

The callCounter looks goofy, too. Most importantly, the nscqPath is not the expected "/{nvswitch}/id/phys_id" but rather what looks like a value (the switch UUID label)?

-=-=-=-

It seems that there's something wrong in my /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.2, because it looks like the library is just calling this with broken info.

Lib info:

apt policy libnvidia-nscq-535
libnvidia-nscq-535:
  Installed: 535.154.05-0ubuntu0.20.04.1
  Candidate: 535.161.07-0ubuntu0.20.04.1
  Version table:
     535.161.08-1 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     535.161.07-1 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     535.161.07-0ubuntu0.20.04.1 600
        500 http://de.archive.ubuntu.com/ubuntu focal-updates/multiverse amd64 Packages
        500 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 Packages
     535.154.05-1 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
 *** 535.154.05-0ubuntu0.20.04.1 100
        100 /var/lib/dpkg/status
     535.129.03-1 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     535.104.12-1 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     535.104.05-1 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     535.86.10-1 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     535.54.03-1 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages

So the problem is probably not DCGM but rather the NSCQ library?

krono avatar Apr 02 '24 15:04 krono

Hi, thank you for the details! Let me process this information and get back to you.

superg avatar Apr 04 '24 14:04 superg

@krono, I apologize for the long wait. We've managed to reproduce the issue on our side. While our call stack is different, the source of the problem is very likely the same, and your observations of garbage in the std::vector<> support it. I believe that the fix we're working on will resolve it.

superg avatar Apr 10 '24 18:04 superg

thanks :)

krono avatar Apr 11 '24 06:04 krono

Hi @superg , any news or any place I can read up on the issue here?

krono avatar May 13 '24 09:05 krono

Hi @krono, we have an internal tracking ticket for this issue and an assigned developer; this is still work in progress.

The issue is with the callback signature (after all the template instantiations). We use void callback(const nscq_uuid_t* device, nscq_rc_t rc, std::vector<nscq_error_t>, void *data), whereas NSCQ expects void callback(const nscq_uuid_t* device, nscq_rc_t rc, const nscq_error_t error, void* data) for the given path type. The callback code has to be rewritten for the second signature.
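
To make the mismatch concrete, here is a hedged sketch (reduced stand-in types; the fields of nscq_error_t are invented purely to give it a plausible size). A std::vector is not trivially copyable, so under the x86-64 ABI the first signature receives it by hidden reference, while NSCQ passes a small trivial struct by value; everything from that parameter on, including dest, is therefore read from the wrong registers or stack slots, which fits the shifted, garbage *dest (vector of capacity -1) captured in GDB above:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Reduced stand-ins for illustration; the real NSCQ definitions differ.
    using nscq_rc_t = signed char;
    struct nscq_uuid_t;
    struct nscq_error_t // invented layout, only its rough size matters here
    {
        uint32_t error_value;
        uint64_t error_src;
    };

    // What DCGM instantiated (the crashing flavor):
    using wrong_cb = void (*)(const nscq_uuid_t*, nscq_rc_t,
                              std::vector<nscq_error_t>, void*);
    // What NSCQ actually invokes for this path type:
    using right_cb = void (*)(const nscq_uuid_t*, nscq_rc_t,
                              const nscq_error_t, void*);

    int main()
    {
        // The third parameter differs in both size and passing convention,
        // so the two function types are not interchangeable; calling one
        // through a pointer to the other is undefined behavior.
        std::printf("sizeof(nscq_error_t)              = %zu\n",
                    sizeof(nscq_error_t));
        std::printf("sizeof(std::vector<nscq_error_t>) = %zu\n",
                    sizeof(std::vector<nscq_error_t>));
        return 0;
    }

That undefined behavior would also explain why the crash site wandered between different observe callbacks earlier in this thread.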

superg avatar May 15 '24 12:05 superg

oh my.

Which component will need the update? DCGM or NSCQ?

krono avatar May 15 '24 12:05 krono

That's in DCGM.

superg avatar May 15 '24 12:05 superg

Thanks! I'll keep watching this space

krono avatar May 15 '24 12:05 krono

I now have a second machine that has fallen victim to this problem: an HGX-based system, similarly configured.

krono avatar Jun 05 '24 14:06 krono

Hey, any news?

krono avatar Jul 01 '24 06:07 krono

@krono, the issue is identified and we are working on a fix. The current ETA is August.

superg avatar Jul 01 '24 15:07 superg

@superg Is this included in #189 or #180?

krono avatar Sep 10 '24 13:09 krono

To answer my own question: NO.

krono avatar Sep 11 '24 09:09 krono

@superg any news?

krono avatar Sep 27 '24 06:09 krono

Hi @krono, I'm sorry, I moved to another project some time ago; I will enquire.

superg avatar Sep 27 '24 13:09 superg