DCGM
dcgmproftester not working with v3.1.7
Hi,
dcgmproftester11 gives me the following error with DCGM v3.1.7. This used to work with v3.1.3. Is there a known bug here?
$ dcgmproftester11 --no-dcgm-validation -t 1004 -d 10
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
2023-04-18 03:46:10.915 ERROR [2843478:2843478] Error 0 from RunTests(). Exiting. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DcgmProfTester.cpp:1170] [main]
$
Here are the debug logs:
$ dcgmproftester11 --no-dcgm-validation -t 1004 -d 10 --log-level DEBUG --log-file ./dcgm_log.txt
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
2023-04-19 01:44:54.500 ERROR [4187295:4187295] Error 0 from RunTests(). Exiting. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DcgmProfTester.cpp:1170] [main]
$
$
$ cat ./dcgm_log.txt
2023-04-19 01:44:13.157 INFO [4187295:4187295] Skipping CreateDcgmGroups() since DCGM validation is disabled. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DcgmProfTester.cpp:117] [DcgmProfTester::CreateDcgmGroups]
2023-04-19 01:44:13.204 INFO [4187295:4187295] Skipping CreateDcgmGroups() since DCGM validation is disabled. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/PhysicalGpu.cpp:1535] [DcgmNs::ProfTester::PhysicalGpu::CreateDcgmGroups]
2023-04-19 01:44:13.204 INFO [4187295:4187295] Skipping CreateDcgmGroups() since DCGM validation is disabled. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/PhysicalGpu.cpp:1535] [DcgmNs::ProfTester::PhysicalGpu::CreateDcgmGroups]
2023-04-19 01:44:13.204 INFO [4187295:4187295] Skipping WatchFields() since DCGM validation is disabled. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DcgmProfTester.cpp:193] [DcgmProfTester::WatchFields]
2023-04-19 01:44:13.440 INFO [4187301:4187301] DCGM CudaContext Init completed successfully. Starting our TaskRunner. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:213] [CudaWorkerThread::Init]
2023-04-19 01:44:13.440 INFO [4187301:4187301] Created thread named "" ID 3100958720 DcgmThread ptr 0x0x117da28 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/common/DcgmThread/DcgmThread.cpp:116] [DcgmThread::Start]
2023-04-19 01:44:13.440 DEBUG [4187301:4187306] Thread handle 3100958720 running [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/common/DcgmThread/DcgmThread.cpp:305] [DcgmThread::RunInternal]
2023-04-19 01:44:13.559 INFO [4187302:4187302] DCGM CudaContext Init completed successfully. Starting our TaskRunner. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:213] [CudaWorkerThread::Init]
2023-04-19 01:44:13.559 INFO [4187302:4187302] Created thread named "" ID 3100958720 DcgmThread ptr 0x0x117e588 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/common/DcgmThread/DcgmThread.cpp:116] [DcgmThread::Start]
2023-04-19 01:44:13.559 DEBUG [4187302:4187332] Thread handle 3100958720 running [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/common/DcgmThread/DcgmThread.cpp:305] [DcgmThread::RunInternal]
2023-04-19 01:44:13.921 ERROR [4187301:4187306] Unable to load cuda module DcgmProfTesterKernels.ptx. cuSt: 222 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:165] [CudaWorkerThread::LoadModule]
2023-04-19 01:44:13.921 ERROR [4187301:4187306] loadModule failed with -3 for 0 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:395] [CudaWorkerThread::AttachToCudaDeviceFromTaskThread]
2023-04-19 01:44:13.921 ERROR [4187301:4187301] AttachToCudaDevice(0) returned -3 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:227] [CudaWorkerThread::Init]
2023-04-19 01:44:13.921 ERROR [4187301:4187301] m_cudaWorker.Init failed with -3 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DistributedCudaContext.cpp:1598] [DcgmNs::ProfTester::DistributedCudaContext::RunTest]
2023-04-19 01:44:13.922 ERROR [4187302:4187332] Unable to load cuda module DcgmProfTesterKernels.ptx. cuSt: 222 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:165] [CudaWorkerThread::LoadModule]
2023-04-19 01:44:13.922 ERROR [4187302:4187332] loadModule failed with -3 for 0 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:395] [CudaWorkerThread::AttachToCudaDeviceFromTaskThread]
2023-04-19 01:44:13.922 ERROR [4187302:4187302] AttachToCudaDevice(0) returned -3 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:227] [CudaWorkerThread::Init]
2023-04-19 01:44:13.922 ERROR [4187302:4187302] m_cudaWorker.Init failed with -3 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DistributedCudaContext.cpp:1598] [DcgmNs::ProfTester::DistributedCudaContext::RunTest]
2023-04-19 01:44:54.500 INFO [4187295:4187295] Skipping UnwatchFields() since DCGM validation is disabled. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DcgmProfTester.cpp:244] [DcgmProfTester::UnwatchFields]
2023-04-19 01:44:54.500 ERROR [4187295:4187295] Error 0 from RunTests(). Exiting. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DcgmProfTester.cpp:1170] [main]
$
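A note on reading the log above: the `cuSt: 222` value appears to be the `CUresult` status returned by the CUDA driver API when loading the module. In `cuda.h`, 222 is `CUDA_ERROR_UNSUPPORTED_PTX_VERSION`, meaning the PTX was produced by a toolchain the installed driver does not support. The sketch below pulls `cuSt` codes out of a log and names them. The lookup table is a hand-copied subset of `cuda.h`, and `decode_cust` is a hypothetical helper, not part of DCGM.

```python
import re

# Hand-copied subset of CUresult values from cuda.h (module-loading errors only).
CU_RESULT_NAMES = {
    200: "CUDA_ERROR_INVALID_IMAGE",
    209: "CUDA_ERROR_NO_BINARY_FOR_GPU",
    218: "CUDA_ERROR_INVALID_PTX",
    222: "CUDA_ERROR_UNSUPPORTED_PTX_VERSION",
    301: "CUDA_ERROR_FILE_NOT_FOUND",
}

# Matches the "cuSt: <number>" fragment seen in the dcgmproftester log lines.
CUST_PATTERN = re.compile(r"cuSt: (\d+)")


def decode_cust(log_text):
    """Return sorted (code, name) pairs for every distinct 'cuSt: N' in the log."""
    codes = {int(m.group(1)) for m in CUST_PATTERN.finditer(log_text)}
    return [(code, CU_RESULT_NAMES.get(code, "unknown")) for code in sorted(codes)]


sample = (
    "ERROR [4187301:4187306] Unable to load cuda module "
    "DcgmProfTesterKernels.ptx. cuSt: 222"
)
print(decode_cust(sample))
```

To run it against the real file, pass the contents of `./dcgm_log.txt` to `decode_cust` instead of the one-line sample.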
CUDA version on the host: 11.8.89
$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
$
@starry91,
Could you clarify which driver version is installed?
@nikkon-dev, I am using 520.61.05
$ nvidia-smi
Wed Apr 19 15:42:54 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:5E:00.0 Off |                    0 |
| N/A   35C    P8    15W /  70W |      2MiB / 15360MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:D8:00.0 Off |                    0 |
| N/A   33C    P8    15W /  70W |      2MiB / 15360MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$
@starry91,
Could you replace the ptx file next to the dcgmproftester11 binary with the attached one and see if that works?
I replaced DcgmProfTesterKernels.ptx with the file you provided, and dcgmproftester11 now runs successfully.
Does this .ptx file ship with the DCGM package? If so, the package probably doesn't include the latest version of the file.
@nikkon-dev Can you please confirm whether this is a bug on the DCGM side only, or whether it is caused by a mismatched configuration on our end?
@starry91,
This is an issue on the DCGM side. The ptx file was built with a newer version of the CUDA SDK than it should have been. This will be fixed in the next patch release.
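For anyone who wants to check a shipped ptx file for this kind of mismatch, here is a minimal sketch. It reads the `.version` directive at the top of the PTX text and compares it with the highest PTX ISA version a driver is expected to accept. It assumes that a driver reporting CUDA 11.x accepts PTX up to 7.x and one reporting CUDA 12.x accepts PTX up to 8.x with the same minor number, which matches the 11.x and 12.x releases I know of but should be verified against the PTX ISA release notes. The function names and `sample_ptx` are hypothetical; the sample is not the header of the actual DCGM file.

```python
import re

# The PTX ISA version is declared by a ".version major.minor" directive near the top.
VERSION_DIRECTIVE = re.compile(r"^\s*\.version\s+(\d+)\.(\d+)", re.MULTILINE)


def ptx_isa_version(ptx_text):
    """Return (major, minor) from the first .version directive in the PTX text."""
    match = VERSION_DIRECTIVE.search(ptx_text)
    if match is None:
        raise ValueError("no .version directive found")
    return int(match.group(1)), int(match.group(2))


def max_ptx_for_cuda(cuda_major, cuda_minor):
    """Assumed mapping: CUDA 11.x -> PTX 7.x, CUDA 12.x -> PTX 8.x."""
    if cuda_major not in (11, 12):
        raise ValueError("mapping only sketched for CUDA 11.x and 12.x")
    return cuda_major - 4, cuda_minor


def driver_can_load(ptx_text, cuda_major, cuda_minor):
    """True if the PTX ISA version does not exceed what the driver should accept."""
    return ptx_isa_version(ptx_text) <= max_ptx_for_cuda(cuda_major, cuda_minor)


# Hypothetical header of a PTX file built with a newer CUDA SDK.
sample_ptx = """//
// Generated by NVIDIA NVVM Compiler
//
.version 8.0
.target sm_52
.address_size 64
"""

# nvidia-smi in this thread reports "CUDA Version: 11.8" for driver 520.61.05.
print(ptx_isa_version(sample_ptx), driver_can_load(sample_ptx, 11, 8))
```

To check a real file, read DcgmProfTesterKernels.ptx with `open(path).read()` and pass the text in, along with the CUDA version that `nvidia-smi` reports for the installed driver.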
@nikkon-dev Can you please confirm whether this affects only dcgmproftester, or something else as well?
@starry91,
That affects the dcgmproftester only.