dcgmproftester not working with v3.1.7

Open starry91 opened this issue 1 year ago • 8 comments

Hi,

dcgmproftester11 gives me the following error with DCGM v3.1.7. This used to work with v3.1.3. Is there a known bug here?

$ dcgmproftester11 --no-dcgm-validation -t 1004 -d 10
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
2023-04-18 03:46:10.915 ERROR [2843478:2843478] Error 0 from RunTests(). Exiting. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DcgmProfTester.cpp:1170] [main]
$

Following are the logs:

$ dcgmproftester11 --no-dcgm-validation -t 1004 -d 10 --log-level DEBUG --log-file ./dcgm_log.txt
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
2023-04-19 01:44:54.500 ERROR [4187295:4187295] Error 0 from RunTests(). Exiting. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DcgmProfTester.cpp:1170] [main]
$
$
$ cat ./dcgm_log.txt
2023-04-19 01:44:13.157 INFO  [4187295:4187295] Skipping CreateDcgmGroups() since DCGM validation is disabled. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DcgmProfTester.cpp:117] [DcgmProfTester::CreateDcgmGroups]
2023-04-19 01:44:13.204 INFO  [4187295:4187295] Skipping CreateDcgmGroups() since DCGM validation is disabled. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/PhysicalGpu.cpp:1535] [DcgmNs::ProfTester::PhysicalGpu::CreateDcgmGroups]
2023-04-19 01:44:13.204 INFO  [4187295:4187295] Skipping CreateDcgmGroups() since DCGM validation is disabled. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/PhysicalGpu.cpp:1535] [DcgmNs::ProfTester::PhysicalGpu::CreateDcgmGroups]
2023-04-19 01:44:13.204 INFO  [4187295:4187295] Skipping WatchFields() since DCGM validation is disabled. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DcgmProfTester.cpp:193] [DcgmProfTester::WatchFields]
2023-04-19 01:44:13.440 INFO  [4187301:4187301] DCGM CudaContext Init completed successfully. Starting our TaskRunner. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:213] [CudaWorkerThread::Init]
2023-04-19 01:44:13.440 INFO  [4187301:4187301] Created thread named "" ID 3100958720 DcgmThread ptr 0x0x117da28 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/common/DcgmThread/DcgmThread.cpp:116] [DcgmThread::Start]
2023-04-19 01:44:13.440 DEBUG [4187301:4187306] Thread handle 3100958720 running [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/common/DcgmThread/DcgmThread.cpp:305] [DcgmThread::RunInternal]
2023-04-19 01:44:13.559 INFO  [4187302:4187302] DCGM CudaContext Init completed successfully. Starting our TaskRunner. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:213] [CudaWorkerThread::Init]
2023-04-19 01:44:13.559 INFO  [4187302:4187302] Created thread named "" ID 3100958720 DcgmThread ptr 0x0x117e588 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/common/DcgmThread/DcgmThread.cpp:116] [DcgmThread::Start]
2023-04-19 01:44:13.559 DEBUG [4187302:4187332] Thread handle 3100958720 running [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/common/DcgmThread/DcgmThread.cpp:305] [DcgmThread::RunInternal]
2023-04-19 01:44:13.921 ERROR [4187301:4187306] Unable to load cuda module DcgmProfTesterKernels.ptx. cuSt: 222 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:165] [CudaWorkerThread::LoadModule]
2023-04-19 01:44:13.921 ERROR [4187301:4187306] loadModule failed with -3 for 0 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:395] [CudaWorkerThread::AttachToCudaDeviceFromTaskThread]
2023-04-19 01:44:13.921 ERROR [4187301:4187301] AttachToCudaDevice(0) returned -3 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:227] [CudaWorkerThread::Init]
2023-04-19 01:44:13.921 ERROR [4187301:4187301] m_cudaWorker.Init failed with -3 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DistributedCudaContext.cpp:1598] [DcgmNs::ProfTester::DistributedCudaContext::RunTest]
2023-04-19 01:44:13.922 ERROR [4187302:4187332] Unable to load cuda module DcgmProfTesterKernels.ptx. cuSt: 222 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:165] [CudaWorkerThread::LoadModule]
2023-04-19 01:44:13.922 ERROR [4187302:4187332] loadModule failed with -3 for 0 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:395] [CudaWorkerThread::AttachToCudaDeviceFromTaskThread]
2023-04-19 01:44:13.922 ERROR [4187302:4187302] AttachToCudaDevice(0) returned -3 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/CudaWorkerThread.cpp:227] [CudaWorkerThread::Init]
2023-04-19 01:44:13.922 ERROR [4187302:4187302] m_cudaWorker.Init failed with -3 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DistributedCudaContext.cpp:1598] [DcgmNs::ProfTester::DistributedCudaContext::RunTest]
2023-04-19 01:44:54.500 INFO  [4187295:4187295] Skipping UnwatchFields() since DCGM validation is disabled. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DcgmProfTester.cpp:244] [DcgmProfTester::UnwatchFields]
2023-04-19 01:44:54.500 ERROR [4187295:4187295] Error 0 from RunTests(). Exiting. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmproftester/DcgmProfTester.cpp:1170] [main]
$

CUDA version on the host: 11.8.89

$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
$

starry91 avatar Apr 19 '23 05:04 starry91
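
A note on the log above: `cuSt` appears to be the raw CUDA driver API status (`CUresult`) returned when loading the module. In `cuda.h`, 222 is `CUDA_ERROR_UNSUPPORTED_PTX_VERSION`. The following is a minimal sketch for pulling such codes out of a dcgmproftester log and naming them. The helper name `decode_cust` and the small lookup table are illustrative and not part of DCGM; the table covers only a few module-loading codes.

```python
import re

# Small subset of CUresult values from cuda.h that relate to module loading.
# This table is illustrative, not exhaustive.
CU_RESULT_NAMES = {
    200: "CUDA_ERROR_INVALID_IMAGE",
    209: "CUDA_ERROR_NO_BINARY_FOR_GPU",
    218: "CUDA_ERROR_INVALID_PTX",
    222: "CUDA_ERROR_UNSUPPORTED_PTX_VERSION",
}

# Matches the "cuSt: <number>" fragment seen in the dcgmproftester log lines.
CUST_PATTERN = re.compile(r"cuSt:\s*(\d+)")


def decode_cust(log_text):
    """Return a sorted list of (code, name) for each distinct cuSt code found."""
    codes = {int(m.group(1)) for m in CUST_PATTERN.finditer(log_text)}
    return [(code, CU_RESULT_NAMES.get(code, "unknown")) for code in sorted(codes)]


# One of the ERROR lines from the log above, shortened.
sample = (
    "ERROR [4187301:4187306] Unable to load cuda module "
    "DcgmProfTesterKernels.ptx. cuSt: 222"
)

for code, name in decode_cust(sample):
    print(code, name)  # prints: 222 CUDA_ERROR_UNSUPPORTED_PTX_VERSION
```

To run it against a real log, read the file (for example `./dcgm_log.txt`) and pass its contents to `decode_cust`.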

@starry91,

Could you clarify which driver version is installed?

nikkon-dev avatar Apr 19 '23 18:04 nikkon-dev

@nikkon-dev, I am using 520.61.05

$ nvidia-smi
Wed Apr 19 15:42:54 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:5E:00.0 Off |                    0 |
| N/A   35C    P8    15W /  70W |      2MiB / 15360MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:D8:00.0 Off |                    0 |
| N/A   33C    P8    15W /  70W |      2MiB / 15360MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$

starry91 avatar Apr 19 '23 19:04 starry91

@starry91,

Could you replace the ptx file next to the dcgmproftester11 binary with the attached one and see if that works?

DcgmProfTesterKernels.ptx.zip

nikkon-dev avatar Apr 19 '23 21:04 nikkon-dev

I replaced the installed DcgmProfTesterKernels.ptx with the file you attached, and the test binary dcgmproftester11 now runs successfully.

Does this .ptx file ship with the DCGM package? If so, the package probably doesn't include the latest version of the file.

iprakhar22 avatar Apr 20 '23 09:04 iprakhar22

@nikkon-dev Can you please confirm whether this is a bug on the DCGM side only, or whether it is caused by some mismatched config on our end?

starry91 avatar Apr 24 '23 10:04 starry91

@starry91,

This is an issue on the DCGM side. The ptx file was built with a newer version of the CUDA SDK than it should have been. This will be fixed in the next patch release.

nikkon-dev avatar Apr 24 '23 20:04 nikkon-dev
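
For anyone wanting to check a shipped .ptx file against this explanation: a PTX file declares its ISA version in a `.version` directive near the top, and a driver rejects PTX newer than it understands. The sketch below reads that directive and compares it with a maximum version supplied by the caller. The function names, the sample PTX header, and the limit of 7.8 for a CUDA 11.8 driver are assumptions for illustration; confirm the actual limit for your driver in the PTX ISA release notes, and read the real version from the installed file.

```python
import re

# The .version directive is the first non-comment statement in a PTX file,
# e.g. ".version 7.8". PTX comments start with "//", so they do not match.
VERSION_DIRECTIVE = re.compile(r"^\s*\.version\s+(\d+)\.(\d+)", re.MULTILINE)


def ptx_isa_version(ptx_text):
    """Return the PTX ISA version as a (major, minor) tuple, or None if absent."""
    match = VERSION_DIRECTIVE.search(ptx_text)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def is_loadable(ptx_text, max_supported):
    """True if the file's PTX ISA version does not exceed max_supported."""
    version = ptx_isa_version(ptx_text)
    return version is not None and version <= max_supported


# Hypothetical header; the real DcgmProfTesterKernels.ptx may declare a
# different version.
sample_ptx = """//
// Generated by NVIDIA NVVM Compiler
//
.version 8.0
.target sm_52
.address_size 64
"""

# Assumption: a CUDA 11.8 driver accepts PTX ISA versions up to 7.8.
MAX_PTX_FOR_DRIVER = (7, 8)

print(ptx_isa_version(sample_ptx))
print(is_loadable(sample_ptx, MAX_PTX_FOR_DRIVER))
```

With a real install, pass the contents of the DcgmProfTesterKernels.ptx file that sits next to the dcgmproftester11 binary instead of `sample_ptx`.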

@nikkon-dev Can you please confirm if this affects only the dcgmProfTester or something else as well?

starry91 avatar Apr 25 '23 05:04 starry91

@starry91,

That affects the dcgmproftester only.

nikkon-dev avatar Apr 25 '23 06:04 nikkon-dev