For profiling metrics, dcgmi reports an error message: The third-party Profiling module returned an unrecoverable error

Open jaslip opened this issue 2 years ago • 6 comments

The problem is that dcgmi cannot query profiling metrics.

  • Tesla V100 GPU. DCGM version: 3.1.3

  • nvidia-smi :

Tue Jan 10 16:00:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |

  • dcgmi modules -l shows that the Profiling module is loaded:

+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Loaded                                           |

  • Querying profiling metrics (SM occupancy) with dcgmi fails:

dcgmi dmon -e 1002,1003
Error setting watches. Result: The third-party Profiling module returned an unrecoverable error

  • nv-hostengine debug log:

2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] AddMetricToConfig was successful for deviceIndex 0. metricName sm__warps_active.avg.pct_of_peak_sustained_elapsed [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:905] [DcgmLopConfig::AddMetricToConfig]
2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] CreateConfigAndPrefixImages was successful for deviceIndex 0. imageSize 172 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:348] [DcgmLopConfig::CreateConfigAndPrefixImages]
2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] counterDataImageSize 15316 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:237] [DcgmLopConfig::InitializeWithMetrics]
2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] InitializeCounterData was successful for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:96] [DcgmLopConfig::InitializeCounterData]
2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] InitializeWithMetrics was successful for deviceIndex 0. 2 metrics [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:249] [DcgmLopConfig::InitializeWithMetrics]
2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] Successfully added 1 metric groups. [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopGpu.cpp:166] [DcgmLopGpu::InitializeWithMetrics]
2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] Enabling metrics for gpuId 0 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2588] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ReconfigureLopGpu]
2023-01-09 17:34:34.535 ERROR [933520:933752] [[Profiling]] [PerfWorks] Got status 1 from NVPW_DCGM_PeriodicSampler_BeginSession() on deviceIndex 0 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopGpu.cpp:351] [DcgmLopGpu::BeginSession]
2023-01-09 17:34:34.535 ERROR [933520:933752] [[Profiling]] EnableMetrics returned -37 The third-party Profiling module returned an unrecoverable error [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2591] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ReconfigureLopGpu]
2023-01-09 17:34:34.535 ERROR [933520:933752] [[Profiling]] Unable to reconfigure LOP metric watches for GpuId {0} [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2680] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ChangeWatchState]
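In a longer hostengine log, the first ERROR entry from the Profiling module is usually the one to look at (here, the PerfWorks BeginSession call). A small filter along these lines might help pull those entries out. It is only a sketch: the function name is made up for this example, and the log path in the usage comment is an assumption, so substitute whatever file your nv-hostengine writes to.

```shell
# Hypothetical helper (not part of DCGM): print only the ERROR entries that
# the Profiling module logged, in the order they were written.
# Reads the file(s) given as arguments, or stdin when none are given.
profiling_errors() {
    grep -E ' ERROR \[[0-9]+:[0-9]+] \[\[Profiling]]' "$@"
}

# Usage (the log path is an assumption):
#   profiling_errors /var/log/nv-hostengine.log | head -n 1
```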

jaslip avatar Jan 10 '23 08:01 jaslip

I've seen this before. It is usually caused by other running profilers, such as ncu or nvprof, or by another DCGM instance in embedded mode.

FindHao avatar Jan 17 '23 14:01 FindHao

Thanks bro! I encountered exactly the same error when running dcgmi dmon -e 1011 on the host machine after I had started a [DCGM-Exporter](https://github.com/NVIDIA/dcgm-exporter) Docker container. The command works normally again once I stop the DCGM-Exporter container.
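A quick way to look for such a competing process is to filter the host's process list for the tools mentioned in this thread. This is a rough sketch, not a DCGM feature: the function name is hypothetical, and the process names it matches (ncu, nvprof, nv-hostengine, dcgm-exporter) are assumptions about how these tools appear in ps on your system.

```shell
# Hypothetical helper: from "PID COMMAND" lines on stdin, keep the processes
# named in this thread as possible holders of the profiling hardware.
list_dcp_candidates() {
    awk '$2 ~ /^(ncu|nvprof|nv-hostengine|dcgm-exporter)$/ { print $1, $2 }'
}

# Usage: processes inside Docker containers are visible in the host's list too.
#   ps -eo pid=,comm= | list_dcp_candidates
```

An empty result does not prove that nothing else holds the lock; it only rules out the usual suspects by name.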

solrex avatar Dec 12 '23 03:12 solrex

The DCP metrics (1001-1014) require an exclusive lock on the same hardware that the NVIDIA profilers use. This means that two different processes cannot collect these metrics at the same time: the nv-hostengine, a dcgm-exporter that runs the embedded hostengine, and the profilers are mutually exclusive. There are two ways to avoid the conflict:

  1. dcgmi provides pause/resume functionality that temporarily stops DCP metrics gathering, which allows a profiler to run in the meantime.
  2. dcgm-exporter can connect to a standalone nv-hostengine instead of running the embedded one, using the -r command-line argument.
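Both options can be sketched in shell roughly as follows. The wrapper function name is hypothetical, the ncu command line is only an illustration, and localhost:5555 assumes the hostengine is listening on its default port on the same machine.

```shell
# Hypothetical wrapper: pause DCP metric collection in the running
# nv-hostengine, run the given command (for example a profiler), then
# resume collection whether or not the command succeeded.
run_with_dcp_paused() {
    dcgmi profile --pause || return 1
    "$@"
    dcp_rc=$?
    dcgmi profile --resume
    return "$dcp_rc"
}

# Option 1 usage (the profiler command line is only an illustration):
#   run_with_dcp_paused ncu ./my_app
#
# Option 2: start a standalone hostengine, then point dcgm-exporter at it
# instead of letting it run an embedded one (assumes the default port 5555):
#   nv-hostengine
#   dcgm-exporter -r localhost:5555
```

With option 2, dcgmi on the host and dcgm-exporter both talk to the same hostengine, so only one process holds the lock.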

nikkon-dev avatar Dec 12 '23 07:12 nikkon-dev

If one of the two hostengines runs on a node and the other runs in a VM virtualized on that node, does this still work? Thanks.

lynchyo avatar Feb 21 '24 08:02 lynchyo

@lynchyo, it is only possible to support this if the GPU is in passthrough mode (meaning that the host does not see or use it). The limitation is due to hardware, not the driver/DCGM/profiling software.

nikkon-dev avatar Feb 21 '24 20:02 nikkon-dev

Got it, thank you very much.

lynchyo avatar Feb 22 '24 02:02 lynchyo