Nik Konyuchenko comments

Results: 96 comments of Nik Konyuchenko

Extracting errors and bugs in k8s environment

Hi @guleng, the dcgm-exporter acquires the pod resource information on **every metrics request** via the pod-resources API, that is, the /var/lib/kubelet/pod-resources socket. We do not cache or store that...
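
To illustrate what "on every metrics request, with no caching" implies, here is a minimal Python sketch. It is not dcgm-exporter code: `PerRequestExporter`, the `lookup` callable and the `gpu_util` metric name are hypothetical stand-ins, and the real exporter talks to the kubelet socket rather than to an injected function. The point is only that a fresh lookup per request means a changed GPU-to-pod mapping shows up on the very next scrape.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a query against the kubelet pod-resources socket.
# It returns a mapping of GPU identifier -> pod name.
PodLookup = Callable[[], Dict[str, str]]


class PerRequestExporter:
    """Illustrative exporter that re-resolves the GPU-to-pod mapping per request."""

    def __init__(self, lookup: PodLookup) -> None:
        self._lookup = lookup  # nothing is cached between requests

    def render_metrics(self, gpu_util: Dict[str, float]) -> List[str]:
        pod_by_gpu = self._lookup()  # fresh query on every metrics request
        lines = []
        for gpu, util in sorted(gpu_util.items()):
            pod = pod_by_gpu.get(gpu, "")  # empty label if no pod owns the GPU
            lines.append(f'gpu_util{{gpu="{gpu}",pod="{pod}"}} {util}')
        return lines
```

The trade-off in such a design is one extra lookup per scrape in exchange for never serving a stale pod label.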

Errors in nv-hostengine log

@itzsimpl, can you please check the dmesg messages and confirm whether you are using the GSP driver?

a question about dcgm policy listening for xid

@BetaZYN, It depends on how you read the XIDs. Each XID event is stored with its timestamp, and there is an API to get either the latest value in the...

DCGM_FI_PROF_GR_ENGINE_ACTIVE and MIG

Could you provide the nvidia-smi output for 2g.20gb and 3g.40gb MIG configurations? Such GR_ACTIVE utilization may happen if you create Compute Instances that do not occupy the whole MIG Instance....

DCGM_FI_PROF_GR_ENGINE_ACTIVE and MIG

Hello @neggert, I'd like to suggest running the dcgmproftester12 tool from the DCGM package to create a synthetic load on the node (not pod) that has MIG enabled. I also...

Removal of dependencies on cuda v10

CUDA 10 was removed from the OSS builds.

For profiling metrics, dcgmi reports an error message: The third-party Profiling module returned an unrecoverable error

The DCP metrics (1001-1014) require an exclusive lock on the same hardware that the NVIDIA profiler uses. This means that two different processes cannot access these metrics at the same time. The nv-hostengine,...
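
As an analogy for the exclusive-access behaviour described above, the following Python sketch uses an advisory file lock on Linux: the first holder succeeds, and a second attempt is refused until the first releases. This is only an illustration of the locking pattern. It says nothing about how DCGM or the driver actually implements the lock, and `try_exclusive`, `release` and the `profiling.lock` file name are hypothetical.

```python
import fcntl
import os
import tempfile


def try_exclusive(path: str):
    """Return an fd holding an exclusive lock on path, or None if it is taken."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release(fd: int) -> None:
    """Drop the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


def demo() -> tuple:
    """Show that a second client is refused while the first holds the lock."""
    path = os.path.join(tempfile.mkdtemp(), "profiling.lock")
    first = try_exclusive(path)   # first client gets the lock
    second = try_exclusive(path)  # refused while the first is held
    release(first)
    third = try_exclusive(path)   # succeeds once the lock is free again
    outcome = (first is not None, second is None, third is not None)
    release(third)
    return outcome
```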

For profiling metrics, dcgmi reports an error message: The third-party Profiling module returned an unrecoverable error

@lynchyo, It is only possible to support this if the GPU is in passthrough mode (meaning that the host does not see or use it). The limitation is due to...

Does DCGM support creating groups of GPUs from different hosts?

@deferen2, Yes, groups are limited to a single nv-hostengine instance. Internally, groups are just a list of entities local to the hostengine without any special logic attached to them. WBR,...

Is there a way to disallow sharing of MIG devices?

@starry91, I'm sorry, but this may not be the best place to ask about MIG or driver configuration. DCGM doesn't have any control over those matters.