Nik Konyuchenko
Hi @guleng, the dcgm-exporter acquires pod resource information on **every metrics request** via the pod resources API, that is, the /var/lib/kubelet/pod-resources socket. We do not cache or store that...
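To illustrate the "no cache" behaviour described above, here is a minimal Python sketch. It is not dcgm-exporter code: the class `UncachedPodMapper`, the `list_pod_resources` callable, and the GPU-UUID-to-pod-name mapping shape are all hypothetical stand-ins for a lookup over the pod-resources socket.

```python
from typing import Callable, Dict, List

# Hypothetical shape: GPU UUID -> name of the pod currently holding it.
PodMap = Dict[str, str]


class UncachedPodMapper:
    """Toy model: look up pod assignments again on every metrics request."""

    def __init__(self, list_pod_resources: Callable[[], PodMap]) -> None:
        # Stand-in for a query against the kubelet pod-resources socket.
        self._list_pod_resources = list_pod_resources
        self.calls = 0

    def handle_metrics_request(self, gpu_uuids: List[str]) -> PodMap:
        # A fresh lookup per request; nothing is kept between requests.
        self.calls += 1
        current = self._list_pod_resources()
        return {uuid: current.get(uuid, "") for uuid in gpu_uuids}


# Usage: a reassignment between two requests is visible immediately.
state = {"GPU-0": "pod-a"}
mapper = UncachedPodMapper(lambda: dict(state))
first = mapper.handle_metrics_request(["GPU-0"])
state["GPU-0"] = "pod-b"
second = mapper.handle_metrics_request(["GPU-0"])
```

Because nothing is stored between requests, the second request in this sketch reflects the new assignment without any invalidation step.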
@itzsimpl, Can you please check the dmesg messages and confirm if you are using the GSP driver?
@BetaZYN, It depends on how you read the XIDs. Each XID event is stored with its timestamp, and there is an API to get either the latest value in the...
Could you provide the nvidia-smi output for 2g.20gb and 3g.40gb MIG configurations? Such GR_ACTIVE utilization may happen if you create Compute Instances that do not occupy the whole MIG Instance....
Hello @neggert, I'd like to suggest running the dcgmproftester12 tool from the DCGM package to create a synthetic load on the node (not pod) that has MIG enabled. I also...
CUDA 10 was removed from the OSS builds.
The DCP metrics (1001-1014) require an exclusive lock on the same hardware that the NVIDIA profiler uses. This means that two different processes cannot access the same metrics at the same time. The nv-hostengine,...
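The exclusivity described above can be pictured with a small Python model. This is only a conceptual sketch: `ExclusiveProfilingLock` and the client names are hypothetical, and the real lock lives in the driver and hardware, not in a `threading.Lock`.

```python
import threading
from typing import Optional


class ExclusiveProfilingLock:
    """Toy model of a single-owner lock on profiling hardware."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.owner: Optional[str] = None

    def try_acquire(self, client: str) -> bool:
        # Non-blocking: a second client is refused, not queued.
        if self._lock.acquire(blocking=False):
            self.owner = client
            return True
        return False

    def release(self, client: str) -> None:
        # Only the current owner may release the lock.
        if self.owner != client:
            raise RuntimeError(f"{client} does not hold the lock")
        self.owner = None
        self._lock.release()


# Usage: the second client is refused until the first one releases.
lock = ExclusiveProfilingLock()
hostengine_ok = lock.try_acquire("nv-hostengine")
profiler_ok = lock.try_acquire("profiler")
lock.release("nv-hostengine")
profiler_retry_ok = lock.try_acquire("profiler")
```

In this model the second client simply fails to acquire the lock while the first holds it, which mirrors the idea that only one process at a time can read these metrics.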
@lynchyo, It is only possible to support this if the GPU is in passthrough mode (meaning that the host does not see or use it). The limitation is due to...
@deferen2, Yes, groups are limited to a single nv-hostengine instance. Internally, groups are just lists of entities local to the hostengine, without any special logic attached to them. WBR,...
@starry91, I'm sorry, but this may not be the best place to ask about MIG or driver configuration. DCGM doesn't have any control over those matters.