Nik Konyuchenko
Can you please check the dmesg logs and see if there is any information about the nvswitches? If there is, can you tell me how long it took to retrain...
@jiaxinonly, Could you provide debug logs for nv-hostengine and nvvs for the timeout issue? You may need to rerun the nv-hostengine with the `-f host.debug.log --log-level debug` and run `dcgmi...
@jiaxinonly, `dcgmi diag` [has](https://github.com/NVIDIA/DCGM/blob/a33560c9c138c617f3ee6cb50df11561302e5743/dcgmlib/src/DcgmApi.cpp#L2991) a hardcoded 8-hour timeout in its communication protocol. Alternatively, you may use `dcgmi diag --iterations N` to restart the diagnostic sequence N times.
@jiaxinonly, Please keep in mind that the parameter `targeted_power.test_duration=864000` sets the duration of each test to ten days, with a timeout of 8 hours. However, this value should not exceed...
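To make the arithmetic concrete, here is a rough sketch of how one might split a long soak test into shorter runs that each stay under the 8-hour timeout and then repeat them with `--iterations`. The helper name `plan_iterations` and the 6-hour per-run duration are only illustrative assumptions, not anything DCGM provides, and a real diagnostic run takes longer than the `test_duration` of a single test, so leave some margin.

```python
# Assumption: 8 hours is the hardcoded dcgmi diag timeout mentioned above.
DIAG_TIMEOUT_S = 8 * 60 * 60  # 28800 seconds


def plan_iterations(total_s: int, per_run_s: int) -> int:
    """Return how many runs of per_run_s seconds are needed to cover total_s.

    Hypothetical helper: it only does the arithmetic and does not call DCGM.
    """
    if per_run_s <= 0:
        raise ValueError("per-run duration must be positive")
    if per_run_s >= DIAG_TIMEOUT_S:
        raise ValueError("per-run duration must stay below the 8-hour timeout")
    # Integer ceiling division, so a partial final run still counts as a run.
    return -(-total_s // per_run_s)


if __name__ == "__main__":
    requested_s = 864000   # the value from targeted_power.test_duration
    per_run_s = 6 * 3600   # illustrative choice: 6 hours per run
    print(requested_s / 86400, "days requested")
    print(plan_iterations(requested_s, per_run_s), "iterations needed")
```

With these numbers, 864000 seconds is 10 days, and covering it in 6-hour runs takes 40 iterations, which would be the N passed to `--iterations`.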
@mamccorm I need to correct @bmarchant. The next major DCGM release, which is planned for later this year, will remove CUDA 10. However, the upcoming release in the DCGM 3.x branch...
Hi @mamccorm, I just wanted to clarify that the DCGM package doesn't rely on any CUDA packages. All the necessary components are linked or provided by the DCGM package itself...
@graywen24, Unfortunately, dcgm_prometheus.py is not actively supported; it is meant as an example. The dcgm-exporter project is the one meant to provide Prometheus metrics, and it is actively supported.
@graywen24, dcgm-exporter may work outside of a k8s environment; in general, it is just a small binary written in Go. If the DCGM is installed on the machine, you do...
@mintchocohoco, The DCP metrics (1001...) are supported starting from the Turing architecture. Pascal is not supported. For some metrics (1013, 1014), you would need at least an Ampere GA100 chip.
That line runs scripts from the dcgmbuild/scripts/ directory one by one, and each of those scripts builds some 3rd party dependency. From your description, it's unclear which dependency failed to...