Nik Konyuchenko comments

Results: 96 comments of Nik Konyuchenko

Couldn't load a definition for ShutdownPlugin in plugin libSoftware.so

Can you please check the dmesg logs and see if there is any information about the nvswitches? If there is, can you tell me how long it took to retrain...

Couldn't load a definition for ShutdownPlugin in plugin libSoftware.so

@jiaxinonly, could you provide debug logs for nv-hostengine and nvvs for the timeout issue? You may need to rerun nv-hostengine with `-f host.debug.log --log-level debug` and run `dcgmi...

Couldn't load a definition for ShutdownPlugin in plugin libSoftware.so

@jiaxinonly, `dcgmi diag` [has](https://github.com/NVIDIA/DCGM/blob/a33560c9c138c617f3ee6cb50df11561302e5743/dcgmlib/src/DcgmApi.cpp#L2991) a hardcoded 8-hour timeout in the communication protocol. Alternatively, you may use `dcgmi diag --iterations N` to restart the diagnostic sequence N times.

Couldn't load a definition for ShutdownPlugin in plugin libSoftware.so

@jiaxinonly, please keep in mind that the parameter `targeted_power.test_duration=864000` sets the duration of each test to ten days, while the timeout is 8 hours. However, this value should not exceed...
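
To make the mismatch in the comment above concrete, here is a minimal arithmetic sketch. It assumes `test_duration` is given in seconds (consistent with 864000 being described as ten days); the helper names `to_days` and `exceeds_timeout` are hypothetical and are not part of DCGM.

```python
# Assumption: test_duration is expressed in seconds.
SECONDS_PER_DAY = 24 * 60 * 60
TIMEOUT_SECONDS = 8 * 60 * 60  # the hardcoded 8-hour limit mentioned in the comments


def to_days(seconds: int) -> float:
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


def exceeds_timeout(test_duration_s: int, timeout_s: int = TIMEOUT_SECONDS) -> bool:
    """Return True if the requested test duration is longer than the timeout."""
    return test_duration_s > timeout_s


requested = 864000
print(to_days(requested))          # 10.0
print(exceeds_timeout(requested))  # True
```

Under this assumption, a single ten-day test cannot finish within the 8-hour window.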

Removal of dependencies on cuda v10

@mamccorm, I need to correct @bmarchant. The next major DCGM release, which is planned for later this year, will remove CUDA 10. However, the upcoming release in the DCGM 3.x branch...

Removal of dependencies on cuda v10

Hi @mamccorm, I just wanted to clarify that the DCGM package doesn't rely on any CUDA packages. All the necessary components are linked or provided by the DCGM package itself...

1:2.3.4 version dcgm_prometheus.py error AttributeError: 'DcgmPrometheus' object has no attribute 'm_publishFieldIds'

@graywen24, unfortunately, dcgm_prometheus.py is not actively supported; it is provided as an example. The dcgm-exporter project is meant to provide Prometheus metrics and is actively supported.

1:2.3.4 version dcgm_prometheus.py error AttributeError: 'DcgmPrometheus' object has no attribute 'm_publishFieldIds'

@graywen24, dcgm-exporter may work outside of the k8s environment; in general, it is just a small binary written in Go. If DCGM is installed on the machine, you do...

Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded

@mintchocohoco, the DCP metrics (1001...) are supported starting with the Turing architecture. Pascal is not supported. For some metrics (1013, 1014), you would need at least an Ampere GA100 chip.

makefile for test7 missing

That line runs the scripts from the dcgmbuild/scripts/ directory one by one, and each of those scripts builds a third-party dependency. From your description, it's unclear which dependency failed to...