glowkey
The commits are not showing up as verified. I can see they are signed off (`git commit -s`) but not cryptographically signed (`git commit -S`). The workflow prevents merging unsigned commits.
DCGM-Exporter can monitor most of the DCGM fields listed on the page below, including many ECC errors. Please see the docs for ways to customize which fields are monitored. Field list: https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html#c.DCGM_FI_DEV_ECC_CURRENT
@vishpat this repo can accept PRs. Please create a PR from the above.
Injecting errors into DCGM does not inject errors into the driver, NVML, or any other layer lower than DCGM itself. If your PyTorch code integrates with DCGM to determine GPU...
The next major version of DCGM-Exporter, planned for sometime in the next few months, will have a capability like this. Stay tuned.
Yes, you can now add `DCGM_EXP_GPU_HEALTH_STATUS, gauge, GPU health status` to the list of watched metrics.
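As a rough sketch of what that looks like in practice, the snippet below writes a small custom counters file containing the health status line and shows (commented out) how it might be passed to the exporter. The file path is a hypothetical placeholder, and the `-f` invocation assumes a dcgm-exporter binary is installed locally; adjust both for your deployment.

```shell
# Hypothetical location for a custom counters file; not a path from this thread.
COUNTERS_FILE="${COUNTERS_FILE:-./custom-counters.csv}"

# Each non-comment line is: <field name>, <Prometheus metric type>, <help text>
cat > "$COUNTERS_FILE" <<'EOF'
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_EXP_GPU_HEALTH_STATUS, gauge, GPU health status
EOF

# Assumption: dcgm-exporter is installed locally; uncomment to run with this file.
# dcgm-exporter -f "$COUNTERS_FILE"
```

In a Kubernetes deployment the same lines would typically live in whatever ConfigMap or mounted file the exporter is already pointed at, rather than in a local file.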
The commit needs to be signed (`git commit -S`) before it can be merged.
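For anyone hitting the same check, one possible way to re-sign the most recent commit is sketched below. It assumes a GPG or SSH signing key is already configured in git and that the branch can safely be force-pushed; the function names are illustrative only.

```shell
# Re-sign the most recent commit, keeping its existing message.
# Assumes a signing key is already configured (e.g. via `user.signingkey`).
resign_last_commit() {
    git commit --amend --no-edit -S
}

# Print the signature details git sees for the most recent commit.
show_last_signature() {
    git log --show-signature -1
}

# Typical usage (the amended commit gets a new hash, so the push must be forced):
#   resign_last_commit && show_last_signature
#   git push --force-with-lease
```

Note that GitHub only shows the "Verified" tag when the signing key is also associated with the committer's GitHub account, so a locally valid signature is necessary but may not be sufficient.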
You'll notice that this commit does not have the "Verified" tag. Check out this link for details: https://docs.github.com/authentication/managing-commit-signature-verification/about-commit-signature-verification
I just took a moment to test these changes on an 8-GPU system with MIG enabled, and unfortunately the panels were empty. I'm far from a Grafana expert so...
This is a very strange error. You say that you've recompiled dcgm-exporter? Does the error happen with the official versions? It's very hard to determine what might be going wrong...