DCGM
A question about DCGM policy listening for XIDs
If I register for XID errors through a DCGM policy and listen, and a certain XID (for example, 79) occurs, will the policy keep reporting that XID until the GPU recovers, or will it report it only once? I look forward to your reply.
@BetaZYN,
It depends on how you read the XIDs. Each XID event is stored with its timestamp, and there is an API to get either the latest value in the TSDB or all values since a specific timestamp. The dcgmi CLI tool uses only the last value in the TSDB, so an XID may look "sticky" until another XID is reported. If you use the API directly, you can get, for example, all XIDs that happened within the last minute.
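To illustrate the difference between the two read styles, here is a minimal sketch in Python. It is a toy model, not the DCGM API: the `XidStore` class and its `record`, `latest`, and `since` methods are hypothetical names invented for this example, and the timestamps and XID numbers are made up. It only shows why polling the latest value looks "sticky" while reading since a timestamp returns each event once.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Event = Tuple[float, int]  # (timestamp in seconds, XID number)


@dataclass
class XidStore:
    """Hypothetical stand-in for a per-GPU time series of XID events."""

    _events: List[Event] = field(default_factory=list)

    def record(self, timestamp: float, xid: int) -> None:
        # Assumption: events arrive in non-decreasing timestamp order.
        self._events.append((timestamp, xid))

    def latest(self) -> Optional[Event]:
        # "Latest value" read: returns the same event on every poll
        # until a newer one is recorded.
        return self._events[-1] if self._events else None

    def since(self, timestamp: float) -> List[Event]:
        # "Values since" read: only events strictly newer than timestamp.
        idx = bisect_right(self._events, (timestamp, float("inf")))
        return self._events[idx:]


store = XidStore()
store.record(100.0, 31)
store.record(130.0, 43)

# Polling the latest value twice yields the same event both times.
first_poll = store.latest()
second_poll = store.latest()

# Reading since a moving cursor yields each event exactly once.
cursor = 0.0
batch_one = store.since(cursor)
cursor = batch_one[-1][0]
batch_two = store.since(cursor)

print(first_poll, second_poll)
print(batch_one, batch_two)
```

A consumer that advances its cursor to the newest timestamp it has seen gets an empty result on the next read, so nothing is reported twice; a consumer that only asks for the latest value cannot tell a repeated event from a stale one without comparing timestamps itself.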
The current DCGM version can't report XIDs 79, 119, and 120 due to limitations in the NVML library. Our team is working on a fix.
@nikkon-dev, thank you for your reply.
- Which specific API are you referring to that returns the XID together with its timestamp?
- Our current use case is as follows:

      // set group 2 policy condition with XID errors
      dcgmi policy -g 2 --set 0,0 -x
      // register group 2 for policy updates
      dcgmi policy -g 2 --reg

  If a GPU generates an XID while we are listening, will this XID be reported repeatedly until a new XID appears or until this XID disappears?