
A question about DCGM policy listening for XIDs

Open BetaZYN opened this issue 10 months ago • 2 comments

If I register for XID errors through a DCGM policy and listen, and a certain XID (for example, 79) occurs, will the policy keep reporting that XID until the GPU recovers, or will it report it only once? I look forward to your reply.

BetaZYN, Apr 23 '24 09:04

@BetaZYN,

It depends on how you read the XIDs. Each XID event is stored with its timestamp, and there is an API to get either the latest value in the TSDB or all values since a specific timestamp. The `dcgmi` CLI tool uses only the last value in the TSDB, so an XID may look "sticky" until another XID is reported. If you use the API directly, you can get, for example, all XIDs that happened within the last minute.
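The difference between the two read modes can be illustrated with a toy model. This is only a sketch of the semantics described above, not DCGM code: `XidStore`, `record`, `latest`, and `since` are hypothetical names invented for the illustration, and the XID numbers and timestamps are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class XidStore:
    """Hypothetical stand-in for a per-GPU time series of XID events."""

    # (timestamp, xid) pairs, assumed to be appended in timestamp order.
    samples: List[Tuple[float, int]] = field(default_factory=list)

    def record(self, timestamp: float, xid: int) -> None:
        """Store one XID event with its timestamp."""
        self.samples.append((timestamp, xid))

    def latest(self) -> Optional[int]:
        """Return only the most recent XID, or None if nothing was recorded."""
        return self.samples[-1][1] if self.samples else None

    def since(self, timestamp: float) -> List[Tuple[float, int]]:
        """Return every event strictly newer than the given timestamp."""
        return [(ts, xid) for ts, xid in self.samples if ts > timestamp]


store = XidStore()
store.record(100.0, 31)
store.record(130.0, 43)

# Reading the latest value repeatedly returns the same XID each time,
# even though no new event has occurred: this is the "sticky" appearance.
latest_polls = [store.latest() for _ in range(3)]

# Reading since a cursor returns each event once, provided the caller
# advances the cursor to the newest timestamp it has already seen.
cursor = 0.0
first_read = store.since(cursor)
cursor = first_read[-1][0]
second_read = store.since(cursor)

print(latest_polls)   # [43, 43, 43]
print(first_read)     # [(100.0, 31), (130.0, 43)]
print(second_read)    # []
```

In this model, a reader that polls only the latest value cannot tell a repeated XID from a stale one, while a reader that tracks its own timestamp cursor sees each event exactly once.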

The current DCGM version can't report XIDs 79, 119, and 120 because of limitations in the NVML library. Our team is working on a fix.

nikkon-dev, Apr 24 '24 02:04

@nikkon-dev, thank you for your reply.

  1. Which specific API are you referring to that can get the XID and its timestamp?
  2. Our current use case is as follows:

         // set group 2 policy condition with XID errors
         dcgmi policy -g 2 --set 0,0 -x
         // register group 2 for policy updates
         dcgmi policy -g 2 --reg

     If a GPU generates an XID while we are listening, will this XID be reported repeatedly until a new XID appears or until this XID disappears?

BetaZYN, Apr 24 '24 04:04