Add advanced health check configuration to config file
Enhanced Error-handling config
Current State
See https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
The NVIDIA GPU Device Plugin
We register for NVML events of type `nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError`.
We treat the following XIDs as non-fatal errors:
| XID | Description |
|---|---|
| 13 | Graphics Engine Exception |
| 31 | GPU memory page fault |
| 43 | GPU stopped processing |
| 45 | Preemptive cleanup, due to previous errors |
| 68 | Video processor exception |
| 109 | Context Switch Timeout Error |
We allow additional XIDs to be specified in the `DP_DISABLE_HEALTHCHECKS` envvar with the following logic:
- If the value is `xids` or `all`, we disable health checks entirely.
- Otherwise, the value is treated as a comma-separated list of numeric XIDs to ignore, e.g. `109,68`.
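The envvar logic above can be sketched in Go as follows. This is an illustrative sketch, not the plugin's actual code: the function name `parseDisableHealthChecks`, the case-insensitive comparison, and the choice to return an error on a malformed entry are all assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// defaultSkippedXIDs mirrors the table of XIDs treated as non-fatal above.
var defaultSkippedXIDs = []uint64{13, 31, 43, 45, 68, 109}

// parseDisableHealthChecks interprets a DP_DISABLE_HEALTHCHECKS value.
// It returns disableAll=true for "xids" or "all". Otherwise it returns the
// default non-fatal XIDs extended with the numeric entries in the value.
// Hypothetical helper: rejecting malformed entries is an assumption.
func parseDisableHealthChecks(value string) (disableAll bool, skipped map[uint64]bool, err error) {
	skipped = make(map[uint64]bool)
	for _, x := range defaultSkippedXIDs {
		skipped[x] = true
	}

	v := strings.ToLower(strings.TrimSpace(value))
	if v == "xids" || v == "all" {
		return true, skipped, nil
	}

	for _, field := range strings.Split(v, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue // tolerate an empty value or stray commas
		}
		xid, perr := strconv.ParseUint(field, 10, 64)
		if perr != nil {
			return false, nil, fmt.Errorf("invalid XID %q: %w", field, perr)
		}
		skipped[xid] = true
	}
	return false, skipped, nil
}
```

With `DP_DISABLE_HEALTHCHECKS=109,68,79`, this sketch would keep health checks enabled and add XID 79 to the set of ignored XIDs.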
The GKE Device Plugin
See https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/0509b1f9f4b9a357b44ba65e7b508ded8bd5ecf0/pkg/gpu/nvidia/health_check/health_checker.go#L41
By default the following error is checked:
| XID | Description |
|---|---|
| 48 | Double-bit ECC Error |
The `XID_CONFIG` envvar is used to specify a comma-separated list of additional XIDs to treat as critical.
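A minimal Go sketch of this behaviour is shown below. It does not reproduce the GKE plugin's code: the function name `criticalXIDs` is hypothetical, and silently skipping entries that do not parse as numbers is an assumption.

```go
package main

import (
	"strconv"
	"strings"
)

// criticalXIDs returns the set of XIDs treated as critical: the default (48)
// plus any numeric entries from an XID_CONFIG-style comma-separated value.
// Hypothetical helper: skipping unparsable entries is an assumption.
func criticalXIDs(xidConfig string) map[uint64]bool {
	critical := map[uint64]bool{48: true}
	for _, field := range strings.Split(xidConfig, ",") {
		xid, err := strconv.ParseUint(strings.TrimSpace(field), 10, 64)
		if err != nil {
			continue // covers the empty string and non-numeric entries
		}
		critical[xid] = true
	}
	return critical
}
```

Under these assumptions, `XID_CONFIG=79,74` would yield the critical set {48, 74, 79}, while an unset or empty value leaves only XID 48.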
Proposal
Add the following config section:

    version: v1
    health:
      disabled: false
      eventTypes: [EventTypeXidCriticalError, EventTypeDoubleBitEccError, EventTypeSingleBitEccError]
      ignoredXIDs: [13, 31, 43, 45, 68]
      criticalXIDs: all
GKE defaults:

    version: v1
    health:
      disabled: false
      eventTypes: [EventTypeXidCriticalError]
      ignoredXIDs: []
      criticalXIDs: [48]
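To illustrate how a plugin might evaluate the proposed `health` section, here is a Go sketch. The proposal does not define the type or the evaluation order, so everything here is an assumption: the `HealthConfig` struct, the `AllCritical` flag standing in for `criticalXIDs: all`, and the precedence `disabled`, then `ignoredXIDs`, then `criticalXIDs`.

```go
package main

// HealthConfig is a hypothetical in-memory form of the proposed "health" section.
type HealthConfig struct {
	Disabled     bool
	EventTypes   []string
	IgnoredXIDs  []uint64
	CriticalXIDs []uint64 // consulted only when AllCritical is false
	AllCritical  bool     // stands in for `criticalXIDs: all`
}

// IsCritical reports whether an XID event should mark the device unhealthy.
// Assumed precedence: disabled, then ignoredXIDs, then criticalXIDs.
func (c HealthConfig) IsCritical(xid uint64) bool {
	if c.Disabled {
		return false
	}
	for _, x := range c.IgnoredXIDs {
		if x == xid {
			return false
		}
	}
	if c.AllCritical {
		return true
	}
	for _, x := range c.CriticalXIDs {
		if x == xid {
			return true
		}
	}
	return false
}

// proposedExample mirrors the first config section in the proposal.
func proposedExample() HealthConfig {
	return HealthConfig{
		EventTypes:  []string{"EventTypeXidCriticalError", "EventTypeDoubleBitEccError", "EventTypeSingleBitEccError"},
		IgnoredXIDs: []uint64{13, 31, 43, 45, 68},
		AllCritical: true,
	}
}

// gkeDefaults mirrors the "GKE defaults" config section.
func gkeDefaults() HealthConfig {
	return HealthConfig{
		EventTypes:   []string{"EventTypeXidCriticalError"},
		IgnoredXIDs:  []uint64{},
		CriticalXIDs: []uint64{48},
	}
}
```

Under this reading, the first config treats every XID as critical except the ignored ones, whereas the GKE defaults treat only XID 48 as critical.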