
Add advanced health check configuration to config file

Open elezar opened this issue 5 months ago • 3 comments

Enhanced Error-handling config

Current State

See https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing

The NVIDIA GPU Device Plugin

We register for NVML Events of type nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError
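
As a rough illustration of how the three event types combine into a single registration mask, here is a minimal Go sketch. The bit values, the eventTypesByName map, and the eventMask helper are all hypothetical stand-ins defined locally so the example runs on its own; real code would use the constants exported by the NVML Go bindings.

```go
package main

import "fmt"

// Placeholder bit values for illustration only. Real code would use the
// constants exported by the NVML Go bindings instead of defining its own.
const (
	eventSingleBitEcc uint64 = 1 << 0
	eventDoubleBitEcc uint64 = 1 << 1
	eventXidCritical  uint64 = 1 << 3
)

// eventTypesByName maps the names used in this issue to the placeholder bits.
var eventTypesByName = map[string]uint64{
	"EventTypeSingleBitEccError": eventSingleBitEcc,
	"EventTypeDoubleBitEccError": eventDoubleBitEcc,
	"EventTypeXidCriticalError":  eventXidCritical,
}

// eventMask ORs together the bits for the named event types and rejects
// any name it does not recognise.
func eventMask(names []string) (uint64, error) {
	var mask uint64
	for _, name := range names {
		bit, ok := eventTypesByName[name]
		if !ok {
			return 0, fmt.Errorf("unknown event type %q", name)
		}
		mask |= bit
	}
	return mask, nil
}

func main() {
	mask, err := eventMask([]string{
		"EventTypeXidCriticalError",
		"EventTypeDoubleBitEccError",
		"EventTypeSingleBitEccError",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%#x\n", mask)
}
```

Rejecting unknown names, rather than skipping them, is a design choice made for this sketch so that a typo in a list of event types fails loudly.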

We treat the following XIDs as non-fatal errors:

XID   Description
13    Graphics Engine Exception
31    GPU memory page fault
43    GPU stopped processing
45    Preemptive cleanup, due to previous errors
68    Video processor exception
109   Context Switch Timeout Error

We allow additional XIDs to be ignored via the DP_DISABLE_HEALTHCHECKS envvar, using the following logic:

  • If the value is xids or all, we disable healthchecks entirely.
  • Otherwise, the value is treated as a comma-separated list of numeric XIDs to ignore, e.g. 109,68.
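
The envvar logic above can be sketched roughly as follows. This is a simplified illustration, not the plugin's actual code: the function name parseDisableHealthchecks is hypothetical, and lowercasing the value and silently skipping non-numeric entries are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDisableHealthchecks interprets a DP_DISABLE_HEALTHCHECKS-style value.
// It returns disableAll=true for "xids" or "all"; otherwise it returns the
// numeric XIDs found in the comma-separated list.
func parseDisableHealthchecks(value string) (disableAll bool, extraIgnored []uint64) {
	// Assumption: matching is case-insensitive and ignores surrounding spaces.
	value = strings.ToLower(strings.TrimSpace(value))
	if value == "all" || value == "xids" {
		return true, nil
	}
	for _, field := range strings.Split(value, ",") {
		xid, err := strconv.ParseUint(strings.TrimSpace(field), 10, 64)
		if err != nil {
			continue // Assumption: entries that are not numeric are skipped.
		}
		extraIgnored = append(extraIgnored, xid)
	}
	return false, extraIgnored
}

func main() {
	disableAll, ignored := parseDisableHealthchecks("109,68")
	fmt.Println(disableAll, ignored) // prints: false [109 68]
}
```

An empty value yields disableAll=false and no extra XIDs, so in this sketch an unset envvar leaves the default non-fatal list unchanged.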

The GKE Device Plugin

See https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/0509b1f9f4b9a357b44ba65e7b508ded8bd5ecf0/pkg/gpu/nvidia/health_check/health_checker.go#L41

By default the following error is checked:

XID   Description
48    Double-bit ECC Error

The XID_CONFIG envvar is used to specify a comma-separated list of additional XIDs to treat as critical.

Proposal

Add the following config section:

version: v1
health:
  disabled: false
  eventTypes: [EventTypeXidCriticalError, EventTypeDoubleBitEccError, EventTypeSingleBitEccError]
  ignoredXIDs: [13, 31, 43, 45, 68]
  criticalXIDs: all

GKE defaults:

version: v1
health:
  disabled: false
  eventTypes: [EventTypeXidCriticalError]
  ignoredXIDs: []
  criticalXIDs: [48]
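
To make the intended semantics of the two configurations concrete, here is a hedged Go sketch of how a health checker might evaluate an XID against them. The HealthConfig type, its field names, and the IsUnhealthy method are hypothetical. The AllCritical flag stands in for criticalXIDs: all, and giving ignoredXIDs precedence over criticalXIDs is an assumption that the proposal does not spell out. The eventTypes field and YAML parsing are left out to keep the example short.

```go
package main

import "fmt"

// HealthConfig is a hypothetical in-memory form of the proposed health section.
type HealthConfig struct {
	Disabled     bool
	IgnoredXIDs  []uint64
	CriticalXIDs []uint64
	AllCritical  bool // models `criticalXIDs: all`
}

// contains reports whether xid appears in list.
func contains(list []uint64, xid uint64) bool {
	for _, v := range list {
		if v == xid {
			return true
		}
	}
	return false
}

// IsUnhealthy reports whether an XID event should mark the device unhealthy.
// Assumption: an ignored XID wins over a critical one.
func (c HealthConfig) IsUnhealthy(xid uint64) bool {
	if c.Disabled || contains(c.IgnoredXIDs, xid) {
		return false
	}
	return c.AllCritical || contains(c.CriticalXIDs, xid)
}

func main() {
	// Values taken from the two config sections in the proposal.
	nvidia := HealthConfig{IgnoredXIDs: []uint64{13, 31, 43, 45, 68}, AllCritical: true}
	gke := HealthConfig{CriticalXIDs: []uint64{48}}

	fmt.Println(nvidia.IsUnhealthy(31), nvidia.IsUnhealthy(48)) // prints: false true
	fmt.Println(gke.IsUnhealthy(31), gke.IsUnhealthy(48))       // prints: false true
}
```

Under these assumptions the two plugins differ only in their defaults: one treats every XID as critical except an ignore list, the other treats nothing as critical except an explicit list.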

elezar avatar Aug 01 '25 10:08 elezar