Evan Lezar

Results 50 issues of Evan Lezar

### What is the problem you're trying to solve

At present, the `--gpus` flag implementation is out of date and constructs a call to the [`nvidia-container-cli` directly](https://github.com/containerd/containerd/blob/ba0a05aab81b67c28cc77a9b3364da5948998ad2/contrib/nvidia/nvidia.go#L90). This does not...

kind/feature
area/runtime

This change maps `ctr --gpus` requests to CDI device requests. This is done by mapping `--gpus ID` to a `nvidia.com/gpu=ID` device request. This removes the dependence on the `nvidia-container-cli` and...

size/L
area/client

This change switches to using CDI to handle the `--gpus` flag. This removes the custom implementation that invoked the `nvidia-container-cli` directly. That custom mechanism does not align with existing implementations. See...

When enumerating devices, a single device that has errors causes GFD to fail, skipping any remaining devices. This manifests as errors similar to:

```
E1105 17:45:39.017442 1 main.go:110] error...
```

Although the `cdi.featureFlags` were added as CLI arguments to the device plugin in #1495, these were not passed to the nvcdi library construction. This change ensures that these are properly...

cherry-pick/release-0.18

In cases where registering events for a device is not supported, we should not mark the device as unhealthy, but skip the device instead.

Before starting a more serious refactor as in #1508 it may be useful to:

1. Add some basic unit testing for health checking behaviour
2. Perform minor code reorganisation so...

The NVIDIA Container Toolkit v1.17.5 included support for an `enable-cuda-compat` hook. Since we may be running in a situation where a host-installed NVIDIA Container Toolkit is used, this hook was...

# Enhanced Error-handling config

## Current State

See https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing

### The NVIDIA GPU Device Plugin

We register for NVML Events of type `nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError`. We treat the...