Evan Lezar
@yizhouv5 the device plugin reacts to a subset of NVML events that are associated with critical Xid errors: https://github.com/NVIDIA/k8s-device-plugin/blob/7e6e3765be7414717b8a8e3972cd936cccc9384a/internal/rm/health.go#L94 We also have a filter that skips a list of errors...
> If I use nvidia-smi drain to manually disable one GPU, no events will appear in the nvidia-device-plugin log. The label nvidia.com/gpu.count value of the node is updated in the...
> As far as I can tell, NVIDIA's CDI-spec generator tool does not have any analogous settings. It is unclear to me whether they would want to carry such configurability...
@sgopinath1 in order to add support for AMD devices, I would recommend considering generating CDI specifications for these devices. This would have the advantage that they will be usable in...
Would the following simple mapping be acceptable: * The `--gpus=all` flag maps to `--device=vendor.com/gpu=all` * The `--gpus={{ .Count }}` flag maps to: `--device=vendor.com/gpu=0 --device=vendor.com/gpu=1 ... --device=vendor.com/gpu={{ .Count - 1 }}`...
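A minimal sketch of the mapping proposed above, for illustration only. The function name `gpus_flag_to_cdi_devices` and its `vendor` and `device_class` parameters are hypothetical; `vendor.com/gpu` is the placeholder kind from the comment, not a real vendor. It only covers the two cases quoted (`all` and an integer count).

```python
def gpus_flag_to_cdi_devices(gpus, vendor="vendor.com", device_class="gpu"):
    """Translate a --gpus value into the equivalent --device CDI arguments.

    Hypothetical helper: handles only "all" and a non-negative integer count.
    """
    kind = f"{vendor}/{device_class}"
    if gpus == "all":
        return [f"--device={kind}=all"]

    count = int(gpus)  # raises ValueError for anything that is not an integer
    if count < 0:
        raise ValueError(f"GPU count must be non-negative, got {count}")

    # A count of N selects the devices with indices 0 .. N-1.
    return [f"--device={kind}={index}" for index in range(count)]


if __name__ == "__main__":
    print(" ".join(gpus_flag_to_cdi_devices("all")))
    print(" ".join(gpus_flag_to_cdi_devices("3")))
```

Note that a count-based mapping like this assumes the CDI spec names devices by zero-based index, which depends on how the spec was generated.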
@yeahdongcn aliases were part of the original proposal but removed to simplify the API once we started actively developing this. It would definitely be worth including again. As a matter...
@yeahdongcn I was just thinking about this and realized that if you generate two specs with `nvidia-ctk cdi generate` (the final CLI as of the `v1.12.0` release) then both device...
Any CDI client (consumer) such as Podman, CRI-O, containerd, or the nvidia-container-runtime in CDI mode loads all spec files to determine which valid CDI devices exist. Any of these will...
@yeahdongcn I have just done a quick test myself, and the duplicate `all` devices in the two specs generated by the commands above will cause issues when injecting devices. This...
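To illustrate the duplicate-device problem described above, here is a rough sketch that scans already-parsed CDI specs for fully-qualified device names (`kind=name`) defined in more than one file. The function `find_duplicate_devices`, the file paths, and the sample spec contents are all hypothetical; real CDI clients apply their own loading and conflict rules, which this sketch does not reproduce.

```python
from collections import defaultdict


def find_duplicate_devices(specs):
    """Return fully-qualified device names that appear in more than one spec.

    `specs` maps a spec file path to its parsed content (a dict with the
    CDI fields `kind` and `devices`, where each device has a `name`).
    """
    defined_in = defaultdict(list)
    for path, spec in specs.items():
        kind = spec["kind"]
        for device in spec.get("devices", []):
            defined_in[f"{kind}={device['name']}"].append(path)
    return {name: paths for name, paths in defined_in.items() if len(paths) > 1}


if __name__ == "__main__":
    # Hypothetical specs: each defines its own indexed device plus `all`.
    specs = {
        "/etc/cdi/first.yaml": {
            "kind": "nvidia.com/gpu",
            "devices": [{"name": "0"}, {"name": "all"}],
        },
        "/etc/cdi/second.yaml": {
            "kind": "nvidia.com/gpu",
            "devices": [{"name": "1"}, {"name": "all"}],
        },
    }
    for name, paths in find_duplicate_devices(specs).items():
        print(f"{name} is defined in: {', '.join(paths)}")
```

A check like this could be run before handing specs to a runtime, so that a clashing `all` device is caught early rather than at injection time.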