Evan Lezar comments

Results: 419 comments by Evan Lezar

How to trigger gpu failure, the gpu count of node's allocatable field will be dynamically decrease

@yizhouv5 the device plugin reacts to a subset of NVML events that are associated with critical Xid errors: https://github.com/NVIDIA/k8s-device-plugin/blob/7e6e3765be7414717b8a8e3972cd936cccc9384a/internal/rm/health.go#L94 We also have a filter that skips a list of errors...

How to trigger gpu failure, the gpu count of node's allocatable field will be dynamically decrease

> If I use nvidia-smi drain to manually disable one GPU, no events will appear in the nvidia-device-plugin log. The label nvidia.com/gpu.count value of the node is updated in the...

re-implement `--gpus` flag using CDI (was "AMD GPU support")

> As far as I can tell, NVIDIA's CDI-spec generator tool does not have any analogous settings. It is unclear to me whether they would want to carry such configurability...

re-implement `--gpus` flag using CDI (was "AMD GPU support")

@sgopinath1 in order to add support for AMD devices, I would recommend considering generating CDI specifications for these devices. This would have the advantage that they will be usable in...

re-implement `--gpus` flag using CDI (was "AMD GPU support")

Would the following simple mapping be acceptable:

* The `--gpus=all` flag maps to `--device=vendor.com/gpu=all`
* The `--gpus={{ .Count }}` flag maps to: `--device=vendor.com/gpu=0 --device=vendor.com/gpu=1 ... --device=vendor.com/gpu={{ .Count - 1 }}`...
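The mapping proposed in this comment can be sketched in a few lines of Python. This is only an illustration of the two cases quoted above: the function name `gpus_flag_to_cdi_devices` is hypothetical, `vendor.com` is the placeholder vendor from the comment, and any cases in the truncated part of the comment are not covered.

```python
from __future__ import annotations


def gpus_flag_to_cdi_devices(gpus: str, vendor: str = "vendor.com") -> list[str]:
    """Translate a --gpus value into equivalent --device CDI arguments.

    Hypothetical helper: handles only "all" and a plain integer count,
    the two cases described in the comment.
    """
    kind = f"{vendor}/gpu"
    if gpus == "all":
        # --gpus=all maps to the single aggregate device named "all".
        return [f"--device={kind}=all"]
    count = int(gpus)  # raises ValueError if the value is not an integer
    if count < 0:
        raise ValueError("GPU count must be non-negative")
    # --gpus=N maps to devices indexed 0 .. N-1.
    return [f"--device={kind}={index}" for index in range(count)]


if __name__ == "__main__":
    print(" ".join(gpus_flag_to_cdi_devices("all")))
    print(" ".join(gpus_flag_to_cdi_devices("2")))
```

Index-based names keep the translation stateless, but they assume the CDI spec on the host actually defines devices named `0`, `1`, and so on.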

gpu: Support GPU passthrough to LXD containers using Container Device Interface (CDI)

/cc @elezar

[Feature Request] Add `Aliases` in Device spec

@yeahdongcn aliases were part of the original proposal but removed to simplify the API once we started actively developing this. It would definitely be worth including again. As a matter...

[Feature Request] Add `Aliases` in Device spec

@yeahdongcn I was just thinking about this and realized that if you generate two specs with `nvidia-ctk cdi generate` (the final CLI as of the `v1.12.0` release) then both device...

[Feature Request] Add `Aliases` in Device spec

Any CDI client (consumer) such as podman, crio, containerd, or the nvidia-container-runtime in CDI mode loads all spec files to determine what valid CDI devices exist. Any of these will...

[Feature Request] Add `Aliases` in Device spec

@yeahdongcn I have just done a quick test myself, and the duplicate `all` devices in the two specs generated by the commands above will cause issues when injecting devices. This...
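The duplicate-name situation described in the comments above can be illustrated with a small check over already-parsed spec data. This is a sketch under assumptions, not how any CDI client resolves conflicts: the function `find_duplicate_devices`, the file names, and the device names are hypothetical, and each spec is reduced to the `kind` and `devices[].name` fields of a CDI specification.

```python
from collections import defaultdict


def find_duplicate_devices(specs):
    """Return {qualified_name: [sources]} for names declared in more than one spec.

    `specs` maps a (hypothetical) spec file name to its parsed content,
    reduced here to the `kind` and `devices` fields.
    """
    seen = defaultdict(list)
    for source, spec in specs.items():
        for device in spec.get("devices", []):
            # A fully-qualified CDI device name has the form <kind>=<name>.
            seen[f"{spec['kind']}={device['name']}"].append(source)
    return {name: sources for name, sources in seen.items() if len(sources) > 1}


# Assumed example: two specs of the same kind that both define an "all" device.
specs = {
    "a.yaml": {"kind": "nvidia.com/gpu", "devices": [{"name": "0"}, {"name": "all"}]},
    "b.yaml": {"kind": "nvidia.com/gpu", "devices": [{"name": "1"}, {"name": "all"}]},
}

if __name__ == "__main__":
    for name, sources in find_duplicate_devices(specs).items():
        print(f"{name} is defined in: {', '.join(sources)}")
```

Running a check like this before injecting devices would surface a name such as `nvidia.com/gpu=all` that is declared by both specs.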