k8s-device-plugin
After updating the node label for config-manager, k8s-device-plugin keeps printing an error message and does not stop
If I use the nvidia.com/device-plugin.config node label to select a config, setting it to config0 and then, a few minutes later, to config1, k8s-device-plugin keeps printing the following message and does not stop:
health.go:142] Error waiting for event: ERROR_UNKNOWN; Marking all devices as unhealthy
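For reference, the label changes I make can be sketched roughly as below. This is only a minimal sketch of the reproduction steps: the node name `gpu-node-1` and the helper names are hypothetical, and the script only builds and prints the `kubectl label` commands rather than running them. The wait of a few minutes between the two label changes is not modelled.

```python
import shlex

# Node label key used to select a device-plugin config (taken from the report above).
LABEL_KEY = "nvidia.com/device-plugin.config"


def label_command(node, config):
    """Build the kubectl argv that sets the config label on a node.

    `node` is a hypothetical node name; `config` is a config name such as "config0".
    """
    return ["kubectl", "label", "node", node, f"{LABEL_KEY}={config}", "--overwrite"]


def repro_commands(node, configs=("config0", "config1")):
    """Return the label commands in the order they are applied in the report."""
    return [label_command(node, c) for c in configs]


if __name__ == "__main__":
    # Print the commands instead of executing them; run them by hand,
    # waiting a few minutes between the first and the second.
    for cmd in repro_commands("gpu-node-1"):
        print(shlex.join(cmd))
```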
- k8s-device-plugin version is v0.15.0-rc.2
- gpu driver is 535.129.03
- GPU Info Tesla P100-PCIE-16GB
I also found that GPU driver 470.129.06 does not support the set_default_device_pinned_mem_limit command. Is there a minimum GPU driver version required for the GPU memory limit? And is it possible to monitor the GPU utilization for each MPS client independently?
Also, if we update the k8s-device-plugin version, for example from 0.15.0 to 0.16.0.rc, while some CUDA processes are already running on the host machine and in Docker, then nvidia-cuda-mps-server does not start after the nvidia-cuda-mps-control container is restarted. When I then try to run a CUDA process, I get an error like the one below:
CUDA failure 806: unrecognized error code ; GPU=0
After I remove all MPS client processes and deploy a new MPS client pod, nvidia-cuda-mps-server starts successfully.
@elezar, could you help me understand what I need to do? Thanks.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.