
After updating the node label for config-manager, k8s-device-plugin keeps printing an error message and does not stop

Open aphrodite1028 opened this issue 10 months ago • 1 comments

I use the nvidia.com/device-plugin.config node label to select a config: I first set it to config0 and, a few minutes later, changed it to config1.
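A minimal sketch of the label switch described above, assuming the label is applied with `kubectl label` (the issue does not say how it was set). The node name `gpu-node-1` and the helper `build_label_command` are hypothetical; only the label key and the config names come from this report.

```python
import shlex

# Label key quoted in this issue.
LABEL_KEY = "nvidia.com/device-plugin.config"


def build_label_command(node_name, config_name):
    """Return the kubectl argv that sets the config label on one node."""
    return [
        "kubectl", "label", "node", node_name,
        f"{LABEL_KEY}={config_name}",
        "--overwrite",  # needed when the label already has a value
    ]


if __name__ == "__main__":
    # Hypothetical node name; config0 then config1, as in the report.
    for name in ("config0", "config1"):
        print(shlex.join(build_label_command("gpu-node-1", name)))
```

The script only prints the two commands, so they can be reviewed before being run against a cluster.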

After the label change, k8s-device-plugin keeps printing the following message and does not stop:

health.go:142] Error waiting for event: ERROR_UNKNOWN; Marking all devices as unhealthy
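To check whether the message really repeats rather than appearing once, the plugin log can be scanned for it. This is a rough sketch, assuming the log has already been saved as text lines (for example from `kubectl logs`); the function name `count_health_errors` is hypothetical, and the pattern is built only from the line quoted above.

```python
import re
from collections import Counter

# Matches the health-check message quoted in this issue and captures the error name.
HEALTH_ERROR = re.compile(
    r"health\.go:\d+\] Error waiting for event: (\w+); "
    r"Marking all devices as unhealthy"
)


def count_health_errors(lines):
    """Count occurrences of the health-check error, grouped by error name."""
    counts = Counter()
    for line in lines:
        match = HEALTH_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "health.go:142] Error waiting for event: ERROR_UNKNOWN; Marking all devices as unhealthy",
        "health.go:142] Error waiting for event: ERROR_UNKNOWN; Marking all devices as unhealthy",
    ]
    print(dict(count_health_errors(sample)))
```

A count that keeps growing between two log snapshots would confirm that the plugin is stuck in the error loop.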

  • k8s-device-plugin version is v0.15.0-rc.2
  • gpu driver is 535.129.03
  • GPU Info Tesla P100-PCIE-16GB

I also found that GPU driver 470.129.06 does not have the set_default_device_pinned_mem_limit command parameter. Is there a minimum GPU driver version required for the GPU memory limit? And is it possible to monitor the GPU utilization for each MPS client independently?

aphrodite1028 avatar Apr 22 '24 08:04 aphrodite1028

Also, if we update the k8s-device-plugin version, for example from 0.15.0 to 0.16.0.rc, while some CUDA processes are already running on the host machine and in Docker, then nvidia-cuda-mps-server does not start after the nvidia-cuda-mps-control container restarts. When I then try to run a CUDA process, I get an error like the one below:

CUDA failure 806: unrecognized error code ; GPU=0

After I remove all MPS client processes and deploy a new MPS client pod, nvidia-cuda-mps-server starts successfully.

@elezar, could you help me figure out what I need to do? Thanks.

aphrodite1028 avatar Apr 28 '24 10:04 aphrodite1028

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Jul 28 '24 04:07 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Aug 27 '24 04:08 github-actions[bot]