gpu-manager
Pods scheduled with gpu-manager fail to start and report "Segmentation fault (core dumped)". The CUDA version is 11.1.
Which GPU card are you using?
I had the same problem on a Tesla P4.
Maybe your application calls nvmlInit() to initialize the CUDA environment.
Try adding this code at the start of your Python entrypoint to find out which function causes the segfault (if your application is written in Python):
import faulthandler
faulthandler.enable()
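If the pod's stderr is hard to get at after the crash, a variant of the snippet above can write the traceback to a file instead. This is only a sketch: the helper name `enable_crash_traceback` and the log path are made up for illustration, and `faulthandler.enable()` with its `file` and `all_threads` parameters is the only real API used.

```python
import faulthandler
import sys


def enable_crash_traceback(log_path=None):
    """Enable faulthandler, dumping to log_path if given, else to stderr.

    Returns the opened log file (hypothetical helper, not part of any library).
    The file must stay open for as long as faulthandler is enabled.
    """
    if log_path is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return None
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traceback("/tmp/crash.log")` as the first statement of the entrypoint should leave the Python stack of the crashing call in that file, assuming the path is writable inside the container.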
If nvmlInit() turns out to be the cause, this code gets past it by calling nvmlInit_v2() directly:
from ctypes import CDLL
# Load the NVML shared library directly
nvml_h = CDLL("libnvidia-ml.so.1")
# Call the versioned init symbol instead of nvmlInit()
nvml_h.nvmlInit_v2()
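It may also help to check what the call returns, so a failed init is not silently ignored. A minimal sketch, assuming the library name from the snippet above; the wrapper `init_nvml` is a hypothetical helper, and `NVML_SUCCESS = 0` is the success value of `nvmlReturn_t` in nvml.h.

```python
from ctypes import CDLL

NVML_SUCCESS = 0  # nvmlReturn_t success value from nvml.h


def init_nvml(lib_name="libnvidia-ml.so.1"):
    """Load NVML and call nvmlInit_v2 directly (hypothetical helper).

    Returns the nvmlReturn_t code as an int, or None if the shared
    library cannot be loaded at all.
    """
    try:
        nvml = CDLL(lib_name)
    except OSError:
        # Library not found or not loadable in this container
        return None
    return nvml.nvmlInit_v2()
```

A result of `None` means the library is not visible inside the container, and any value other than `NVML_SUCCESS` means NVML loaded but refused to initialize. Both point at different problems than the segfault itself.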