aws-virtual-gpu-device-plugin
Pod keeps restarting when two containers share GPU
I am trying to run NVIDIA Triton containers for model inference. However, when more than one container is allocated to the same node, one of the containers either (1) fails to load the model onto the GPU or (2) keeps restarting.
Any suggestions on how this can be solved?