vcuda-controller
Problems caused by launching multiple pods at the same time
Why do I get an error when I start multiple GPU-resource pods simultaneously (concurrently) using vcuda?
I added an ferror check in vcuda's loader.c to print the errno-related error message, and it does report an error.
But when I start the pods sequentially, I don't have this problem. So I guess it may be caused by a gap between kubelet starting the container and gpu-manager placing the libcuda.so file.
cc @mYmNeo
It seems vcuda copies the library into the container after the container starts up. When multiple GPU-resource pods start simultaneously, this copy is not fast enough. As a workaround, you can try changing your command to:
sh -c "sleep 5 && your command"
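A fixed sleep only guesses how long the copy takes. A possible variation, sketched here under assumptions rather than taken from vcuda or gpu-manager, is to poll until the library file shows up and give up after a timeout. The function name wait_for_file and the 30-second default are hypothetical, and the check only confirms that the file exists and is non-empty, not that the copy has finished.

```shell
#!/bin/sh
# Hypothetical helper, not part of vcuda or gpu-manager.
# wait_for_file PATH [TIMEOUT_SECONDS]
# Polls once per second until PATH exists and is non-empty.
# Returns 0 if the file appeared, 1 if the timeout expired first.
wait_for_file() {
    path=$1
    timeout=${2:-30}   # assumed default, tune for your cluster
    elapsed=0
    while [ ! -s "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for $path" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

In an entrypoint script that defines this function, the call would look like `wait_for_file "$LIBCUDA_PATH" 30 && your command`, where LIBCUDA_PATH is a placeholder for wherever libcuda.so is expected inside your container.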
Oh, thanks rainfd. I knew about this solution, but it feels like a hack to me.
Which version of gpu-manager are you using? I've fixed a problem in the master branch but have not released an image yet.
@mYmNeo my version is v1.0.4. What is the commit?
https://github.com/tkestack/gpu-manager/pull/130