vcuda-controller Problems caused by launching multiple pods at the same time

Open rv64m opened this issue 3 years ago • 6 comments

Why do I get an error when I start multiple GPU-resource pods simultaneously (concurrently) using vcuda?

In vcuda's loader.c, I added ferror to print the errno-related error message, and the error shows up.

But when I start the pods sequentially, the problem does not occur. So I suspect a timing gap: kubelet starts the container before gpu-manager has finished placing the libcuda.so file.

Feb 28 '22 02:02 rv64m

cc @mYmNeo

Feb 28 '22 07:02 rv64m

It seems that vcuda copies the library into the container after the container has started. When multiple GPU-resource pods start simultaneously, this copy is not fast enough. You can try changing your command to: sh -c "sleep 5 && your command"
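
A fixed sleep only guesses how long the copy takes. A minimal sketch of an alternative, assuming the library eventually appears at a known path inside the container, is to poll for the file and start the workload only once it exists. The function name wait_for_file, the 30-second default timeout, and the placeholder path in the usage comment are illustrative assumptions, not part of vcuda or gpu-manager.

```shell
#!/bin/sh
# Hypothetical helper: block until a file exists or a timeout expires.
# Usage: wait_for_file PATH [TIMEOUT_SECONDS]
# Returns 0 once PATH exists, 1 if the timeout is reached first.
wait_for_file() {
    path=$1
    timeout=${2:-30}   # assumed default; tune for your cluster
    elapsed=0
    while [ ! -e "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for $path" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example (placeholder path; replace with wherever the library
# is actually placed in your container):
#   wait_for_file /path/to/libcuda.so 30 && exec your-command
```

Unlike sleep 5, this returns as soon as the file is present and fails loudly if it never appears. It is still a workaround on the container side and does not remove the underlying startup ordering problem.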

Mar 21 '22 12:03 rainfd

Thanks, rainfd. I know this workaround, but it feels like a hack.

Mar 23 '22 01:03 rv64m

Which version of gpu-manager are you using? I've fixed a problem on the master branch but have not released an image yet.

Mar 23 '22 12:03 mYmNeo

@mYmNeo My version is v1.0.4. Which commit has the fix?

Mar 24 '22 02:03 rainfd

https://github.com/tkestack/gpu-manager/pull/130

Mar 25 '22 02:03 mYmNeo
