vcuda-controller

untraceable GPU memory allocation

Open · zw0610 opened this issue 4 years ago · 13 comments

Describe the bug

When I was testing Triton Inference Server 19.10, GPU memory usage increased when the following two functions were called:

  1. cuCtxGetCurrent
  2. cuModuleGetFunction

It seems that when a CUDA module is loaded, some data is transferred into GPU memory without any of the calls described in the Memory Management section of the driver API.

Even though any subsequent cuMemAlloc call will be rejected once this untraceable GPU memory allocation has pushed usage past the user-configured limit, it still seems a flaw that actual GPU memory usage may exceed the limit.
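
A minimal driver-API sketch that makes the effect visible (the cubin path and kernel name below are placeholders for any real module; error checking is omitted for brevity):

```c
/* repro_sketch.c -- illustrative only; build with: gcc repro_sketch.c -lcuda
   "my_kernels.cubin" and "my_kernel" are placeholders for any real module/kernel. */
#include <cuda.h>
#include <stdio.h>

static size_t free_bytes(void) {
    size_t free_b = 0, total_b = 0;
    cuMemGetInfo(&free_b, &total_b);   /* device-wide free/total memory */
    return free_b;
}

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);         /* the context itself already consumes device memory */

    size_t before = free_bytes();

    cuModuleLoad(&mod, "my_kernels.cubin");
    cuModuleGetFunction(&fn, mod, "my_kernel");

    size_t after = free_bytes();

    /* Free memory drops even though no cuMemAlloc was ever issued: the module's
       code and static data were copied to the device during loading. */
    printf("untracked growth from module load: %zu bytes\n", before - after);

    cuCtxDestroy(ctx);
    return 0;
}
```

On an otherwise idle GPU the printed difference is the memory that no Memory Management hook ever sees; the footprint created by cuInit/cuCtxCreate itself is only visible from outside, e.g. in nvidia-smi.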

Environment
OS: Linux kube-node-zw 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

GPU Info: NVIDIA-SMI 440.64 Driver Version: 440.64 CUDA Version: 10.2

zw0610 avatar Apr 02 '20 08:04 zw0610

GPU memory stores not only data but also code and state. The NVIDIA Memory Management API only covers the data part. If you dig deeper, you'll find that GPU memory usage increases right after calling cuInit. We do have a solution for this scenario, but its benchmark results are not very good, so the code hasn't been submitted to this repo.
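
To illustrate the distinction, a limiter that only interposes the Memory Management calls can be pictured roughly as below. This is a simplified sketch, not the actual hijack code in this repo; gpu_mem_limit, tracked_usage and limited_mem_alloc are illustrative names only.

```c
/* Simplified sketch of a cuMemAlloc-side limiter, NOT the actual hijack code in this repo. */
#include <cuda.h>
#include <stdatomic.h>

static size_t gpu_mem_limit = 2ULL << 30;   /* e.g. a 2 GiB limit configured by the user */
static _Atomic size_t tracked_usage;        /* bytes handed out through this hook */

CUresult limited_mem_alloc(CUdeviceptr *dptr, size_t bytesize) {
    /* Only explicit allocations reach this check. Memory consumed implicitly by
       cuInit, context creation, or cuModuleLoad never updates tracked_usage, so
       real usage can exceed gpu_mem_limit while this check never fires. */
    if (atomic_load(&tracked_usage) + bytesize > gpu_mem_limit)
        return CUDA_ERROR_OUT_OF_MEMORY;

    CUresult rc = cuMemAlloc(dptr, bytesize);   /* forward to the real driver call */
    if (rc == CUDA_SUCCESS)
        atomic_fetch_add(&tracked_usage, bytesize);
    return rc;
}
```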

mYmNeo avatar Apr 03 '20 01:04 mYmNeo

@mYmNeo Thank you so much for addressing the question. I hope your solution overcomes the benchmark issue and becomes available to the community.

zw0610 avatar Apr 07 '20 09:04 zw0610

Hi, @mYmNeo, I also found this problem in my own small project: https://github.com/hyc3z/cuda-w-mem-watcher. I set the limit to 2147483648, which is exactly 2 GB. However, when I watch nvidia-smi on the real host, running the TensorFlow samples uses more than 2.5 GB before triggering the OOM caused by returning CUDA_ERROR_OUT_OF_MEMORY. I tried setting the limit to 1 GB, and there were still about 500 MB more. Then I tried disallowing any allocation through the memory driver API; after some initialization procedures, the process still consumed about 250 MB of memory before going down.
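
For comparison, a small NVML sketch (separate from the watcher linked above; the device index and array size are arbitrary) reports what the driver actually charges to each process, which includes the overhead a cuMemAlloc-only counter never sees:

```c
/* nvml_watch_sketch.c -- illustrative only; build with: gcc nvml_watch_sketch.c -lnvidia-ml */
#include <nvml.h>
#include <stdio.h>

int main(void) {
    nvmlDevice_t dev;
    nvmlProcessInfo_t procs[64];
    unsigned int count = 64;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);   /* device 0; adjust as needed */

    if (nvmlDeviceGetComputeRunningProcesses(dev, &count, procs) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < count; ++i) {
            /* usedGpuMemory includes context and module overhead -- the part
               that a counter fed only by cuMemAlloc hooks cannot account for. */
            printf("pid %u uses %llu bytes\n",
                   procs[i].pid, (unsigned long long)procs[i].usedGpuMemory);
        }
    }

    nvmlShutdown();
    return 0;
}
```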

hyc3z avatar May 20 '20 01:05 hyc3z

https://github.com/hyc3z/cuda-w-mem-watcher

Can you provide the driver APIs that your program used?

mYmNeo avatar May 20 '20 10:05 mYmNeo

@mYmNeo I'm using the tensorflow/tensorflow:latest-gpu-py3 Docker image, which comes with Python 3.6.9 and tensorflow-gpu 2.1.0. The test script I use is https://github.com/tensorflow/benchmarks

hyc3z avatar May 20 '20 12:05 hyc3z

@mYmNeo All my small project does is replace libcuda.so.1. Since TensorFlow uses dlopen to load libraries, setting the LD_PRELOAD environment variable to replace symbols doesn't work.
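
For illustration (this is not TensorFlow's actual loader code), the pattern looks roughly like this; because dlsym resolves against the opened handle, an LD_PRELOAD'ed definition is bypassed, while replacing the libcuda.so.1 file that dlopen finds still takes effect:

```c
/* dlopen_sketch.c -- illustrates the loading pattern, not TensorFlow's real loader.
   Build with: gcc dlopen_sketch.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*cuInit_fn)(unsigned int);

int main(void) {
    /* The framework opens the driver library by file name at runtime ... */
    void *handle = dlopen("libcuda.so.1", RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* ... and resolves symbols directly on that handle. dlsym(handle, ...) returns
       the definition from the opened file itself, so an LD_PRELOAD'ed cuInit is
       never consulted; only the libcuda.so.1 that dlopen actually finds matters. */
    cuInit_fn init = (cuInit_fn)dlsym(handle, "cuInit");
    printf("cuInit resolved to %p\n", (void *)init);

    dlclose(handle);
    return 0;
}
```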

hyc3z avatar May 20 '20 12:05 hyc3z

GPU memory stores not only data but also code and state. The NVIDIA Memory Management API only covers the data part. If you dig deeper, you'll find that GPU memory usage increases right after calling cuInit. We do have a solution for this scenario, but its benchmark results are not very good, so the code hasn't been submitted to this repo.

I came across a similar issue when I tried to run 2 trainer workers on the same vGPU (it only happens when the requested CUDA core share is not 100%). Is this problem caused by the same reason, and is there any plan to release, or even pre-release, that solution?

(pid=26034) create tf session
(pid=26030) create tf session
(pid=26034) /tmp/cuda-control/src/hijack_call.c:481 cuInit error no CUDA-capable device is detected
(pid=26034) *** Aborted at 1599119982 (unix time) try "date -d @1599119982" if you are using GNU date ***
(pid=26034) PC: @                0x0 (unknown)
(pid=26034) *** SIGABRT (@0x65b2) received by PID 26034 (TID 0x7f91c388d740) from PID 26034; stack trace: ***
(pid=26034)     @     0x7f91c346a8a0 (unknown)
(pid=26034)     @     0x7f91c30a5f47 gsignal
(pid=26034)     @     0x7f91c30a78b1 abort
(pid=26034)     @     0x7f91c1cb1441 google::LogMessage::Flush()
(pid=26034)     @     0x7f91c1cb1511 google::LogMessage::~LogMessage()
(pid=26034)     @     0x7f91c1c8ede9 ray::RayLog::~RayLog()
(pid=26034)     @     0x7f91c19f57c5 ray::CoreWorkerProcess::~CoreWorkerProcess()
(pid=26034)     @     0x7f91c19f581a std::unique_ptr<>::~unique_ptr()
(pid=26034)     @     0x7f91c30aa0f1 (unknown)
(pid=26034)     @     0x7f91c30aa1ea exit
(pid=26034)     @     0x7f7e0c2ff497 initialization
(pid=26034)     @     0x7f91c3467827 __pthread_once_slow
(pid=26034)     @     0x7f7e0c300e3b cuInit
(pid=26030) /tmp/cuda-control/src/hijack_call.c:481 cuInit error no CUDA-capable device is detected
(pid=26030) *** Aborted at 1599119982 (unix time) try "date -d @1599119982" if you are using GNU date ***
(pid=26030) PC: @                0x0 (unknown)
(pid=26030) *** SIGABRT (@0x65ae) received by PID 26030 (TID 0x7eff49098740) from PID 26030; stack trace: ***
(pid=26030)     @     0x7eff48c758a0 (unknown)
(pid=26030)     @     0x7eff488b0f47 gsignal
(pid=26030)     @     0x7eff488b28b1 abort
(pid=26030)     @     0x7eff474bc441 google::LogMessage::Flush()
(pid=26030)     @     0x7eff474bc511 google::LogMessage::~LogMessage()
(pid=26030)     @     0x7eff47499de9 ray::RayLog::~RayLog()
(pid=26034)     @     0x7f7e97d55da0 cuInit
(pid=26030)     @     0x7eff472007c5 ray::CoreWorkerProcess::~CoreWorkerProcess()
(pid=26034)     @     0x7f7e97c8f19f stream_executor::gpu::(anonymous namespace)::InternalInit()
(pid=26030)     @     0x7eff4720081a std::unique_ptr<>::~unique_ptr()
(pid=26030)     @     0x7eff488b50f1 (unknown)

nlnjnj avatar Sep 03 '20 09:09 nlnjnj

Your problem is not the memory allocation. The log shows that cuInit failed with "no CUDA-capable device is detected".

mYmNeo avatar Sep 04 '20 01:09 mYmNeo

Yes, but when I run only one trainer (one pid) or set the requested CUDA core share to 100%, this case runs normally, so I think this error log may not exactly describe the root cause.

nlnjnj avatar Sep 04 '20 01:09 nlnjnj

Did you try running 2 trainers on one single card? Did any error occur?

mYmNeo avatar Sep 04 '20 07:09 mYmNeo

Yes, the error shows that cuInit failed with "no CUDA-capable device is detected", and recently I have been hitting this error even when running only one trainer.

For more details you can contact me via WeChat: nlnjnj

nlnjnj avatar Sep 04 '20 07:09 nlnjnj

@nlnjnj I had similar errors before. However, I believe such an issue may be caused merely by hijacking the CUDA API. You might run a test that cuMemAllocs a small piece of data, keeping the program well below the memory limit. In my experience, the error would still occur.
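
A minimal sketch of such a test (the file name and build command are just examples; errors are printed via cuGetErrorString):

```c
/* alloc_test_sketch.c -- minimal version of the suggested test.
   Build with: gcc alloc_test_sketch.c -lcuda */
#include <cuda.h>
#include <stdio.h>

static void check(CUresult rc, const char *what) {
    if (rc != CUDA_SUCCESS) {
        const char *msg = NULL;
        cuGetErrorString(rc, &msg);
        fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown error");
    }
}

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr ptr;

    check(cuInit(0), "cuInit");                       /* fails first if no device is visible */
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");
    check(cuCtxCreate(&ctx, 0, dev), "cuCtxCreate");
    check(cuMemAlloc(&ptr, 1 << 20), "cuMemAlloc");   /* 1 MiB, far below any configured limit */
    check(cuMemFree(ptr), "cuMemFree");
    check(cuCtxDestroy(ctx), "cuCtxDestroy");
    return 0;
}
```

If cuInit already fails here, the failure is unrelated to the memory limit.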

zw0610 avatar Sep 04 '20 07:09 zw0610

@mYmNeo All my small project does is replace libcuda.so.1. Since TensorFlow uses dlopen to load libraries, setting the LD_PRELOAD environment variable to replace symbols doesn't work.

Hi, did you manage to replace libcuda.so.1 for TensorFlow successfully? If so, can you share how? Thanks!

Huoyuan100861 avatar Dec 06 '21 08:12 Huoyuan100861