deepdetect
RAM issue: CUDA 10.2 vs CUDA 10.1
Configuration
The following issue has been observed using a 1080, a 1080 Ti, and a Titan X.
- Version of DeepDetect:
  - [ ] Locally compiled on:
    - [ ] Ubuntu 18.04 LTS
    - [ ] Other:
  - [ ] Docker CPU
  - [X] Docker GPU
  - [ ] Amazon AMI
- Commit (shown by the server when starting): 4ce9277db88a6b9565353ca830410948298947ec
Your question / the problem you're facing:
Since using a DD version built with CUDA 10.2, I have noticed a large increase in memory usage. Compiling the same DD version with CUDA 10.2 and with CUDA 10.1 shows roughly a 3x higher RAM usage for a GoogleNet under CUDA 10.2.
Error message (if any) / steps to reproduce the problem:
The following script builds DD with CUDA 10.2 and with CUDA 10.1, using the latest commit.
git clone https://github.com/jolibrain/deepdetect.git
cd deepdetect/docker
# Replace all cuda 10.2 references with cuda 10.1 in a new Dockerfile
cp gpu.Dockerfile gpu_10.1.Dockerfile
sed -i 's/cuda:10.2/cuda:10.1/' gpu_10.1.Dockerfile
# Go back to deepdetect folder
cd ..
# Build 10.2
DOCKER_BUILDKIT=1 docker build -t jolibrain/deepdetect_gpu --no-cache -f docker/gpu.Dockerfile .
# Build 10.1
DOCKER_BUILDKIT=1 docker build -t jolibrain/deepdetect_gpu_10_1 --no-cache -f docker/gpu_10.1.Dockerfile .
Once the two images are available, we run them (one at a time) on an instance where nothing else is using the GPUs. Here is what we observe with CUDA 10.2:
docker run --runtime=nvidia -d -p 8080:8080 jolibrain/deepdetect_gpu
We then create a service with googlenet:
curl -X PUT "http://localhost:8080/services/imageserv" -d "{\"mllib\":\"caffe\",\"description\":\"image classification service\",\"type\":\"supervised\",\"parameters\":{\"input\":{\"connector\":\"image\"},\"mllib\":{\"nclasses\":1000}},\"model\":{\"repository\":\"/opt/models/ggnet/\"}}"
And observe 2804MiB used for this model.
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 35C P8 17W / 250W | 2804MiB / 11177MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 16819 C ./dede 2804MiB |
+-----------------------------------------------------------------------------+
Now, if we launch a prediction:
curl -X POST "http://localhost:8080/predict" -d "{\"service\":\"imageserv\",\"parameters\":{\"input\":{\"width\":224,\"height\":224},\"output\":{\"best\":3},\"mllib\":{\"gpu\":true}},\"data\":[\"http://i.ytimg.com/vi/0vxOhd4qlnA/maxresdefault.jpg\"]}"
We observe an increase of about 1GiB.
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 24% 39C P8 18W / 250W | 2828MiB / 11177MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 16819 C ./dede 3944MiB |
+-----------------------------------------------------------------------------+
Now we do the same exercise using the CUDA 10.1 build.
docker run --runtime=nvidia -d -p 8080:8080 jolibrain/deepdetect_gpu_10_1
We then create a service with googlenet:
curl -X PUT "http://localhost:8080/services/imageserv" -d "{\"mllib\":\"caffe\",\"description\":\"image classification service\",\"type\":\"supervised\",\"parameters\":{\"input\":{\"connector\":\"image\"},\"mllib\":{\"nclasses\":1000}},\"model\":{\"repository\":\"/opt/models/ggnet/\"}}"
And observe 932MiB used for this model, about 3x less RAM than with CUDA 10.2.
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 26% 43C P8 18W / 250W | 942MiB / 11177MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 25881 C ./dede 932MiB |
+-----------------------------------------------------------------------------+
Now, if we launch a prediction:
curl -X POST "http://localhost:8080/predict" -d "{\"service\":\"imageserv\",\"parameters\":{\"input\":{\"width\":224,\"height\":224},\"output\":{\"best\":3},\"mllib\":{\"gpu\":true}},\"data\":[\"http://i.ytimg.com/vi/0vxOhd4qlnA/maxresdefault.jpg\"]}"
We observe an increase of about 1GiB (so, in the end, about 2x less RAM than with CUDA 10.2).
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 24% 39C P8 18W / 250W | 2066MiB / 11177MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 16819 C ./dede 2066MiB |
+-----------------------------------------------------------------------------+
We observe here that using CUDA 10.2 instead of CUDA 10.1 significantly raises the RAM footprint of a simple model such as GoogleNet.
Moreover, the extra ~1GiB allocated when we make a prediction is also pretty weird. Playing a bit with the requests made me realise that this is due to the flag {"gpu": true} in the POST request. If I do not use it, there is no such increase.
curl -X POST "http://localhost:8080/predict" -d "{\"service\":\"imageserv\",\"parameters\":{\"input\":{\"width\":224,\"height\":224},\"output\":{\"best\":3}},\"data\":[\"http://i.ytimg.com/vi/0vxOhd4qlnA/maxresdefault.jpg\"]}"
We do not observe any increase.
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 40C P8 18W / 250W | 942MiB / 11177MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 20164 C ./dede 932MiB |
+-----------------------------------------------------------------------------+
Note that I observed the same behaviour with CUDA 10.2.
In summary, I observed two issues:
- the memory footprint of a model with CUDA 10.2 vs CUDA 10.1
- the flag "gpu":true in the POST request allocates ~1GiB more RAM
Hello, the first thing is to study whether this might be due to cuDNN. You need to try two things to find out:
- create your services with "engine":"CUDNN_MIN_MEMORY" and let us know about the memory trace
- build the Docker images with -DUSE_CUDNN=OFF
Also, FYI, bear in mind that what is reported is memory allocated internally by CUDA/cuDNN, not memory that is no longer available. Typically cuDNN does not fully deallocate its handles, but the memory remains available somehow.
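For the memory trace, a simple way to record it while creating the service and then predicting is to poll nvidia-smi with its standard query options, e.g.:
# Log total GPU memory usage once per second
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1 > memory_trace.csv
# Or log per-process usage (dede should show up here)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv -l 1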
Using "engine":"CUDNN_MIN_MEMORY"
curl -X PUT "http://localhost:8080/services/imageserv" -d "{\"mllib\":\"caffe\",\"description\":\"image classification service\",\"type\":\"supervised\",\"parameters\":{\"input\":{\"connector\":\"image\"},\"mllib\":{\"nclasses\":1000, \"engine\":\"CUDNN_MIN_MEMORY\"}},\"model\":{\"repository\":\"/opt/models/ggnet/\"}}"
I observe:
- 598MiB usage vs 942MiB with CUDA 10.1
- 1548MiB usage vs 2804MiB with CUDA 10.2
The 3x factor still persists.
I'll try with -DUSE_CUDNN=OFF, but it will take some time to build the images...
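The plan is something along these lines, mirroring the script above (this assumes the cmake invocation in gpu.Dockerfile sets -DUSE_CUDNN=ON; if the flag is not present, it has to be appended to the cmake line instead):
# From the deepdetect folder, copy the Dockerfile and disable cuDNN in the cmake flags
cp docker/gpu.Dockerfile docker/gpu_no_cudnn.Dockerfile
# assumes the Dockerfile contains -DUSE_CUDNN=ON; adjust the sed if it does not
sed -i 's/-DUSE_CUDNN=ON/-DUSE_CUDNN=OFF/' docker/gpu_no_cudnn.Dockerfile
DOCKER_BUILDKIT=1 docker build -t jolibrain/deepdetect_gpu_no_cudnn --no-cache -f docker/gpu_no_cudnn.Dockerfile .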
Note that I've given you a script to replicate the issue; it would be nice to know if you observe the same thing on your side.
Thanks. Look at the cuDNN versions that come with each CUDA flavour (i.e. from the original NVIDIA Docker image).
I've built new images without cuDNN. Here is what I observe:
- 245MiB usage vs 942MiB with CUDA 10.1
- 269MiB usage vs 2804MiB with CUDA 10.2
It seems that the gap is much smaller now.
The cuDNN version is the same for both images (cuda10_1 and cuda10_2): 7.6.5
dd@2dfd85f33243:/opt/deepdetect/build/main$ cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
Looks like a cuDNN/CUDA internal thing to me. I'd suggest you try the tensorrt backend instead; there's no reason not to use it, actually.
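For example, the service creation above could be switched to the TensorRT backend roughly as follows (the fixed input width/height and the maxBatchSize parameter are illustrative and should be checked against the API documentation):
curl -X PUT "http://localhost:8080/services/imageserv" -d "{\"mllib\":\"tensorrt\",\"description\":\"image classification service\",\"type\":\"supervised\",\"parameters\":{\"input\":{\"connector\":\"image\",\"width\":224,\"height\":224},\"mllib\":{\"nclasses\":1000,\"maxBatchSize\":1}},\"model\":{\"repository\":\"/opt/models/ggnet/\"}}"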
Hmm, I am not sure it will change anything, as TensorRT uses CUDA. TensorRT 7 requires CUDA 10.2, and I may have seen an increase in memory usage there as well (to be verified); I'll double-check to be sure. TensorRT does not support all the architectures yet (and I observed an important difference in predictions between Caffe and TensorRT for a RefineDet model; an issue will be raised soon), and I have some dependencies that require Caffe before upgrading to TensorRT.
Note that the issue related to the flag gpu: true (in the POST request it allocates ~1GiB more RAM) still remains for both versions of CUDA.
You can use the following script to create two images of DD with TensorRT as backend, one with CUDA 10.1 and the other with CUDA 10.2.
# Get a specific commit of the project
git clone https://github.com/jolibrain/deepdetect.git
cd deepdetect/docker
git checkout ec2f5e27470371d995e640d1f6c2722d08415051
# Transform all cuda 10.2 to cuda 10.1 in a new Dockerfile
cp gpu_tensorrt.Dockerfile gpu_10_1_tensorrt.Dockerfile
sed -i 's/tensorrt:20.03/tensorrt:19.10/' gpu_10_1_tensorrt.Dockerfile
sed -i 's/\# CMake/\#CMake\nRUN rm \/usr\/local\/bin\/cmake/' gpu_10_1_tensorrt.Dockerfile
sed -i 's/ARG DEEPDETECT_BUILD=default/ARG DEEPDETECT_BUILD=tensorrt/' gpu_tensorrt.Dockerfile
sed -i 's/ARG DEEPDETECT_BUILD=default/ARG DEEPDETECT_BUILD=tensorrt/' gpu_10_1_tensorrt.Dockerfile
# Go back to deepdetect folder
cd ..
# Build 10.2
DOCKER_BUILDKIT=1 docker build -t jolibrain/deepdetect_gpu_tensorrt --no-cache -f docker/gpu_tensorrt.Dockerfile .
# Build 10.1
DOCKER_BUILDKIT=1 docker build -t jolibrain/deepdetect_gpu_10_1_tensorrt --no-cache -f docker/gpu_10_1_tensorrt.Dockerfile .
We observe a 10% increase in memory usage for CUDA 10.2, and as explained before, I still need to use Caffe for some specific models not yet supported by TensorRT.
Did you observe the same thing on your side?
It might also come from Caffe not correctly supporting CUDA 10.2; I am just throwing out ideas here...
You can look for yourself, it's basic cuDNN calls: https://github.com/jolibrain/caffe/blob/master/src/caffe/layers/cudnn_conv_layer.cu and we recently updated for cuDNN 8: https://github.com/jolibrain/caffe/pull/75
If you doubt the implementation, you'll see that ours (e.g. with cuDNN 8) is similar to OpenCV's: https://github.com/opencv/opencv/pull/17685 and https://github.com/opencv/opencv/issues/17496
If you'd like to dig further, you can find NVIDIA's unhelpful answer to our findings that cuDNN doesn't free its handles: https://forums.developer.nvidia.com/t/cudnn-create-handle-t-usage-and-memory-reuse/111257
The memory issue is even worse with CUDA 11.1, see https://github.com/jolibrain/caffe/pull/78
You may want to try your Docker build with CUDA 11.1 + cuDNN 8, since in any case this is the present/future.
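Assuming the base image tag in gpu.Dockerfile is of the usual nvidia/cuda:10.2-cudnn7-devel-... form, a CUDA 11.1 variant could be produced the same way as your earlier sed-based copies, e.g.:
cp docker/gpu.Dockerfile docker/gpu_11_1.Dockerfile
# assumes the FROM line uses a cuda:10.2-cudnn7 tag; adjust to the actual tag in the file
sed -i 's/cuda:10.2-cudnn7/cuda:11.1-cudnn8/' docker/gpu_11_1.Dockerfile
DOCKER_BUILDKIT=1 docker build -t jolibrain/deepdetect_gpu_11_1 --no-cache -f docker/gpu_11_1.Dockerfile .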
Regarding tensorrt, unless you are using LSTM layers, all training layers supported in DD are supported at the moment, AFAIK.
Thanks for all this information. It seems weird, as I did not observe this kind of memory usage with other DL frameworks; I will have a deeper look. I never tried training with your TensorRT backend, as it seems fairly new and we experienced some bugs with it on the inference side; moreover, according to the documentation, it only supports the image connector.
Do you have some documentation about what happens when a model is loaded at service creation (as this is where the memory allocation is tripled)? In my "basic" understanding, I thought that creating a service would allocate all the memory necessary to run a prediction with a certain batch_size, but it does not seem to be the case: when we launch a prediction, an additional memory allocation is made for that specific prediction, and sometimes it even increases after a few requests without an obvious reason.
FYI, I am not criticizing the heavy work you have done and continue to do; I am very impressed and I encourage you to continue. On my side, I am just trying to find a solution to my issue, and I find it very awkward that the memory needed to load a model varies that much from one version of CUDA to another.
We don't have more information than what's in the code and the cuDNN doc. The cuDNN memory issues are everywhere if you look for them, and your tests clearly indicate that this is a cuDNN issue with something underlying in CUDA. cuDNN preallocates and then dynamically allocates depending on the underlying algorithms (FFT, Winograd, ...); you can read about that in the cuDNN doc and elsewhere. It also preallocates based on batch size. Typically you can try adding more images to a single predict call and see how the DRAM allocation varies (in my tests, predicting on three different URLs as a batch does not move the DRAM allocation, so my guess is that what you are witnessing is "smart" preallocation along the batch-size tensor dimension). You may want to check the speed difference with / without cuDNN and decide whether the speed gain is worth the memory usage.
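For instance, the earlier predict call can be turned into a batch simply by passing several URLs in the data array (the second and third URLs below are placeholders):
curl -X POST "http://localhost:8080/predict" -d "{\"service\":\"imageserv\",\"parameters\":{\"input\":{\"width\":224,\"height\":224},\"output\":{\"best\":3},\"mllib\":{\"gpu\":true}},\"data\":[\"http://i.ytimg.com/vi/0vxOhd4qlnA/maxresdefault.jpg\",\"http://example.com/image2.jpg\",\"http://example.com/image3.jpg\"]}"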
It seems that cuDNN 8.0.0 has fixed some of these problems; see the fixes section of https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel-800-Preview