imagenet-multiGPU.torch
Memory usage of GPU0 is doubled when using multi-GPU training
Hi,
I am new to Torch and tried to train a VGG model using the model file models/vgg_cudnn.lua. I used 4 K20s and found that the memory usage of GPU0 is about double (4199MiB) compared with the others (2495MiB each). And, as the comment in models/vgg_cudnn.lua warns, I did run out of memory with VGG-D.
Can you please give me any advice on this?
Thank you very much in advance
You can try substituting https://github.com/soumith/imagenet-multiGPU.torch/blob/master/models/vgg_cudnn.lua#L39 and https://github.com/soumith/imagenet-multiGPU.torch/blob/master/models/vgg_cudnn.lua#L42 with the in-place ReLU, nn.ReLU(true). As those are large matrices, it might help. Also update to the latest cudnn.torch bindings.
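Roughly, the substitution would look like this (the surrounding classifier code is paraphrased rather than copied from the repo; only the ReLU calls matter here):

-- before: out-of-place ReLU allocates a second output tensor the size of its input
classifier:add(nn.Linear(25088, 4096))
classifier:add(nn.ReLU())

-- after: in-place ReLU overwrites its input and saves that extra buffer
classifier:add(nn.Linear(25088, 4096))
classifier:add(nn.ReLU(true))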
Hey, thank you for your quick reply! I tried replacing those two lines and updated cudnn, but this time I get the following result:
+------------------------------------------------------+
| NVIDIA-SMI 340.65 Driver Version: 340.65 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20m Off | 0000:02:00.0 Off | 0 |
| N/A 47C P0 149W / 225W | 4438MiB / 4799MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20m Off | 0000:03:00.0 Off | 0 |
| N/A 47C P0 152W / 225W | 2742MiB / 4799MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K20m Off | 0000:83:00.0 Off | 0 |
| N/A 50C P0 151W / 225W | 2742MiB / 4799MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K20m Off | 0000:84:00.0 Off | 0 |
| N/A 46C P0 147W / 225W | 2742MiB / 4799MiB | 86% Default |
+-------------------------------+----------------------+----------------------+
And I hit the out-of-memory problem at the end of the first epoch, with the following log:
Epoch: [1][9995/10000] Time 1.319 Err 6.9073 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][9996/10000] Time 1.327 Err 6.9092 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][9997/10000] Time 1.328 Err 6.9095 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][9998/10000] Time 1.322 Err 6.9062 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][9999/10000] Time 1.329 Err 6.9077 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][10000/10000] Time 1.328 Err 6.9071 Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][TRAINING SUMMARY] Total Time(s): 13325.76 average loss (per batch): 6.91 accuracy(%): top-1 0.10
==> doing epoch on validation data:
==> online epoch # 1
/home/archy/torch/bin/luajit: ...e/archy/torch/share/lua/5.1/threads/threads.lua:255:
[thread 19 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: ...n/archy/torch/share/lua/5.1/cudnn/SpatialConvolution.lua:97: /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCStorage.cu(30) : cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCGeneral.c:241
[thread 15 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: ...magenet-multiGPU.torch/fbcunn_files/AbstractParallel.lua:97: out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCTensorCopy.cu:93
[thread 17 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCStorage.cu(30) : cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCGeneral.c:241
[thread 4 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: ...magenet-multiGPU.torch/fbcunn_files/AbstractParallel.lua:97: out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCTensorCopy.cu:93
[thread 16 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCStorage.cu(30) : cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCGeneral.c:241
[thread 8 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: ...magenet-multiGPU.torch/fbcunn_files/AbstractParallel.lua:97: out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCTensorCopy.cu:93
[thread 18 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCStorage.cu(30) : cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCGeneral.c:241
Thank you very much!
Did you read the error at all? What does it say?
VGG-D is probably not going to fit on a K20.
Hi, sorry for the confusion. Let me explain more clearly.
I trained VGG-A with in-place ReLU and the latest cudnn.torch bindings, and here is the nvidia-smi result:
Sun Jun 21 22:42:52 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.65 Driver Version: 340.65 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20m Off | 0000:02:00.0 Off | 0 |
| N/A 46C P0 153W / 225W | 4438MiB / 4799MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20m Off | 0000:03:00.0 Off | 0 |
| N/A 46C P0 147W / 225W | 2742MiB / 4799MiB | 51% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K20m Off | 0000:83:00.0 Off | 0 |
| N/A 49C P0 155W / 225W | 2742MiB / 4799MiB | 54% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K20m Off | 0000:84:00.0 Off | 0 |
| N/A 45C P0 146W / 225W | 2742MiB / 4799MiB | 56% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 13440 /home/archy/torch/bin/luajit 4422MiB |
| 1 13440 /home/archy/torch/bin/luajit 2726MiB |
| 2 13440 /home/archy/torch/bin/luajit 2726MiB |
| 3 13440 /home/archy/torch/bin/luajit 2726MiB |
+-----------------------------------------------------------------------------+
As you can see, the memory usage of GPU0 is much higher than the other GPUs (4438MiB vs 2742MiB). I also tried training AlexNet with 2 GPUs, with the following log:
Sun Jun 21 22:54:24 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.65 Driver Version: 340.65 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20m Off | 0000:02:00.0 Off | 0 |
| N/A 41C P0 121W / 225W | 1800MiB / 4799MiB | 91% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20m Off | 0000:03:00.0 Off | 0 |
| N/A 41C P0 127W / 225W | 932MiB / 4799MiB | 89% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K20m Off | 0000:83:00.0 Off | 0 |
| N/A 40C P0 48W / 225W | 394MiB / 4799MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K20m Off | 0000:84:00.0 Off | 0 |
| N/A 36C P0 46W / 225W | 394MiB / 4799MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 3345 /home/archy/torch/bin/luajit 1784MiB |
| 1 3345 /home/archy/torch/bin/luajit 916MiB |
| 2 3345 /home/archy/torch/bin/luajit 378MiB |
| 3 3345 /home/archy/torch/bin/luajit 378MiB |
+-----------------------------------------------------------------------------+
GPU0 (1784MiB) uses about double the memory of GPU1 (916MiB). I thought GPU0 might be the cause of the out-of-memory problem. Thank you very much!
Hi, could you try reducing the batch size, using the command-line option -batchSize 32, or even -batchSize 16? See if that helps.
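For reference, the invocation would then look something like this (the data path is a placeholder; keep whatever netType and other flags you are already passing):
th main.lua -data <imagenet-folder> -nGPU 4 -batchSize 16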
Hi @soumith, I am working together with @ffmpbgrnn. I think @ffmpbgrnn's question is simply this:
why does GPU 0 use double the GPU memory of GPU 1 (in the two-GPU case)?
We can set the out-of-memory problem aside for now; the current batch size fits the GPUs well, and the point is the difference between GPU 0 and the other GPUs.
Another odd thing is that when -nGPU 2 is set, GPU 2 and GPU 3 still show a memory occupation of about 400MB.
Hey @zhongwen, this is because when we synchronize the weights across GPUs, we provide dedicated buffers on GPU-1 for all the weights to accumulate into (so that the weight transfers from GPU{2,3,4} over to GPU-1 are done in parallel and are non-blocking). That's why you see GPU-1 having more memory usage.
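To make that concrete, here is a small illustrative cutorch sketch (a toy example, not the repo's actual AbstractParallel code; the tensor size and variable names are made up) of why the first GPU ends up holding one extra parameter-sized buffer per remote GPU:

require 'cutorch'

local nGPU = 4
local nParams = 32 * 1024 * 1024                 -- pretend the shared weights are ~128MB of floats

cutorch.setDevice(1)
local masterParams = torch.CudaTensor(nParams)   -- master copy of the weights lives on GPU 1

-- one dedicated landing buffer per remote GPU, all allocated on GPU 1,
-- so the device-to-device transfers can overlap instead of serializing
local landingBuffers = {}
for gpu = 2, nGPU do
   landingBuffers[gpu] = torch.CudaTensor(nParams)
end

-- each remote GPU copies its gradients into its own buffer on GPU 1 ...
for gpu = 2, nGPU do
   cutorch.setDevice(gpu)
   local remoteGrads = torch.CudaTensor(nParams):fill(1)
   landingBuffers[gpu]:copy(remoteGrads)          -- cross-device copy into GPU 1's buffer
end

-- ... and GPU 1 reduces them into the master copy
cutorch.setDevice(1)
for gpu = 2, nGPU do
   masterParams:add(landingBuffers[gpu])
end
-- GPU 1 now holds (nGPU - 1) extra parameter-sized buffers that GPUs 2..4 do not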
To my understanding, the total size of the model is about 500MB, so why do the buffers occupy a lot more than this?
Hey @zhongwen, you also have to take into account the Linear layers on GPU0. They run only on GPU0 and not on GPU1; the DataParallel wrapping covers only the convolution layers.
Does that make sense?
Essentially, you have roughly 400-500MB of overhead coming from the multi-GPU buffers (probably less than that, I would say), plus the overhead from the fully-connected layers and the SoftMax.
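For a sense of scale, this is roughly what the classifier part looks like (layer sizes follow the standard VGG classifier and are illustrative; they may not match vgg_cudnn.lua line for line). Everything below is built once and lands on GPU 1 only, on top of the synchronization buffers described above:

require 'nn'
require 'cunn'

local nClasses = 1000

-- only the convolutional feature stack is wrapped in a data-parallel
-- container in this repo; the classifier below runs on a single GPU
local classifier = nn.Sequential()
classifier:add(nn.View(512*7*7))
classifier:add(nn.Linear(512*7*7, 4096))   -- ~25088x4096 weights: roughly 400MB in float
classifier:add(nn.ReLU(true))
classifier:add(nn.Dropout(0.5))
classifier:add(nn.Linear(4096, 4096))
classifier:add(nn.ReLU(true))
classifier:add(nn.Dropout(0.5))
classifier:add(nn.Linear(4096, nClasses))
classifier:add(nn.LogSoftMax())
classifier:cuda()                          -- all of this is allocated on the current GPU (GPU 1)
-- weights plus gradient buffers for the first Linear layer alone come to
-- roughly 800MB, none of it shared with the other GPUs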