imagenet-multiGPU.torch

memory usage of gpu0 is doubled when use Multi-GPU training

ffmpbgrnn opened this issue on Jun 21 '15 (10 comments)

Hi, I am new to Torch and tried to train a VGG model using the model file models/vgg_cudnn.lua. I used 4 K20s and found that the memory usage of GPU0 is roughly double (4199MiB) that of the others (2495MiB each). And, as the comment in models/vgg_cudnn.lua warns, I did run out of memory when using VGG-D. Can you please give me any advice on this? Thank you very much in advance.

ffmpbgrnn commented Jun 21 '15 11:06

You can try to substitute https://github.com/soumith/imagenet-multiGPU.torch/blob/master/models/vgg_cudnn.lua#L39 and https://github.com/soumith/imagenet-multiGPU.torch/blob/master/models/vgg_cudnn.lua#L42 with the in-place ReLU, nn.ReLU(true). As those are large matrices, it might help. Also update to the latest cudnn.torch bindings.
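
For concreteness, a minimal sketch of the substitution (the variable name classifier and the surrounding code are assumptions, not copied from the repo; only the nn.ReLU(true) in-place flag is the actual suggestion):

-- before: each nn.ReLU() allocates a separate output tensor of its own
classifier:add(nn.ReLU())

-- after: the in-place variant overwrites its input, saving one
-- activation-sized buffer per layer
classifier:add(nn.ReLU(true))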

szagoruyko commented Jun 21 '15 11:06

Hey, thank you for your quick reply! I tried replacing those two lines and updating cudnn, but this time I get the following result:

+------------------------------------------------------+
| NVIDIA-SMI 340.65     Driver Version: 340.65         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          Off  | 0000:02:00.0     Off |                    0 |
| N/A   47C    P0   149W / 225W |   4438MiB /  4799MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          Off  | 0000:03:00.0     Off |                    0 |
| N/A   47C    P0   152W / 225W |   2742MiB /  4799MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m          Off  | 0000:83:00.0     Off |                    0 |
| N/A   50C    P0   151W / 225W |   2742MiB /  4799MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m          Off  | 0000:84:00.0     Off |                    0 |
| N/A   46C    P0   147W / 225W |   2742MiB /  4799MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+

And I ran into the out-of-memory problem at the end of the first epoch, with the following log:

Epoch: [1][9995/10000]  Time 1.319 Err 6.9073 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][9996/10000]  Time 1.327 Err 6.9092 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][9997/10000]  Time 1.328 Err 6.9095 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][9998/10000]  Time 1.322 Err 6.9062 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][9999/10000]  Time 1.329 Err 6.9077 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][10000/10000] Time 1.328 Err 6.9071 Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][TRAINING SUMMARY] Total Time(s): 13325.76    average loss (per batch): 6.91   accuracy(%):    top-1 0.10


==> doing epoch on validation data:
==> online epoch # 1
/home/archy/torch/bin/luajit: ...e/archy/torch/share/lua/5.1/threads/threads.lua:255:
[thread 19 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: ...n/archy/torch/share/lua/5.1/cudnn/SpatialConvolution.lua:97: /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCStorage.cu(30) : cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCGeneral.c:241
[thread 15 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: ...magenet-multiGPU.torch/fbcunn_files/AbstractParallel.lua:97: out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCTensorCopy.cu:93
[thread 17 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCStorage.cu(30) : cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCGeneral.c:241
[thread 4 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: ...magenet-multiGPU.torch/fbcunn_files/AbstractParallel.lua:97: out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCTensorCopy.cu:93
[thread 16 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCStorage.cu(30) : cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCGeneral.c:241
[thread 8 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: ...magenet-multiGPU.torch/fbcunn_files/AbstractParallel.lua:97: out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCTensorCopy.cu:93
[thread 18 endcallback] /home/archy/torch/share/lua/5.1/cutorch/init.lua:21: /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCStorage.cu(30) : cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-7449/cutorch/lib/THC/THCGeneral.c:241

Thank you very much!

ffmpbgrnn commented Jun 22 '15 02:06

Did you read the error at all? What does it say?


soumith commented Jun 22 '15 02:06

VGG-D is probably not going to fit in a K20.


soumith commented Jun 22 '15 02:06

Hi, sorry for the confusion. Let me explain more clearly. I trained VGG-A with the in-place ReLU and the latest cudnn.torch bindings, and here is the nvidia-smi result:

Sun Jun 21 22:42:52 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.65     Driver Version: 340.65         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          Off  | 0000:02:00.0     Off |                    0 |
| N/A   46C    P0   153W / 225W |   4438MiB /  4799MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          Off  | 0000:03:00.0     Off |                    0 |
| N/A   46C    P0   147W / 225W |   2742MiB /  4799MiB |     51%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m          Off  | 0000:83:00.0     Off |                    0 |
| N/A   49C    P0   155W / 225W |   2742MiB /  4799MiB |     54%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m          Off  | 0000:84:00.0     Off |                    0 |
| N/A   45C    P0   146W / 225W |   2742MiB /  4799MiB |     56%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     13440  /home/archy/torch/bin/luajit                        4422MiB |
|    1     13440  /home/archy/torch/bin/luajit                        2726MiB |
|    2     13440  /home/archy/torch/bin/luajit                        2726MiB |
|    3     13440  /home/archy/torch/bin/luajit                        2726MiB |
+-----------------------------------------------------------------------------+

As you can see, the memory usage of GPU0 is much higher than that of the other GPUs (4438MiB vs 2742MiB). I also tried to train AlexNet with 2 GPUs, with the following log:

Sun Jun 21 22:54:24 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.65     Driver Version: 340.65         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          Off  | 0000:02:00.0     Off |                    0 |
| N/A   41C    P0   121W / 225W |   1800MiB /  4799MiB |     91%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          Off  | 0000:03:00.0     Off |                    0 |
| N/A   41C    P0   127W / 225W |    932MiB /  4799MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m          Off  | 0000:83:00.0     Off |                    0 |
| N/A   40C    P0    48W / 225W |    394MiB /  4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m          Off  | 0000:84:00.0     Off |                    0 |
| N/A   36C    P0    46W / 225W |    394MiB /  4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0      3345  /home/archy/torch/bin/luajit                        1784MiB |
|    1      3345  /home/archy/torch/bin/luajit                         916MiB |
|    2      3345  /home/archy/torch/bin/luajit                         378MiB |
|    3      3345  /home/archy/torch/bin/luajit                         378MiB |
+-----------------------------------------------------------------------------+

GPU0 (1784MiB) uses about double the memory of GPU1 (916MiB). I thought GPU0 might be the cause of the out-of-memory problem. Thank you very much!

ffmpbgrnn commented Jun 22 '15 03:06

Hi, could you try to reduce the batch size using the command-line option: -batchSize 32

or even reduce it to:

-batchSize 16

See if that helps.


soumith commented Jun 22 '15 03:06

Hi @soumith, I am working together with @ffmpbgrnn. I think @ffmpbgrnn's question is simply this:

why does GPU 0 use double the GPU memory of GPU 1 (in the two-GPU case)?

We can set the out-of-memory problem aside for now. The current batch size fits the GPUs well; the point is the difference between GPU 0 and the other GPUs.

Another odd thing is that when -nGPU 2 is set, GPU 2 and GPU 3 still show memory usage of about 400 MB.

zhongwen commented Jun 22 '15 03:06

Hey @zhongwen, this is because when we synchronize the weights across GPUs, we provide dedicated buffers on GPU-1 for all the weights to accumulate into (so that the weight transfers from GPU{2,3,4} over to GPU-1 are done in parallel and non-blocking). That's why you see GPU-1 having higher memory usage.
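
A minimal sketch of that buffering scheme, assuming made-up tensor names and a made-up parameter count (this is not the repo's actual AbstractParallel.lua code):

require 'cutorch'

local nGPU = 4
local nParams = 10 * 1000 * 1000               -- assumed conv-parameter count

-- stand-ins for the per-GPU gradients, one tensor on each peer GPU
local peerGrads = {}
for gpu = 2, nGPU do
   cutorch.setDevice(gpu)
   peerGrads[gpu] = torch.CudaTensor(nParams):zero()
end

-- the staging buffers all live on the first GPU: this is the extra
-- memory that shows up only on GPU0 in nvidia-smi
cutorch.setDevice(1)
local staging = {}
for gpu = 2, nGPU do
   staging[gpu] = torch.CudaTensor(nParams)     -- nParams * 4 bytes each
end

-- each peer copies into its own dedicated buffer, so the transfers are
-- independent and can overlap; the results are then accumulated locally
local masterGrad = torch.CudaTensor(nParams):zero()
for gpu = 2, nGPU do
   staging[gpu]:copy(peerGrads[gpu])
   masterGrad:add(staging[gpu])
end

With three peer GPUs, that is three extra parameter-sized tensors on the first GPU before counting anything else.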

soumith commented Jun 22 '15 03:06

To my understanding, the total size of the model is about 500MB, so why do the buffers occupy so much more than this?

zhongwen commented Jun 22 '15 03:06

Hey @zhongwen, you also have to take into account the Linear layers on GPU0. They run only on GPU0 and not on GPU1; the DataParallel covers only the convolution layers.

Does that make sense?

Essentially, you have roughly 400-500MB of overhead coming from the multi-GPU buffers (probably less than that, I would say), plus the overhead from the fully-connected layers and the SoftMax.
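
For a rough sense of the fully-connected overhead, here is a hypothetical back-of-the-envelope check, assuming VGG-A style fully-connected layers (512*7*7 -> 4096 -> 4096 -> 1000); the exact layer sizes in the model file may differ:

-- weights + biases of the three fully-connected layers
local fcParams = 25088*4096 + 4096
               + 4096*4096 + 4096
               + 4096*1000 + 1000
print(fcParams)                 -- roughly 123.6 million parameters
print(fcParams * 4 / 2^20)      -- about 472 MiB of float32 weights
-- the gradient storage is the same size again, and all of it lives on
-- the first GPU only, on top of that GPU's share of the convolutions

Together with the synchronization buffers, that accounts for much of the gap between GPU0 and the other GPUs in the nvidia-smi output above.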

soumith commented Jun 22 '15 03:06