DeepSpeed
DeepSpeed still gives CUDA-out-of-memory issue
May I know why this training code still gives a CUDA out-of-memory error even after DeepSpeed is turned on?

See this for historical tracking purposes.
DeepSpeed's memory optimizations are enabled via ZeRO stages >= 1, and most of them are effective in multi-GPU cases. Your log shows that ZeRO is not enabled.

Please see these tutorials (a minimal enabling sketch follows the links):
- https://www.deepspeed.ai/tutorials/zero/
- https://www.deepspeed.ai/tutorials/zero-offload/
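For orientation, here is a minimal sketch of the usual flow; ZeRO itself is enabled purely through the zero_optimization section of the JSON config passed via --deepspeed_config. The tiny Linear model is just a placeholder standing in for the client script's actual network.

```python
import argparse
import deepspeed
import torch

# Placeholder model; in this thread it would be the GDAS supernet in gdas.py.
model = torch.nn.Linear(10, 10)

# ZeRO is enabled purely through the config file passed via --deepspeed_config,
# whose "zero_optimization" section would contain e.g. { "stage": 1 }.
parser = argparse.ArgumentParser()
parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed / --deepspeed_config
args = parser.parse_args()

# deepspeed.initialize wraps the model in a DeepSpeedEngine that applies the
# configured ZeRO stage.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
)
```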
[2022-09-09 15:01:50,764] [INFO] [logging.py:68:log_dist] [Rank 0] step=1248, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:50,765] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=14.454011142357578, CurrSamplesPerSec=9.851701418075939, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
[2022-09-09 15:01:51,547] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:51,547] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=14.449232505063256, CurrSamplesPerSec=10.233603997501561, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
[2022-09-09 15:01:52,349] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:52,349] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=14.444045524931788, CurrSamplesPerSec=9.977597824784409, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
Traceback (most recent call last):
File "gdas.py", line 947, in <module>
ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
File "gdas.py", line 737, in train_NN
torch.save(graph, path)
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 379, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 589, in _save
pickler.dump(obj)
AttributeError: Can't pickle local object 'instrument_w_nvtx.<locals>.wrapped_fn'
[6eb301421b63:00906] *** Process received signal ***
[6eb301421b63:00906] Signal: Segmentation fault (11)
[6eb301421b63:00906] Signal code: Address not mapped (1)
[6eb301421b63:00906] Failing at address: 0x7f4e4f16420d
[6eb301421b63:00906] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f4e52011980]
[6eb301421b63:00906] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f4e51c50775]
[6eb301421b63:00906] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f4e524bbe44]
[6eb301421b63:00906] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f4e51c51605]
[6eb301421b63:00906] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f4e524b9cb3]
[6eb301421b63:00906] *** End of error message ***
I have updated the ZeRO optimization config as follows; however, I am now getting a different error, related to the pickler, shown above.
{
    "train_micro_batch_size_per_gpu": 8,
    "steps_per_print": 1,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.05
        }
    },
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 1e7,
        "sub_group_size": 1e9,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        }
    }
}
@buttercutter, can you please try stage == 1? We can return to this stage 3 issue later.
@tjruwase "stage": 1 resulted in the following out-of-memory issue
[2022-09-09 15:45:39,534] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=21.891309503059055, CurrSamplesPerSec=12.814082860076859, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-09 15:45:40,142] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:45:40,142] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=21.87964277609462, CurrSamplesPerSec=13.148494490509256, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-09 15:45:40,755] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:45:40,755] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=21.867808169940616, CurrSamplesPerSec=13.059329395683946, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
Finished train_NN()
Traceback (most recent call last):
File "gdas.py", line 955, in <module>
lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
File "gdas.py", line 780, in train_architecture
graph.forward(val_inputs, types="edge") # Lval(w*, alpha)
File "gdas.py", line 553, in forward
self.cells[c].forward(x, x1, x2, c, types=types)
File "gdas.py", line 439, in forward
self.nodes[n].forward(x, node_num=n, types=types) # Ltrain(w±, alpha)
File "gdas.py", line 317, in forward
y = self.connections[cc].forward(x, types)
File "gdas.py", line 274, in forward
edges_results = edges_results + self.edges[e].forward(x, types)
File "gdas.py", line 164, in forward
y_hat = self.forward_edge(x)
File "gdas.py", line 149, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[6eb301421b63:01205] *** Process received signal ***
[6eb301421b63:01205] Signal: Segmentation fault (11)
[6eb301421b63:01205] Signal code: Address not mapped (1)
[6eb301421b63:01205] Failing at address: 0x7f556779a20d
[6eb301421b63:01205] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f556a647980]
[6eb301421b63:01205] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f556a286775]
[6eb301421b63:01205] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f556aaf1e44]
[6eb301421b63:01205] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f556a287605]
[6eb301421b63:01205] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f556aaefcb3]
[6eb301421b63:01205] *** End of error message ***
@buttercutter, got it. This out-of-memory with stage 1 is quite strange, because the model seems to train for a while, completing multiple iterations, before failing. The compute and memory pattern of model training is typically regular, so any out-of-memory issue should show up within one or two iterations. Is it possible that you have a memory leak? Can you please use the DeepSpeed memory profiler as discussed here? Thanks!
@tjruwase From the code link you provided just above,
- What is forward_step_func? I suppose this is not simply the model wrapper, since it feeds in args.curriculum_learning?
- Do I actually need to use from megatron import get_timers? Since this is a timer, does it actually log the memory profile for each backpropagation path?

- forward_step_func is just the forward pass call. Actually, it could be any function whose effect on memory you are interested in.
- No, you don't need the timers import from megatron. Sorry, I just noticed that see_memory_usage() is commented out in the code snippet.
Let's say you have a function foo; we can profile its memory usage as below.
from deepspeed.runtime.utils import see_memory_usage
see_memory_usage(f'memory usage before foo', force=True)
foo(...)
see_memory_usage(f'memory usage after foo', force=True)
[phung@archlinux gdas]$ git status --short
M ds_config.json
M gdas.py
[phung@archlinux gdas]$ git diff gdas.py
diff --git a/gdas.py b/gdas.py
index 10c095c..e6a8b5c 100644
--- a/gdas.py
+++ b/gdas.py
@@ -13,6 +13,9 @@ import tensorflow as tf
# import numpy as np
+VISUALIZER = 0
+DEBUG = 0
+
# deepspeed zero offload https://www.deepspeed.ai/getting-started/
# https://github.com/microsoft/DeepSpeed/issues/2029
USE_DEEPSPEED = 1
@@ -21,8 +24,8 @@ if USE_DEEPSPEED:
import argparse
import deepspeed
-VISUALIZER = 0
-DEBUG = 0
+ from deepspeed.runtime.utils import see_memory_usage
+
logdir = 'runs/gdas_experiment_1'
if VISUALIZER:
@@ -944,7 +947,10 @@ if __name__ == "__main__":
while not_converged:
print("run_num = ", run_num)
+ see_memory_usage(f'memory usage before train_NN()', force=True)
ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
+ see_memory_usage(f'memory usage after train_NN()', force=True)
+
print("Finished train_NN()")
if VISUALIZER or DEBUG:
[phung@archlinux gdas]$
[phung@archlinux gdas]$ git diff ds_config.json
diff --git a/ds_config.json b/ds_config.json
index 91943c3..4afc33d 100644
--- a/ds_config.json
+++ b/ds_config.json
@@ -1,10 +1,27 @@
{
- "train_micro_batch_size_per_gpu": 8,
-"steps_per_print": 1,
- "optimizer": {
+ "train_micro_batch_size_per_gpu": 8,
+ "steps_per_print": 1,
+ "optimizer": {
"type": "AdamW",
"params": {
"lr": 0.05
}
- }
+ },
+
+ "zero_optimization": {
+ "stage": 1,
+ "contiguous_gradients": true,
+ "stage3_max_live_parameters": 1e9,
+ "stage3_max_reuse_distance": 1e9,
+ "stage3_prefetch_bucket_size": 1e7,
+ "stage3_param_persistence_threshold": 1e5,
+ "reduce_bucket_size": 1e7,
+ "sub_group_size": 1e9,
+ "offload_optimizer": {
+ "device": "cpu"
+ },
+ "offload_param": {
+ "device": "cpu"
+ }
+ }
}
[phung@archlinux gdas]$
@tjruwase With the above modifications to the gdas.py training source code as well as the deepspeed json config file, I have the following terminal output log, which is contained inside gdas.ipynb.zip. May I ask whether the following log gives any hints about the location of the memory leak, if there is one?
Loading extension module utils...
Time to load utils op: 0.338411808013916 seconds
Rank: 0 partition count [1] and sizes[(48330, False)]
[2022-09-10 02:29:09,959] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-09-10 02:29:09,959] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.04 GB Max_CA 0 GB
[2022-09-10 02:29:09,960] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,114] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states
[2022-09-10 02:29:10,114] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.04 GB Max_CA 0 GB
[2022-09-10 02:29:10,115] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,115] [INFO] [stage_1_and_2.py:516:__init__] optimizer state initialized
[2022-09-10 02:29:10,261] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2022-09-10 02:29:10,262] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.04 GB Max_CA 0 GB
[2022-09-10 02:29:10,262] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,267] [INFO] [config.py:987:print] DeepSpeedEngine configuration:
[2022-09-10 02:29:10,268] [INFO] [config.py:991:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
run_num = 0
[gdas.ipynb.zip](https://github.com/microsoft/DeepSpeed/files/9539457/gdas.ipynb.zip)
[2022-09-10 02:29:10,422] [INFO] [utils.py:827:see_memory_usage] memory usage before train_NN()
[2022-09-10 02:29:10,423] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.04 GB Max_CA 0 GB
[2022-09-10 02:29:10,423] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,748] [INFO] [logging.py:68:log_dist] [Rank 0] step=1, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,863] [INFO] [logging.py:68:log_dist] [Rank 0] step=2, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,988] [INFO] [logging.py:68:log_dist] [Rank 0] step=3, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,988] [INFO] [timer.py:207:stop] 0/3, RunningAvgSamplesPerSec=64.29764247567839, CurrSamplesPerSec=64.29764247567839, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,106] [INFO] [logging.py:68:log_dist] [Rank 0] step=4, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:11,107] [INFO] [timer.py:207:stop] 0/4, RunningAvgSamplesPerSec=66.00525611771185, CurrSamplesPerSec=67.80604576253033, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,225] [INFO] [logging.py:68:log_dist] [Rank 0] step=5, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:11,225] [INFO] [timer.py:207:stop] 0/5, RunningAvgSamplesPerSec=66.55657751974118, CurrSamplesPerSec=67.68731983531258, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,355] [INFO] [logging.py:68:log_dist] [Rank 0] step=6, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:54,910] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=21.507489154176568, CurrSamplesPerSec=12.836937339152996, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:55,527] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:55,528] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=21.496126966776316, CurrSamplesPerSec=12.963147028644192, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:56,185] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:56,186] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=21.482932504893128, CurrSamplesPerSec=12.168760408265106, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:56,993] [INFO] [utils.py:827:see_memory_usage] memory usage after train_NN()
[2022-09-10 02:36:56,993] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.1 GB CA 0.1 GB Max_CA 0 GB
[2022-09-10 02:36:56,994] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.19 GB, percent = 33.0%
Finished train_NN()
@tjruwase Sorry, in the previous message I had put see_memory_usage around the wrong function call.
It should be train_architecture() instead of train_NN().
[2022-09-10 03:20:34,729] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=22.113168856352097, CurrSamplesPerSec=13.503006889446912, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:35,340] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 03:20:35,340] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=22.100969229749683, CurrSamplesPerSec=13.097596801095447, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:35,937] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 03:20:35,938] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=22.089474650856374, CurrSamplesPerSec=13.399278091939252, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:36,715] [INFO] [utils.py:827:see_memory_usage] memory usage after train_NN()
[2022-09-10 03:20:36,716] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.1 GB CA 0.1 GB Max_CA 0 GB
[2022-09-10 03:20:36,716] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.21 GB, percent = 33.2%
Finished train_NN()
[2022-09-10 03:20:36,847] [INFO] [utils.py:827:see_memory_usage] memory usage before train_architecture()
[2022-09-10 03:20:36,847] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.1 GB Max_CA 0 GB
[2022-09-10 03:20:36,847] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.21 GB, percent = 33.2%
Traceback (most recent call last):
File "gdas.py", line 962, in <module>
lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
File "gdas.py", line 783, in train_architecture
graph.forward(val_inputs, types="edge") # Lval(w*, alpha)
File "gdas.py", line 556, in forward
self.cells[c].forward(x, x1, x2, c, types=types)
File "gdas.py", line 442, in forward
self.nodes[n].forward(x, node_num=n, types=types) # Ltrain(w±, alpha)
File "gdas.py", line 320, in forward
y = self.connections[cc].forward(x, types)
File "gdas.py", line 277, in forward
edges_results = edges_results + self.edges[e].forward(x, types)
File "gdas.py", line 167, in forward
y_hat = self.forward_edge(x)
File "gdas.py", line 152, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[a31af820d2a8:01199] *** Process received signal ***
[a31af820d2a8:01199] Signal: Segmentation fault (11)
[a31af820d2a8:01199] Signal code: Address not mapped (1)
[a31af820d2a8:01199] Failing at address: 0x7f3db597220d
[a31af820d2a8:01199] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f3db8416980]
[a31af820d2a8:01199] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f3db8055775]
[a31af820d2a8:01199] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f3db88c0e44]
[a31af820d2a8:01199] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f3db8056605]
[a31af820d2a8:01199] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f3db88becb3]
[a31af820d2a8:01199] *** End of error message ***
Unfortunately, that is not very useful. Can you please add "memory_breakdown": true into your json config?
Attached : gdas.ipynb.zip
@tjruwase See the last cell of the above ipynb file, which contains a very long log produced with "memory_breakdown": true.
@tjruwase If you scroll to the very bottom of the ipynb file, the log output seems to imply that the gumbel function might have contributed to the issue, although I am not quite sure how that is possible. Is there a way to work around this?
@tjruwase Should I take it that deepspeed is unable to further optimize the gumbel function?
# self-defined initial NAS architecture, for supernet architecture edge weight training
def forward_edge(self, x):
self.__freeze_f()
self.__unfreeze_w()
# Refer to GDAS equations (5) and (6)
# if one_hot is already there, would summation be required given that all other entries are forced to 0 ?
# It's not required, but you don't know which index is the one-hot encoded 1.
# https://pytorch.org/docs/stable/nn.functional.html#gumbel-softmax
# See also https://github.com/D-X-Y/AutoDL-Projects/issues/10#issuecomment-916619163
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
chosen_edge = torch.argmax(gumbel, dim=0) # converts one-hot encoding into integer
return chosen_edge
@buttercutter, I don't understand what gumbel is doing. So, can you please clarify your expectations for DeepSpeed in this case, and what you mean by "further optimize"? Thanks!
@tjruwase Sorry for the ambiguous wording; I meant memory offloading to the CPU rather than "further optimize".
Note: this is not a corporate project, just a homebrew project, so I am not asking for deep involvement on your side. I just need to find out what is wrong with my own code, and if the situation requires it, I might need a little guidance in debugging the interface between my code and deepspeed, or beyond.
[2022-09-10 03:20:36,847] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.21 GB, percent = 33.2%
Traceback (most recent call last):
File "gdas.py", line 962, in <module>
lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
File "gdas.py", line 783, in train_architecture
graph.forward(val_inputs, types="edge") # Lval(w*, alpha)
File "gdas.py", line 556, in forward
self.cells[c].forward(x, x1, x2, c, types=types)
File "gdas.py", line 442, in forward
self.nodes[n].forward(x, node_num=n, types=types) # Ltrain(w±, alpha)
File "gdas.py", line 320, in forward
y = self.connections[cc].forward(x, types)
File "gdas.py", line 277, in forward
edges_results = edges_results + self.edges[e].forward(x, types)
File "gdas.py", line 167, in forward
y_hat = self.forward_edge(x)
File "gdas.py", line 152, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch)
@tjruwase In my own understanding, gumbel-softmax allows the gradient to backpropagate across a discrete domain.
See the difference between equations (3) and (5)

@buttercutter, I think the issue is that ZeRO does not optimize the memory consumption of activations. Can you try running with a train_micro_batch_size_per_gpu of 1 to test this? Also, assuming you are now comfortable with using see_memory_usage(), can you add it just before the gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True) call?
@buttercutter, if the smaller micro batch size works, you might try activation checkpointing to address this. Here are some docs (a short sketch follows the links):
- PyTorch: https://pytorch.org/docs/stable/checkpoint.html
- DeepSpeed: https://deepspeed.readthedocs.io/en/latest/activation-checkpointing.html
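A minimal sketch of the PyTorch variant, assuming a submodule block whose activations dominate memory (the names here are illustrative, not taken from gdas.py):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Illustrative submodule; in practice this would be one of the memory-heavy
# cells/connections of the supernet.
block = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())

# The input must require grad for gradients to flow through checkpoint().
x = torch.randn(8, 256, requires_grad=True)

# Instead of y = block(x), route the call through checkpoint(): block's
# intermediate activations are discarded after the forward pass and recomputed
# during backward, trading compute for memory.
y = checkpoint(block, x)
y.sum().backward()
```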
@tjruwase if I use a train_micro_batch_size_per_gpu of 1, I get the following error, which was spotted, reported, and resolved earlier in a previous GitHub issue:
Traceback (most recent call last):
File "gdas.py", line 947, in <module>
ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
File "gdas.py", line 690, in train_NN
Ltrain = criterion(NN_output, NN_train_labels)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py", line 1166, in forward
label_smoothing=self.label_smoothing)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3014, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (8) to match target batch_size (1).
@buttercutter, yes, this is not surprising, as it indicates a mismatch between the batch size passed to the client script via the command line and the batch size in the ds_config. In the previous case, it made sense to modify the ds_config value of 1 to match the command-line value of 8. But now, since we are trying to profile the memory behavior without encountering out-of-memory errors, it makes sense to modify the command-line value to 1 to match the ds_config. Does that make sense?
@tjruwase OK, I modified the batch_size to 1 in both the ds_config and the client script for testing purposes. So far, the GPU memory usage stays around 4 GB.
However, I am confused as to why the batch_size needs to be reduced in order to avoid the CUDA out-of-memory error, when the primary purpose of the deepspeed package is to offload memory usage from the GPU to the CPU.
@buttercutter, thanks for the update.
No, we are not solving the problem by reducing the batch_size; rather, we are trying to confirm whether the memory bloat is of the kind that ZeRO is designed to solve. There are two major sources of memory consumption in model training:
1. Model state: parameters, gradients, optimizer state
2. Activations due to inputs

ZeRO is designed to solve (1) but not (2). For (2), you need to use activation checkpointing, as in the links I shared earlier. Here are some next steps that could give more insight:
- Measure the impact of batch size (activations) on memory by trying batch sizes 2 and 4
- Use see_memory_usage() to track memory utilization at vital points, such as before/after forward and before/after backward, for the batch sizes that don't cause out-of-memory errors (a sketch follows this list)
- Enable activation checkpointing to see the impact
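A sketch of the second step, reusing the see_memory_usage import shown earlier; the loader, criterion, and engine names are placeholders for the corresponding objects in gdas.py:

```python
from deepspeed.runtime.utils import see_memory_usage

for step, (inputs, labels) in enumerate(train_loader):  # placeholder dataloader
    see_memory_usage(f'step {step}: before forward', force=True)
    outputs = model_engine(inputs)
    loss = criterion(outputs, labels)                   # placeholder loss
    see_memory_usage(f'step {step}: after forward', force=True)

    model_engine.backward(loss)                         # DeepSpeed engine backward
    see_memory_usage(f'step {step}: after backward', force=True)

    model_engine.step()
```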
@tjruwase Regarding using checkpointing to trade compute for memory, should I use only the deepspeed version?
And how would I perform checkpointing.configure() for this particular gumbel-softmax call: gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)?
As for the DeepSpeed config json file, how should I properly set the following params?
"activation_checkpointing": {
"partition_activations": false,
"cpu_checkpointing": false,
"contiguous_memory_optimization": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
@buttercutter, yes, I recommend DeepSpeed's activation checkpointing because it supports offloading the activation inputs to CPU memory. In terms of enabling it, there are two parts involved:
- Wrap the forward function of the client modeling code with deepspeed's checkpoint(), as done in this file. In your case, I guess you want to wrap a caller of this.
- Configure the deepspeed config json by trying out different settings. I can help with that later.

However, I think activation checkpointing is the last of the 3 steps that I proposed. In particular, how to use it effectively will depend on the findings from the first two steps, so I recommend completing those first. Also, activation checkpointing is not easy to implement, so I suggest first testing the Megatron-DeepSpeed implementation that I referenced earlier to understand how it works.
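A rough sketch of the two parts, under the assumption that block stands in for the caller you end up wrapping and that configure() accepts None for its mpu argument when partition_activations is disabled; the exact json settings would be tuned later:

```python
import deepspeed
import torch

# Config side: point DeepSpeed's activation checkpointing at the json config
# (its "activation_checkpointing" section shown earlier).
deepspeed.checkpointing.configure(None, deepspeed_config="ds_config.json")

# Code side: route the memory-heavy submodule's forward through deepspeed's
# checkpoint(), so its activations are recomputed in backward (and, with
# cpu_checkpointing enabled, the checkpointed inputs can live in CPU memory).
block = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())  # placeholder
x = torch.randn(8, 256, requires_grad=True)
y = deepspeed.checkpointing.checkpoint(block, x)
```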
@tjruwase I used a batch_size of 4, and it also resulted in memory usage of around 4 GB, as measured by see_memory_usage() around the gumbel-softmax function. What do you suggest in this particular case before proceeding to activation checkpointing?
[2022-09-29 19:47:32,987] [INFO] [utils.py:827:see_memory_usage] memory usage before gumbel
[2022-09-29 19:47:32,988] [INFO] [utils.py:832:see_memory_usage] MA 0.61 GB Max_MA 0.61 GB CA 0.68 GB Max_CA 1 GB
[2022-09-29 19:47:32,988] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.44 GB, percent = 35.0%
[2022-09-29 19:47:33,142] [INFO] [utils.py:827:see_memory_usage] memory usage after gumbel
[2022-09-29 19:47:33,142] [INFO] [utils.py:832:see_memory_usage] MA 0.61 GB Max_MA 0.61 GB CA 0.68 GB Max_CA 1 GB
[2022-09-29 19:47:33,143] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.44 GB, percent = 35.0%
@tjruwase if I stick with a batch_size of 4, the training process runs smoothly without the CUDA out-of-memory issue so far, but of course the training time is much longer compared to a batch_size of 8, 16 or 32. So I suppose that is an incentive to use activation checkpointing in this particular case? Could you advise on this?
@tjruwase it is a bit strange that see_memory_usage() returned the same memory usage both before and after the gumbel_softmax() call.
What do you suggest then? See the following:
File "gdas.py", line 152, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch)
@buttercutter, yes, it is quite mysterious to get the same memory usage before and after gumbel_softmax. This suggests that something (e.g., a kernel) is allocating memory outside the view of the CUDA memory allocator. If you are willing to be adventurous, I suggest finer-grained memory profiling by wrapping the following line with see_memory_usage:
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
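For instance, a sketch of that finer-grained probe which reproduces the hard branch of the stock gumbel_softmax locally, so the installed functional.py does not have to be edited; the probe labels match the "memory usage before/after variable ret" lines in the log below:

```python
import torch
from deepspeed.runtime.utils import see_memory_usage

def gumbel_softmax_hard_probed(logits, tau=1.0, dim=-1):
    # Same math as F.gumbel_softmax(..., hard=True), with memory probes around
    # the straight-through line that appears in the traceback.
    gumbels = -torch.empty_like(logits).exponential_().log()   # Gumbel(0, 1) noise
    y_soft = ((logits + gumbels) / tau).softmax(dim)
    index = y_soft.max(dim, keepdim=True)[1]
    y_hard = torch.zeros_like(logits).scatter_(dim, index, 1.0)

    see_memory_usage('memory usage before variable ret', force=True)
    ret = y_hard - y_soft.detach() + y_soft                    # straight-through estimator
    see_memory_usage('memory usage after variable ret', force=True)
    return ret
```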
@tjruwase see the following log, which does not seem very useful for further debugging:
[2022-11-09 04:45:52,214] [INFO] [utils.py:827:see_memory_usage] memory usage before variable ret
[2022-11-09 04:45:52,215] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB Max_MA 12.34 GB CA 13.71 GB Max_CA 14 GB
[2022-11-09 04:45:52,215] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.63 GB, percent = 36.5%
[2022-11-09 04:45:52,345] [INFO] [utils.py:827:see_memory_usage] memory usage after variable ret
[2022-11-09 04:45:52,346] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB Max_MA 12.34 GB CA 13.71 GB Max_CA 14 GB
[2022-11-09 04:45:52,346] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.63 GB, percent = 36.5%
[2022-11-09 04:45:52,484] [INFO] [utils.py:827:see_memory_usage] memory usage before variable ret
[2022-11-09 04:45:52,485] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB Max_MA 12.34 GB CA 13.71 GB Max_CA 14 GB
[2022-11-09 04:45:52,485] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.63 GB, percent = 36.5%
Traceback (most recent call last):
File "gdas.py", line 960, in <module>
lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
File "gdas.py", line 785, in train_architecture
graph.forward(val_inputs, types="edge") # Lval(w*, alpha)
File "gdas.py", line 558, in forward
self.cells[c].forward(x, x1, x2, c, types=types)
File "gdas.py", line 449, in forward
self.nodes[n].forward(x2, node_num=n, types=types) # Ltrain(w±, alpha)
File "gdas.py", line 322, in forward
y = self.connections[cc].forward(x, types)
File "gdas.py", line 279, in forward
edges_results = edges_results + self.edges[e].forward(x, types)
File "gdas.py", line 169, in forward
y_hat = self.forward_edge(x)
File "gdas.py", line 154, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1899, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 12.34 GiB already allocated; 15.75 MiB free; 13.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[3a98c6920156:00675] *** Process received signal ***
[3a98c6920156:00675] Signal: Segmentation fault (11)
[3a98c6920156:00675] Signal code: Address not mapped (1)
[3a98c6920156:00675] Failing at address: 0x7f7953ad920d
[3a98c6920156:00675] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f7956783980]
[3a98c6920156:00675] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f79563c2775]
[3a98c6920156:00675] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f7956c2de44]
[3a98c6920156:00675] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f79563c3605]
[3a98c6920156:00675] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f7956c2bcb3]
[3a98c6920156:00675] *** End of error message ***
The log snippet suggests that this code passes the 1st time but fails the 2nd time. Is that correct?