
DeepSpeed still gives CUDA-out-of-memory issue

Open buttercutter opened this issue 3 years ago • 12 comments

May I know why this training code still gives a CUDA out-of-memory error even after DeepSpeed is turned on?


See this for historical tracking purposes.

buttercutter avatar Sep 07 '22 16:09 buttercutter

DeepSpeed's memory optimizations are enabled via ZeRO stages >= 1, and most of them are effective in multi-GPU cases. Your log shows that ZeRO is not enabled.

Please see these tutorials:

  1. https://www.deepspeed.ai/tutorials/zero/
  2. https://www.deepspeed.ai/tutorials/zero-offload/
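For reference, a minimal ds_config.json sketch that enables ZeRO stage 1 (field names follow the ZeRO tutorial above; the values are illustrative only, not tuned for this model):

{
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {
        "type": "AdamW",
        "params": { "lr": 0.05 }
    },
    "zero_optimization": {
        "stage": 1
    }
}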

tjruwase avatar Sep 07 '22 16:09 tjruwase

[2022-09-09 15:01:50,764] [INFO] [logging.py:68:log_dist] [Rank 0] step=1248, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:50,765] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=14.454011142357578, CurrSamplesPerSec=9.851701418075939, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
[2022-09-09 15:01:51,547] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:51,547] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=14.449232505063256, CurrSamplesPerSec=10.233603997501561, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
[2022-09-09 15:01:52,349] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:52,349] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=14.444045524931788, CurrSamplesPerSec=9.977597824784409, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
Traceback (most recent call last):
  File "gdas.py", line 947, in <module>
    ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
  File "gdas.py", line 737, in train_NN
    torch.save(graph, path)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 589, in _save
    pickler.dump(obj)
AttributeError: Can't pickle local object 'instrument_w_nvtx.<locals>.wrapped_fn'
[6eb301421b63:00906] *** Process received signal ***
[6eb301421b63:00906] Signal: Segmentation fault (11)
[6eb301421b63:00906] Signal code: Address not mapped (1)
[6eb301421b63:00906] Failing at address: 0x7f4e4f16420d
[6eb301421b63:00906] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f4e52011980]
[6eb301421b63:00906] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f4e51c50775]
[6eb301421b63:00906] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f4e524bbe44]
[6eb301421b63:00906] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f4e51c51605]
[6eb301421b63:00906] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f4e524b9cb3]
[6eb301421b63:00906] *** End of error message ***

I have updated the ZeRO optimization config as follows; however, I am now getting a different error, related to pickling, shown in the traceback above.

{
    "train_micro_batch_size_per_gpu": 8,
    "steps_per_print": 1, 
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.05
        }
    },

    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 1e7,
        "sub_group_size": 1e9,
        "offload_optimizer": {
            "device": "cpu"
         },
        "offload_param": {
            "device": "cpu"
       }
   }
}
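As a hedged side note: torch.save(graph, path) pickles the entire module object, and after deepspeed.initialize that object contains DeepSpeed's nvtx-instrumented local closures, which the pickler rejects, hence the AttributeError above. Two common ways around this, assuming graph, model_engine and path as defined in gdas.py ("checkpoints" and "latest" are only illustrative values):

import torch

# save the weights only, instead of pickling the whole instrumented module
torch.save(graph.state_dict(), path)

# or let the DeepSpeed engine checkpoint module and optimizer state itself
model_engine.save_checkpoint("checkpoints", tag="latest")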

buttercutter avatar Sep 09 '22 15:09 buttercutter

@buttercutter, can you please try stage == 1? We can return to this stage 3 issue later.

tjruwase avatar Sep 09 '22 15:09 tjruwase

@tjruwase "stage": 1 resulted in the following out-of-memory issue

[2022-09-09 15:45:39,534] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=21.891309503059055, CurrSamplesPerSec=12.814082860076859, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-09 15:45:40,142] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:45:40,142] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=21.87964277609462, CurrSamplesPerSec=13.148494490509256, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-09 15:45:40,755] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:45:40,755] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=21.867808169940616, CurrSamplesPerSec=13.059329395683946, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
Finished train_NN()
Traceback (most recent call last):
  File "gdas.py", line 955, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 780, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 553, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 439, in forward
    self.nodes[n].forward(x, node_num=n, types=types)  # Ltrain(w±, alpha)
  File "gdas.py", line 317, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 274, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 164, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 149, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[6eb301421b63:01205] *** Process received signal ***
[6eb301421b63:01205] Signal: Segmentation fault (11)
[6eb301421b63:01205] Signal code: Address not mapped (1)
[6eb301421b63:01205] Failing at address: 0x7f556779a20d
[6eb301421b63:01205] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f556a647980]
[6eb301421b63:01205] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f556a286775]
[6eb301421b63:01205] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f556aaf1e44]
[6eb301421b63:01205] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f556a287605]
[6eb301421b63:01205] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f556aaefcb3]
[6eb301421b63:01205] *** End of error message ***

buttercutter avatar Sep 09 '22 15:09 buttercutter

@buttercutter, got it. This out-of-memory error with stage 1 is quite strange, because the model seems to train for a while, completing multiple iterations, before failing. The compute and memory pattern of model training is typically regular, so any out-of-memory issue should show up within one or two iterations. Is it possible that you have a memory leak? Can you please use the DeepSpeed memory profiler as discussed here? Thanks!

tjruwase avatar Sep 09 '22 16:09 tjruwase

@tjruwase From the code link you provided just above,

  1. What is forward_step_func? I suppose it is not simply the model wrapper, since it feeds in args.curriculum_learning?
  2. Do I actually need to use from megatron import get_timers? Since it is a timer, does it actually log the memory profile for each backpropagation path?

buttercutter avatar Sep 10 '22 00:09 buttercutter

  1. forward_step_func is just the forward pass call. Actually, it could be any function whose effect on memory you are interested in.
  2. No, you don't need the timers import from megatron. Sorry, I just noticed that see_memory_usage() is commented out in that code snippet.

Let's say you have a function foo; we can profile its memory usage like below.

 from deepspeed.runtime.utils import see_memory_usage

 see_memory_usage('memory usage before foo', force=True)
 foo(...)
 see_memory_usage('memory usage after foo', force=True)

tjruwase avatar Sep 10 '22 01:09 tjruwase

[phung@archlinux gdas]$ git status --short
 M ds_config.json
 M gdas.py
[phung@archlinux gdas]$ git diff gdas.py
diff --git a/gdas.py b/gdas.py
index 10c095c..e6a8b5c 100644
--- a/gdas.py
+++ b/gdas.py
@@ -13,6 +13,9 @@ import tensorflow as tf
 
 # import numpy as np
 
+VISUALIZER = 0
+DEBUG = 0
+
 # deepspeed zero offload https://www.deepspeed.ai/getting-started/
 # https://github.com/microsoft/DeepSpeed/issues/2029
 USE_DEEPSPEED = 1
@@ -21,8 +24,8 @@ if USE_DEEPSPEED:
     import argparse
     import deepspeed
 
-VISUALIZER = 0
-DEBUG = 0
+    from deepspeed.runtime.utils import see_memory_usage
+
 logdir = 'runs/gdas_experiment_1'
 
 if VISUALIZER:
@@ -944,7 +947,10 @@ if __name__ == "__main__":
     while not_converged:
         print("run_num = ", run_num)
 
+        see_memory_usage(f'memory usage before train_NN()', force=True)
         ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
+        see_memory_usage(f'memory usage after train_NN()', force=True)
+
         print("Finished train_NN()")
 
         if VISUALIZER or DEBUG:
[phung@archlinux gdas]$ 
[phung@archlinux gdas]$ git diff ds_config.json
diff --git a/ds_config.json b/ds_config.json
index 91943c3..4afc33d 100644
--- a/ds_config.json
+++ b/ds_config.json
@@ -1,10 +1,27 @@
 {
- "train_micro_batch_size_per_gpu": 8,
-"steps_per_print": 1, 
- "optimizer": {
+    "train_micro_batch_size_per_gpu": 8,
+    "steps_per_print": 1, 
+    "optimizer": {
         "type": "AdamW",
         "params": {
             "lr": 0.05
         }
-    }
+    },
+
+    "zero_optimization": {
+        "stage": 1,
+        "contiguous_gradients": true,
+        "stage3_max_live_parameters": 1e9,
+        "stage3_max_reuse_distance": 1e9,
+        "stage3_prefetch_bucket_size": 1e7,
+        "stage3_param_persistence_threshold": 1e5,
+        "reduce_bucket_size": 1e7,
+        "sub_group_size": 1e9,
+        "offload_optimizer": {
+            "device": "cpu"
+         },
+        "offload_param": {
+            "device": "cpu"
+       }
+   }
 }
[phung@archlinux gdas]$ 

@tjruwase With the above modifications to the gdas.py training code and the DeepSpeed JSON config file, I get the following terminal output log, contained inside [gdas.ipynb.zip](https://github.com/microsoft/DeepSpeed/files/9539457/gdas.ipynb.zip). May I ask whether the following log gives any hints about the location of the memory leak, if there is one?

Loading extension module utils...
Time to load utils op: 0.338411808013916 seconds
Rank: 0 partition count [1] and sizes[(48330, False)] 
[2022-09-10 02:29:09,959] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-09-10 02:29:09,959] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.04 GB         Max_CA 0 GB 
[2022-09-10 02:29:09,960] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,114] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states
[2022-09-10 02:29:10,114] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.04 GB         Max_CA 0 GB 
[2022-09-10 02:29:10,115] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,115] [INFO] [stage_1_and_2.py:516:__init__] optimizer state initialized
[2022-09-10 02:29:10,261] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2022-09-10 02:29:10,262] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.04 GB         Max_CA 0 GB 
[2022-09-10 02:29:10,262] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,267] [INFO] [config.py:987:print] DeepSpeedEngine configuration:
[2022-09-10 02:29:10,268] [INFO] [config.py:991:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}

run_num =  0
[2022-09-10 02:29:10,422] [INFO] [utils.py:827:see_memory_usage] memory usage before train_NN()
[2022-09-10 02:29:10,423] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.04 GB         Max_CA 0 GB 
[2022-09-10 02:29:10,423] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,748] [INFO] [logging.py:68:log_dist] [Rank 0] step=1, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,863] [INFO] [logging.py:68:log_dist] [Rank 0] step=2, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,988] [INFO] [logging.py:68:log_dist] [Rank 0] step=3, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,988] [INFO] [timer.py:207:stop] 0/3, RunningAvgSamplesPerSec=64.29764247567839, CurrSamplesPerSec=64.29764247567839, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,106] [INFO] [logging.py:68:log_dist] [Rank 0] step=4, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:11,107] [INFO] [timer.py:207:stop] 0/4, RunningAvgSamplesPerSec=66.00525611771185, CurrSamplesPerSec=67.80604576253033, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,225] [INFO] [logging.py:68:log_dist] [Rank 0] step=5, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:11,225] [INFO] [timer.py:207:stop] 0/5, RunningAvgSamplesPerSec=66.55657751974118, CurrSamplesPerSec=67.68731983531258, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,355] [INFO] [logging.py:68:log_dist] [Rank 0] step=6, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]

[2022-09-10 02:36:54,910] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=21.507489154176568, CurrSamplesPerSec=12.836937339152996, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:55,527] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:55,528] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=21.496126966776316, CurrSamplesPerSec=12.963147028644192, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:56,185] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:56,186] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=21.482932504893128, CurrSamplesPerSec=12.168760408265106, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:56,993] [INFO] [utils.py:827:see_memory_usage] memory usage after train_NN()
[2022-09-10 02:36:56,993] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.1 GB         CA 0.1 GB         Max_CA 0 GB 
[2022-09-10 02:36:56,994] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.19 GB, percent = 33.0%
Finished train_NN()

buttercutter avatar Sep 10 '22 02:09 buttercutter

@tjruwase Sorry, I had put see_memory_usage around the wrong function call in the previous message.
It should be train_architecture() instead of train_NN().

[2022-09-10 03:20:34,729] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=22.113168856352097, CurrSamplesPerSec=13.503006889446912, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:35,340] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 03:20:35,340] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=22.100969229749683, CurrSamplesPerSec=13.097596801095447, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:35,937] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 03:20:35,938] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=22.089474650856374, CurrSamplesPerSec=13.399278091939252, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:36,715] [INFO] [utils.py:827:see_memory_usage] memory usage after train_NN()
[2022-09-10 03:20:36,716] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.1 GB         CA 0.1 GB         Max_CA 0 GB 
[2022-09-10 03:20:36,716] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.21 GB, percent = 33.2%
Finished train_NN()
[2022-09-10 03:20:36,847] [INFO] [utils.py:827:see_memory_usage] memory usage before train_architecture()
[2022-09-10 03:20:36,847] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.1 GB         Max_CA 0 GB 
[2022-09-10 03:20:36,847] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.21 GB, percent = 33.2%
Traceback (most recent call last):
  File "gdas.py", line 962, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 783, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 556, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 442, in forward
    self.nodes[n].forward(x, node_num=n, types=types)  # Ltrain(w±, alpha)
  File "gdas.py", line 320, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 277, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 167, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 152, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[a31af820d2a8:01199] *** Process received signal ***
[a31af820d2a8:01199] Signal: Segmentation fault (11)
[a31af820d2a8:01199] Signal code: Address not mapped (1)
[a31af820d2a8:01199] Failing at address: 0x7f3db597220d
[a31af820d2a8:01199] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f3db8416980]
[a31af820d2a8:01199] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f3db8055775]
[a31af820d2a8:01199] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f3db88c0e44]
[a31af820d2a8:01199] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f3db8056605]
[a31af820d2a8:01199] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f3db88becb3]
[a31af820d2a8:01199] *** End of error message ***

buttercutter avatar Sep 10 '22 03:09 buttercutter

Unfortunately, that is not very useful. Can you please add "memory_breakdown": true to your JSON config?

tjruwase avatar Sep 10 '22 11:09 tjruwase

Attached: gdas.ipynb.zip

@tjruwase See the last cell of the attached ipynb file, which contains a very long log produced with "memory_breakdown": true.

buttercutter avatar Sep 11 '22 01:09 buttercutter

@tjruwase if you scroll to the very bottom of the ipynb file, the log output seems to imply that the gumbel function might have contributed to the issue, although I am not quite sure how this is possible. Is there a way to work around this?

buttercutter avatar Sep 11 '22 03:09 buttercutter

@tjruwase Am I correct to understand that DeepSpeed is unable to further optimize the gumbel function?

    # self-defined initial NAS architecture, for supernet architecture edge weight training
    def forward_edge(self, x):
        self.__freeze_f()
        self.__unfreeze_w()

        # Refer to GDAS equations (5) and (6)
        # if one_hot is already there, would summation be required given that all other entries are forced to 0 ?
        # It's not required, but you don't know, which index is one hot encoded 1.
        # https://pytorch.org/docs/stable/nn.functional.html#gumbel-softmax
        # See also https://github.com/D-X-Y/AutoDL-Projects/issues/10#issuecomment-916619163

        gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
        chosen_edge = torch.argmax(gumbel, dim=0)  # converts one-hot encoding into integer

        return chosen_edge

buttercutter avatar Sep 25 '22 00:09 buttercutter

@buttercutter, I don't understand what gumbel is doing. So can you please clarify your expectations for DeepSpeed in this case, and what you mean by "further optimize"? Thanks!

tjruwase avatar Sep 26 '22 11:09 tjruwase

@tjruwase Sorry for the ambiguous wording; I meant memory offloading to CPU rather than "further optimize".

Note: this is not a corporate project, just a homebrew one, so I am not asking for deep involvement on your side. I just need to find out what is wrong with my own code, and, if the situation requires it, I might need a little guidance from you in debugging the interface between my code and DeepSpeed, or beyond.

[2022-09-10 03:20:36,847] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.21 GB, percent = 33.2%
Traceback (most recent call last):
  File "gdas.py", line 962, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 783, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 556, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 442, in forward
    self.nodes[n].forward(x, node_num=n, types=types)  # Ltrain(w±, alpha)
  File "gdas.py", line 320, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 277, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 167, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 152, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) 

buttercutter avatar Sep 26 '22 22:09 buttercutter

@tjruwase In my understanding, gumbel-softmax allows the gradient to backpropagate through a discrete sampling step.

See the difference between equations (3) and (5)


buttercutter avatar Sep 26 '22 22:09 buttercutter

@buttercutter, I think the issue is that ZeRO does not optimize the memory consumption of activations. Can you try running with a train_micro_batch_size_per_gpu of 1 to test this? Also, assuming you are now comfortable with see_memory_usage(), can you add it just before the gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True) call?
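Concretely, that instrumentation could look like this inside forward_edge() in gdas.py (x and TAU_GUMBEL come from the existing code):

from deepspeed.runtime.utils import see_memory_usage
import torch.nn.functional as F

see_memory_usage('memory usage before gumbel', force=True)
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
see_memory_usage('memory usage after gumbel', force=True)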

tjruwase avatar Sep 26 '22 23:09 tjruwase

@buttercutter, you might try activation checkpointing to address this if the smaller micro batch size works. Here are some docs (a minimal sketch follows the links):

  1. PyTorch: https://pytorch.org/docs/stable/checkpoint.html
  2. DeepSpeed: https://deepspeed.readthedocs.io/en/latest/activation-checkpointing.html
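To make option 1 concrete, a minimal sketch of PyTorch's built-in checkpointing (some_submodule and x are placeholders, not names from gdas.py):

from torch.utils.checkpoint import checkpoint

# during backward, some_submodule is re-run to regenerate its intermediate
# activations instead of keeping them in GPU memory, trading compute for memory
y = checkpoint(some_submodule, x)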

tjruwase avatar Sep 26 '22 23:09 tjruwase

@tjruwase if I use a train_micro_batch_size_per_gpu of 1, I get the following error, which was spotted, reported, and resolved earlier in a previous GitHub issue.

Traceback (most recent call last):
  File "gdas.py", line 947, in <module>
    ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
  File "gdas.py", line 690, in train_NN
    Ltrain = criterion(NN_output, NN_train_labels)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py", line 1166, in forward
    label_smoothing=self.label_smoothing)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (8) to match target batch_size (1).

buttercutter avatar Sep 27 '22 14:09 buttercutter

@buttercutter, yes, this is not surprising, as it indicates a mismatch between the batch size passed to the client script via the command line and the batch size in ds_config. In the previous case, it made sense to modify the ds_config value of 1 to match the command-line value of 8. But now, since we are trying to profile the memory behavior without hitting out-of-memory errors, it makes sense to modify the command-line value to 1 to match the ds_config. Does that make sense?

tjruwase avatar Sep 27 '22 14:09 tjruwase

@tjruwase ok, I modified batch_size to 1 in both the ds_config and the client script for testing purposes. So far, GPU memory usage stays around 4 GB.

However, I am confused: why does batch_size need to be reduced to solve the CUDA out-of-memory error, when the primary purpose of the DeepSpeed package is to offload memory usage from GPU to CPU?

buttercutter avatar Sep 27 '22 14:09 buttercutter

@buttercutter, thanks for the update.

No, we are not solving the problem by reducing the batch_size; rather, we are trying to confirm whether this memory bloat is of the kind that ZeRO is designed to solve. There are two major sources of memory consumption in model training:

  1. Model state: parameters, gradients, optimizer state
  2. Activations due to inputs

ZeRO is designed to address (1) but not (2). For (2), you need to use activation checkpointing, as in the links I shared earlier. Here are some next steps that could provide more insight:

  1. Measure impact of batch size (activations) on memory by trying batch size 2 and 4
  2. Use see_memory_usage() to track memory utilization at vital points, such as before/after forward and before/after backward, for the batch sizes that don't cause out-of-memory errors (see the sketch after this list).
  3. Enable activation checkpointing to see the impact
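A sketch of what step 2 could look like around the DeepSpeed engine calls (model_engine is the engine from gdas.py; inputs, labels and criterion are placeholders for the corresponding objects in the training loop):

from deepspeed.runtime.utils import see_memory_usage

see_memory_usage('before forward', force=True)
outputs = model_engine(inputs)
see_memory_usage('after forward', force=True)

loss = criterion(outputs, labels)

see_memory_usage('before backward', force=True)
model_engine.backward(loss)
model_engine.step()
see_memory_usage('after backward/step', force=True)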

tjruwase avatar Sep 27 '22 15:09 tjruwase

@tjruwase Regarding using checkpointing to trade compute for memory, should I only use the DeepSpeed version?

And how would I perform checkpointing.configure() for this particular gumbel-max function: gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)?

As for the DeepSpeed config JSON file, how should I properly set the following params?

 "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
    }

buttercutter avatar Sep 29 '22 23:09 buttercutter

@buttercutter, yes, I recommend DeepSpeed's activation checkpointing because it supports offloading the activation inputs to CPU memory. In terms of enabling it, there are two parts involved.

  1. Wrap the forward function of the client modeling code with deepspeed's checkpoint() as done in this file. In your case, I guess you want to wrap a caller of this.
  2. Configure the deepspeed config json by trying out different settings. I can help with that later.

However, activation checkpointing is the last of the 3 steps that I proposed. In particular, how to use it effectively will depend on the findings from the first two steps, so I recommend completing those first. Also, activation checkpointing is not easy to implement, so I suggest first testing out the Megatron-DeepSpeed implementation that I referenced earlier to understand how it works.
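A rough sketch of part 1 using DeepSpeed's activation checkpointing API (cell_forward and x are placeholders; in the Megatron-DeepSpeed code referenced above, deepspeed.checkpointing.configure(...) is called first to pick up the JSON settings):

import deepspeed

# same calling convention as torch.utils.checkpoint.checkpoint:
# cell_forward is re-run during backward instead of storing its activations
y = deepspeed.checkpointing.checkpoint(cell_forward, x)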

tjruwase avatar Sep 30 '22 10:09 tjruwase

@tjruwase I used a batch_size of 4, and it also resulted in memory usage of around 4 GB according to see_memory_usage() around the gumbel-softmax call. What do you suggest in this particular case before proceeding to activation checkpointing?

[2022-09-29 19:47:32,987] [INFO] [utils.py:827:see_memory_usage] memory usage before gumbel
[2022-09-29 19:47:32,988] [INFO] [utils.py:832:see_memory_usage] MA 0.61 GB         Max_MA 0.61 GB         CA 0.68 GB         Max_CA 1 GB 
[2022-09-29 19:47:32,988] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.44 GB, percent = 35.0%
[2022-09-29 19:47:33,142] [INFO] [utils.py:827:see_memory_usage] memory usage after gumbel
[2022-09-29 19:47:33,142] [INFO] [utils.py:832:see_memory_usage] MA 0.61 GB         Max_MA 0.61 GB         CA 0.68 GB         Max_CA 1 GB 
[2022-09-29 19:47:33,143] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.44 GB, percent = 35.0%

buttercutter avatar Sep 30 '22 11:09 buttercutter

@tjruwase if I stick with a batch_size of 4, the training process runs smoothly without a CUDA out-of-memory error so far, but of course training takes much longer compared to a batch_size of 8, 16 or 32. So I suppose that is an incentive to use activation checkpointing in this particular case? Could you advise on this?

buttercutter avatar Sep 30 '22 12:09 buttercutter

@tjruwase it is a bit strange that see_memory_usage() reported the same memory usage both before and after the gumbel_softmax() call.

What do you suggest, then? See the following:

File "gdas.py", line 152, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) 

buttercutter avatar Oct 15 '22 00:10 buttercutter

@buttercutter, yes, it is quite mysterious to get the same memory usage before and after gumbel_softmax. This suggests that something (e.g., a kernel) is allocating memory outside the view of the CUDA memory allocator. If you are willing to be adventurous, I suggest finer-grained memory profiling by wrapping the following line with see_memory_usage:

  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
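That is, temporarily editing the installed functional.py around that line, roughly as follows (y_hard and y_soft already exist at that point inside gumbel_softmax):

from deepspeed.runtime.utils import see_memory_usage

see_memory_usage('memory usage before variable ret', force=True)
ret = y_hard - y_soft.detach() + y_soft
see_memory_usage('memory usage after variable ret', force=True)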

tjruwase avatar Nov 07 '22 13:11 tjruwase

@tjruwase see the following log, which does not seem very useful for further debugging.


[2022-11-09 04:45:52,214] [INFO] [utils.py:827:see_memory_usage] memory usage before variable ret
[2022-11-09 04:45:52,215] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB         Max_MA 12.34 GB         CA 13.71 GB         Max_CA 14 GB 
[2022-11-09 04:45:52,215] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.63 GB, percent = 36.5%
[2022-11-09 04:45:52,345] [INFO] [utils.py:827:see_memory_usage] memory usage after variable ret
[2022-11-09 04:45:52,346] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB         Max_MA 12.34 GB         CA 13.71 GB         Max_CA 14 GB 
[2022-11-09 04:45:52,346] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.63 GB, percent = 36.5%
[2022-11-09 04:45:52,484] [INFO] [utils.py:827:see_memory_usage] memory usage before variable ret
[2022-11-09 04:45:52,485] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB         Max_MA 12.34 GB         CA 13.71 GB         Max_CA 14 GB 
[2022-11-09 04:45:52,485] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.63 GB, percent = 36.5%
Traceback (most recent call last):
  File "gdas.py", line 960, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 785, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 558, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 449, in forward
    self.nodes[n].forward(x2, node_num=n, types=types)  # Ltrain(w±, alpha)
  File "gdas.py", line 322, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 279, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 169, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 154, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1899, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 12.34 GiB already allocated; 15.75 MiB free; 13.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[3a98c6920156:00675] *** Process received signal ***
[3a98c6920156:00675] Signal: Segmentation fault (11)
[3a98c6920156:00675] Signal code: Address not mapped (1)
[3a98c6920156:00675] Failing at address: 0x7f7953ad920d
[3a98c6920156:00675] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f7956783980]
[3a98c6920156:00675] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f79563c2775]
[3a98c6920156:00675] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f7956c2de44]
[3a98c6920156:00675] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f79563c3605]
[3a98c6920156:00675] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f7956c2bcb3]
[3a98c6920156:00675] *** End of error message ***

buttercutter avatar Nov 09 '22 07:11 buttercutter

The log snippet suggests that this code passes the first time but fails the second time. Is that correct?

tjruwase avatar Nov 09 '22 11:11 tjruwase