[BUG] CUDA OOM when loading OPT-66B
Describe the bug
Hello, I get a CUDA OOM when loading facebook/opt-66b
onto GPUs (up to 96 A100-80GB) with ZeRO-3. I suspect the model is not being partitioned correctly.
With other tools such as ColossalAI I am able to load the model on 16 A100s. I would appreciate any suggestions.
I found a similar issue, but it concerns inference with OPT-30B:
https://github.com/microsoft/DeepSpeed/issues/2520
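A rough back-of-the-envelope estimate (my own sketch; the 66B parameter count is the nominal model size) of why one 80 GB A100 cannot hold an unpartitioned fp16 copy of the model:

# Back-of-the-envelope sketch: memory for an unpartitioned fp16 copy of OPT-66B.
params = 66_000_000_000       # nominal parameter count of facebook/opt-66b
bytes_per_param = 2           # fp16
full_model_gib = params * bytes_per_param / 2**30
print(f"full fp16 model: ~{full_model_gib:.0f} GiB vs. 80 GiB per A100")  # ~123 GiB
# If every rank tries to move the full model onto its GPU, module.to(device) must OOM.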
Config
01/06/2023 03:56:15 - INFO - __main__ - Distributed environment: DEEPSPEED Backend: nccl
Num processes: 24
Process index: 8
Local process index: 2
Device: cuda:2
ds_config: {'train_batch_size': 96, 'train_micro_batch_size_per_gpu': 4, 'gradient_accumulation_steps': 1,
'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'cpu'}, 'offload_param': {'device': 'cpu'},
'stage3_gather_16bit_weights_on_model_save': False}, 'steps_per_print': inf, 'fp16': {'enabled': True,
'initial_scale_power': 10}, 'zero_allow_untested_optimizer': True}
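For readability, the same ds_config with brief annotations (content identical to the log above; inf is written as float("inf")):

ds_config = {
    "train_batch_size": 96,                      # = 4 micro-batch * 1 grad-accum * 24 processes
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,                              # ZeRO-3: partition params, grads, optimizer states
        "offload_optimizer": {"device": "cpu"},  # optimizer states offloaded to CPU
        "offload_param": {"device": "cpu"},      # parameters offloaded to CPU
        "stage3_gather_16bit_weights_on_model_save": False,
    },
    "steps_per_print": float("inf"),
    "fp16": {"enabled": True, "initial_scale_power": 10},
    "zero_allow_untested_optimizer": True,
}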
ds_report output
Please run ds_report to give us details about your setup.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.7/dist-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.7/dist-packages/deepspeed']
deepspeed info ................... 0.7.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
Traceback
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ run_clm_no_trainer.py:664 in │
│ <module> │
│ │
│ 661 │
│ 662 │
│ 663 if __name__ == "__main__": │
│ ❱ 664 │ main() │
│ 665 │
│ │
│ /opt/tiger/ByteBM-Training/OPT-Benchmark/run_clm_no_trainer.py:505 in main │
│ │
│ 502 │ │
│ 503 │ # Prepare everything with our `accelerator`. │
│ 504 │ model, optimizer, train_dataloader, eval_dataloader, lr_scheduler │
│ ❱ 505 │ │ model, optimizer, train_dataloader, eval_dataloader, lr_schedu │
│ 506 │ ) │
│ 507 │ │
│ 508 │ # We need to recalculate our total training steps as the size of t │
│ │
│ /usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:872 in │
│ prepare │
│ │
│ 869 │ │ │ old_named_params = self._get_named_parameters(*args) │
│ 870 │ │ │
│ 871 │ │ if self.distributed_type == DistributedType.DEEPSPEED: │
│ ❱ 872 │ │ │ result = self._prepare_deepspeed(*args) │
│ 873 │ │ elif self.distributed_type == DistributedType.MEGATRON_LM: │
│ 874 │ │ │ result = self._prepare_megatron_lm(*args) │
│ 875 │ │ else: │
│ │
│ /usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:1093 in │
│ _prepare_deepspeed │
│ │
│ 1090 │ │ │ │ │ │ if type(scheduler).__name__ in deepspeed.runt │
│ 1091 │ │ │ │ │ │ │ kwargs["lr_scheduler"] = scheduler │
│ 1092 │ │ │ │
│ ❱ 1093 │ │ │ engine, optimizer, _, lr_scheduler = deepspeed.initialize │
│ 1094 │ │ │ if optimizer is not None: │
│ 1095 │ │ │ │ optimizer = DeepSpeedOptimizerWrapper(optimizer) │
│ 1096 │ │ │ if scheduler is not None: │
│ │
│ /usr/local/lib/python3.7/dist-packages/deepspeed/__init__.py:135 in │
│ initialize │
│ │
│ 132 │ │ │ │ │ │ │ │ dist_init_required=dist_init_required │
│ 133 │ │ │ │ │ │ │ │ collate_fn=collate_fn, │
│ 134 │ │ │ │ │ │ │ │ config=config, │
│ ❱ 135 │ │ │ │ │ │ │ │ config_params=config_params) │
│ 136 │ else: │
│ 137 │ │ assert mpu is None, "mpu must be None with pipeline parallelis │
│ 138 │ │ engine = PipelineEngine(args=args, │
│ │
│ /usr/local/lib/python3.7/dist-packages/deepspeed/runtime/engine.py:290 in │
│ __init__ │
│ │
│ 287 │ │ self.pipeline_parallelism = isinstance(model, PipelineModule) │
│ 288 │ │ │
│ 289 │ │ # Configure distributed model │
│ ❱ 290 │ │ self._configure_distributed_model(model) │
│ 291 │ │ │
│ 292 │ │ self._get_model_parameters() │
│ 293 │
│ │
│ /usr/local/lib/python3.7/dist-packages/deepspeed/runtime/engine.py:1070 in │
│ _configure_distributed_model │
│ │
│ 1067 │ │ │ self.__check_params(self.module, torch.float) │
│ 1068 │ │ │
│ 1069 │ │ if not self.dont_change_device: │
│ ❱ 1070 │ │ │ self.module.to(self.device) │
│ 1071 │ │ │
│ 1072 │ │ # MoE related initialization │
│ 1073 │ │ for _, module in self.module.named_modules(): │
│ │
│ /usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py:1682 │
│ in to │
│ │
│ 1679 │ │ │ │ " model has already been set to the correct devices a │
│ 1680 │ │ │ ) │
│ 1681 │ │ else: │
│ ❱ 1682 │ │ │ return super().to(*args, **kwargs) │
│ 1683 │ │
│ 1684 │ def half(self, *args): │
│ 1685 │ │ # Checks if the model has been loaded in 8-bit │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:907 in to │
│ │
│ 904 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_fo │
│ 905 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.i │
│ 906 │ │ │
│ ❱ 907 │ │ return self._apply(convert) │
│ 908 │ │
│ 909 │ def register_backward_hook( │
│ 910 │ │ self, hook: Callable[['Module', _grad_t, _grad_t], Union[None │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:578 in │
│ _apply │
│ │
│ 575 │ │
│ 576 │ def _apply(self, fn): │
│ 577 │ │ for module in self.children(): │
│ ❱ 578 │ │ │ module._apply(fn) │
│ 579 │ │ │
│ 580 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 581 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:578 in │
│ _apply │
│ │
│ 575 │ │
│ 576 │ def _apply(self, fn): │
│ 577 │ │ for module in self.children(): │
│ ❱ 578 │ │ │ module._apply(fn) │
│ 579 │ │ │
│ 580 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 581 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:578 in │
│ _apply │
│ │
│ 575 │ │
│ 576 │ def _apply(self, fn): │
│ 577 │ │ for module in self.children(): │
│ ❱ 578 │ │ │ module._apply(fn) │
│ 579 │ │ │
│ 580 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 581 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:578 in │
│ _apply │
│ │
│ 575 │ │
│ 576 │ def _apply(self, fn): │
│ 577 │ │ for module in self.children(): │
│ ❱ 578 │ │ │ module._apply(fn) │
│ 579 │ │ │
│ 580 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 581 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:578 in │
│ _apply │
│ │
│ 575 │ │
│ 576 │ def _apply(self, fn): │
│ 577 │ │ for module in self.children(): │
│ ❱ 578 │ │ │ module._apply(fn) │
│ 579 │ │ │
│ 580 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 581 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:578 in │
│ _apply │
│ │
│ 575 │ │
│ 576 │ def _apply(self, fn): │
│ 577 │ │ for module in self.children(): │
│ ❱ 578 │ │ │ module._apply(fn) │
│ 579 │ │ │
│ 580 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 581 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:601 in │
│ _apply │
│ │
│ 598 │ │ │ # track autograd history of `param_applied`, so we have t │
│ 599 │ │ │ # `with torch.no_grad():` │
│ 600 │ │ │ with torch.no_grad(): │
│ ❱ 601 │ │ │ │ param_applied = fn(param) │
│ 602 │ │ │ should_use_set_data = compute_should_use_set_data(param, │
│ 603 │ │ │ if should_use_set_data: │
│ 604 │ │ │ │ param.data = param_applied │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:905 in │
│ convert │
│ │
│ 902 │ │ │ if convert_to_format is not None and t.dim() in (4, 5): │
│ 903 │ │ │ │ return t.to(device, dtype if t.is_floating_point() or │
│ 904 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_fo │
│ ❱ 905 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.i │
│ 906 │ │ │
│ 907 │ │ return self._apply(convert) │
│ 908 │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA out of memory. Tried to allocate 162.00 MiB (GPU 2; 79.35 GiB
total capacity; 77.00 GiB already allocated; 146.19 MiB free; 77.01 GiB reserved
in total by PyTorch) If reserved memory is >> allocated memory try setting
max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2413 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2415 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2412) of binary: /usr/bin/python3
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'n176-051-158.byted.org_2189_0' has failed to send a keep-alive heartbeat to the rendezvous 'colossalai-default-job' due to an error of type RendezvousTimeoutError.
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
@MikeChenfu, can you please clarify if you are loading the checkpoint for inference, finetuning, or continued training?
Thanks @tjruwase for the reply. I am loading the checkpoint for training, but the OOM occurs at the preparation stage (accelerator.prepare):
run_clm_no_trainer.py:664 in │
│ <module> │
│ │
│ 661 │
│ 662 │
│ 663 if __name__ == "__main__": │
│ ❱ 664 │ main() │
│ 665 │
│ │
│ /opt/tiger/ByteBM-Training/OPT-Benchmark/run_clm_no_trainer.py:505 in main │
│ │
│ 502 │ │
│ 503 │ # Prepare everything with our `accelerator`. │
│ 504 │ model, optimizer, train_dataloader, eval_dataloader, lr_scheduler │
│ ❱ 505 │ │ model, optimizer, train_dataloader, eval_dataloader, lr_schedu │
│ 506 │ ) │
│ 507 │ │
│ 508 │ # We need to recalculate our total training steps as the size of t │
│ │
│ /usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:872 in │
│ prepare
@MikeChenfu It may be that the ZeRO-3 context manager is not being used correctly. Can you try initializing the DeepSpeed config through HuggingFace as below, specifically the HfDeepSpeedConfig line, which has to be created before from_pretrained:
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

model_name = "facebook/opt-66b"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
config = AutoConfig.from_pretrained(model_name)
# ds_config is the ZeRO-3 dict shown above. HfDeepSpeedConfig must be created
# (and kept alive) before from_pretrained so the model is loaded with
# zero.Init, i.e. partitioned across ranks instead of fully on each GPU.
dschf = HfDeepSpeedConfig(ds_config)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model, optimizer, _, _ = deepspeed.initialize(args=None, model=model,
                                              model_parameters=model.parameters(),
                                              config=ds_config)
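If the model is instead prepared through Accelerate (as run_clm_no_trainer.py does), the equivalent is to enable zero.Init via the DeepSpeed plugin before the model is loaded. A minimal sketch, assuming a recent accelerate release; "ds_config.json" is a placeholder for the config above:

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Sketch only: "ds_config.json" is a placeholder path to the ZeRO-3 config above.
# zero3_init_flag=True hooks up HfDeepSpeedConfig internally, so a subsequent
# AutoModelForCausalLM.from_pretrained(...) builds the model under
# deepspeed.zero.Init (partitioned across ranks) rather than fully on each rank.
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json", zero3_init_flag=True)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)

Note that the Accelerator (and thus the plugin) has to exist before from_pretrained is called for this to take effect.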
Thanks @jomayeri, I will try it. :)
@MikeChenfu I am going to close this issue for now. Feel free to reopen if there are further developments.