[BUG] CUDA OOM when loading OPT-66B
Describe the bug
Hello, I get a CUDA OOM when loading facebook/opt-66b
onto GPUs (up to 96 A100-80GB) with ZeRO-3. I suspect the model is not being partitioned correctly.
With other tools such as ColossalAI I am able to load the model on 16 A100s. I would appreciate any suggestions.
I found a similar issue, but it concerns inference with OPT-30B:
https://github.com/microsoft/DeepSpeed/issues/2520
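A rough back-of-the-envelope estimate (my own sketch; the 66B parameter count is the nominal model size) of why one 80 GB A100 cannot hold an unpartitioned fp16 copy of the model:

# Back-of-the-envelope sketch: memory for an unpartitioned fp16 copy of OPT-66B.
params = 66_000_000_000       # nominal parameter count of facebook/opt-66b
bytes_per_param = 2           # fp16
full_model_gib = params * bytes_per_param / 2**30
print(f"full fp16 model: ~{full_model_gib:.0f} GiB vs. 80 GiB per A100")  # ~123 GiB
# If every rank tries to move the full model onto its GPU, module.to(device) must OOM.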
Config
01/06/2023 03:56:15 - INFO - __main__ - Distributed environment: DEEPSPEED Backend: nccl
Num processes: 24
Process index: 8
Local process index: 2
Device: cuda:2
ds_config: {'train_batch_size': 96, 'train_micro_batch_size_per_gpu': 4, 'gradient_accumulation_steps': 1,
'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'cpu'}, 'offload_param': {'device': 'cpu'},
'stage3_gather_16bit_weights_on_model_save': False}, 'steps_per_print': inf, 'fp16': {'enabled': True,
'initial_scale_power': 10}, 'zero_allow_untested_optimizer': True}
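For readability, the same ds_config with brief annotations (content identical to the log above; inf is written as float("inf")):

ds_config = {
    "train_batch_size": 96,                      # = 4 micro-batch * 1 grad-accum * 24 processes
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,                              # ZeRO-3: partition params, grads, optimizer states
        "offload_optimizer": {"device": "cpu"},  # optimizer states offloaded to CPU
        "offload_param": {"device": "cpu"},      # parameters offloaded to CPU
        "stage3_gather_16bit_weights_on_model_save": False,
    },
    "steps_per_print": float("inf"),
    "fp16": {"enabled": True, "initial_scale_power": 10},
    "zero_allow_untested_optimizer": True,
}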
ds_report output
Please run ds_report to give us details about your setup.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.7/dist-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.7/dist-packages/deepspeed']
deepspeed info ................... 0.7.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
Traceback
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ run_clm_no_trainer.py:664 in │
│ <module> │
│ │
│ 661 │
│ 662 │
│ 663 if __name__ == "__main__": │
│ ❱ 664 │ main() │
│ 665 │
│ │
│ /opt/tiger/ByteBM-Training/OPT-Benchmark/run_clm_no_trainer.py:505 in main │
│ │
│ 502 │ │
│ 503 │ # Prepare everything with our `accelerator`. │
│ 504 │ model, optimizer, train_dataloader, eval_dataloader, lr_scheduler │
│ ❱ 505 │ │ model, optimizer, train_dataloader, eval_dataloader, lr_schedu │
│ 506 │ ) │
│ 507 │ │
│ 508 │ # We need to recalculate our total training steps as the size of t │
│ │
│ /usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:872 in │
│ prepare │
│ │
│ 869 │ │ │ old_named_params = self._get_named_parameters(*args) │
│ 870 │ │ │
│ 871 │ │ if self.distributed_type == DistributedType.DEEPSPEED: │
│ ❱ 872 │ │ │ result = self._prepare_deepspeed(*args) │
│ 873 │ │ elif self.distributed_type == DistributedType.MEGATRON_LM: │
│ 874 │ │ │ result = self._prepare_megatron_lm(*args) │
│ 875 │ │ else: │
│ │
│ /usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:1093 in │
│ _prepare_deepspeed │
│ │
│ 1090 │ │ │ │ │ │ if type(scheduler).__name__ in deepspeed.runt │
│ 1091 │ │ │ │ │ │ │ kwargs["lr_scheduler"] = scheduler │
│ 1092 │ │ │ │
│ ❱ 1093 │ │ │ engine, optimizer, _, lr_scheduler = deepspeed.initialize │
│ 1094 │ │ │ if optimizer is not None: │
│ 1095 │ │ │ │ optimizer = DeepSpeedOptimizerWrapper(optimizer) │
│ 1096 │ │ │ if scheduler is not None: │
│ │
│ /usr/local/lib/python3.7/dist-packages/deepspeed/__init__.py:135 in │
│ initialize │
│ │
│ 132 │ │ │ │ │ │ │ │ dist_init_required=dist_init_required │
│ 133 │ │ │ │ │ │ │ │ collate_fn=collate_fn, │
│ 134 │ │ │ │ │ │ │ │ config=config, │
│ ❱ 135 │ │ │ │ │ │ │ │ config_params=config_params) │
│ 136 │ else: │
│ 137 │ │ assert mpu is None, "mpu must be None with pipeline parallelis │
│ 138 │ │ engine = PipelineEngine(args=args, │
│ │
│ /usr/local/lib/python3.7/dist-packages/deepspeed/runtime/engine.py:290 in │
│ __init__ │
│ │
│ 287 │ │ self.pipeline_parallelism = isinstance(model, PipelineModule) │
│ 288 │ │ │
│ 289 │ │ # Configure distributed model │
│ ❱ 290 │ │ self._configure_distributed_model(model) │
│ 291 │ │ │
│ 292 │ │ self._get_model_parameters() │
│ 293 │
│ │
│ /usr/local/lib/python3.7/dist-packages/deepspeed/runtime/engine.py:1070 in │
│ _configure_distributed_model │
│ │
│ 1067 │ │ │ self.__check_params(self.module, torch.float) │
│ 1068 │ │ │
│ 1069 │ │ if not self.dont_change_device: │
│ ❱ 1070 │ │ │ self.module.to(self.device) │
│ 1071 │ │ │
│ 1072 │ │ # MoE related initialization │
│ 1073 │ │ for _, module in self.module.named_modules(): │
│ │
│ /usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py:1682 │
│ in to │
│ │
│ 1679 │ │ │ │ " model has already been set to the correct devices a │
│ 1680 │ │ │ ) │
│ 1681 │ │ else: │
│ ❱ 1682 │ │ │ return super().to(*args, **kwargs) │
│ 1683 │ │
│ 1684 │ def half(self, *args): │
│ 1685 │ │ # Checks if the model has been loaded in 8-bit │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:907 in to │
│ │
│ 904 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_fo │
│ 905 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.i │
│ 906 │ │ │
│ ❱ 907 │ │ return self._apply(convert) │
│ 908 │ │
│ 909 │ def register_backward_hook( │
│ 910 │ │ self, hook: Callable[['Module', _grad_t, _grad_t], Union[None │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:578 in │
│ _apply │
│ │
│ 575 │ │
│ 576 │ def _apply(self, fn): │
│ 577 │ │ for module in self.children(): │
│ ❱ 578 │ │ │ module._apply(fn) │
│ 579 │ │ │
│ 580 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 581 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:578 in │
│ _apply │
│ │
│ 575 │ │
│ 576 │ def _apply(self, fn): │
│ 577 │ │ for module in self.children(): │
│ ❱ 578 │ │ │ module._apply(fn) │
│ 579 │ │ │
│ 580 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 581 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:578 in │
│ _apply │
│ │
│ 575 │ │
│ 576 │ def _apply(self, fn): │
│ 577 │ │ for module in self.children(): │
│ ❱ 578 │ │ │ module._apply(fn) │
│ 579 │ │ │
│ 580 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 581 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:578 in │
│ _apply │
│ │
│ 575 │ │
│ 576 │ def _apply(self, fn): │
│ 577 │ │ for module in self.children(): │
│ ❱ 578 │ │ │ module._apply(fn) │
│ 579 │ │ │
│ 580 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 581 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:578 in │
│ _apply │
│ │
│ 575 │ │
│ 576 │ def _apply(self, fn): │
│ 577 │ │ for module in self.children(): │
│ ❱ 578 │ │ │ module._apply(fn) │
│ 579 │ │ │
│ 580 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 581 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:578 in │
│ _apply │
│ │
│ 575 │ │
│ 576 │ def _apply(self, fn): │
│ 577 │ │ for module in self.children(): │
│ ❱ 578 │ │ │ module._apply(fn) │
│ 579 │ │ │
│ 580 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 581 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:601 in │
│ _apply │
│ │
│ 598 │ │ │ # track autograd history of `param_applied`, so we have t │
│ 599 │ │ │ # `with torch.no_grad():` │
│ 600 │ │ │ with torch.no_grad(): │
│ ❱ 601 │ │ │ │ param_applied = fn(param) │
│ 602 │ │ │ should_use_set_data = compute_should_use_set_data(param, │
│ 603 │ │ │ if should_use_set_data: │
│ 604 │ │ │ │ param.data = param_applied │
│ │
│ /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:905 in │
│ convert │
│ │
│ 902 │ │ │ if convert_to_format is not None and t.dim() in (4, 5): │
│ 903 │ │ │ │ return t.to(device, dtype if t.is_floating_point() or │
│ 904 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_fo │
│ ❱ 905 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.i │
│ 906 │ │ │
│ 907 │ │ return self._apply(convert) │
│ 908 │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA out of memory. Tried to allocate 162.00 MiB (GPU 2; 79.35 GiB
total capacity; 77.00 GiB already allocated; 146.19 MiB free; 77.01 GiB reserved
in total by PyTorch) If reserved memory is >> allocated memory try setting
max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2413 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2415 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2412) of binary: /usr/bin/python3
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'n176-051-158.byted.org_2189_0' has failed to send a keep-alive heartbeat to the rendezvous 'colossalai-default-job' due to an error of type RendezvousTimeoutError.
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
@MikeChenfu, can you please clarify if you are loading the checkpoint for inference, finetuning, or continued training?
Thanks @tjruwase for the reply. I am loading the checkpoint for training, but the OOM occurs at the preparation stage (accelerator.prepare):
run_clm_no_trainer.py:664 in │
│ <module> │
│ │
│ 661 │
│ 662 │
│ 663 if __name__ == "__main__": │
│ ❱ 664 │ main() │
│ 665 │
│ │
│ /opt/tiger/ByteBM-Training/OPT-Benchmark/run_clm_no_trainer.py:505 in main │
│ │
│ 502 │ │
│ 503 │ # Prepare everything with our `accelerator`. │
│ 504 │ model, optimizer, train_dataloader, eval_dataloader, lr_scheduler │
│ ❱ 505 │ │ model, optimizer, train_dataloader, eval_dataloader, lr_schedu │
│ 506 │ ) │
│ 507 │ │
│ 508 │ # We need to recalculate our total training steps as the size of t │
│ │
│ /usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:872 in │
│ prepare
@MikeChenfu It may be that the ZeRO-3 context manager is not being used correctly. Can you try initializing the DeepSpeed config through HuggingFace as below, specifically the HfDeepSpeedConfig line, which has to be created before from_pretrained:
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

model_name = "facebook/opt-66b"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
config = AutoConfig.from_pretrained(model_name)
# ds_config is the ZeRO-3 dict shown above. HfDeepSpeedConfig must be created
# (and kept alive) before from_pretrained so the model is loaded with
# zero.Init, i.e. partitioned across ranks instead of fully on each GPU.
dschf = HfDeepSpeedConfig(ds_config)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model, optimizer, _, _ = deepspeed.initialize(args=None, model=model,
                                              model_parameters=model.parameters(),
                                              config=ds_config)
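If the model is instead prepared through Accelerate (as run_clm_no_trainer.py does), the equivalent is to enable zero.Init via the DeepSpeed plugin before the model is loaded. A minimal sketch, assuming a recent accelerate release; "ds_config.json" is a placeholder for the config above:

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Sketch only: "ds_config.json" is a placeholder path to the ZeRO-3 config above.
# zero3_init_flag=True hooks up HfDeepSpeedConfig internally, so a subsequent
# AutoModelForCausalLM.from_pretrained(...) builds the model under
# deepspeed.zero.Init (partitioned across ranks) rather than fully on each rank.
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json", zero3_init_flag=True)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)

Note that the Accelerator (and thus the plugin) has to exist before from_pretrained is called for this to take effect.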
Thanks @jomayeri, I will try it. :)
@MikeChenfu I am going to close this issue for now. Feel free to reopen if there are further developments.