
[BUG]: RuntimeError: Failed to replace block_sparse_moe of type MixtralSparseMoeBlock with EPMixtralSparseMoeBlock with the exception: CUDA out of memory

Open · ericxsun opened this issue on Mar 14, 2024 · 0 comments

🐛 Describe the bug

When training a Mixture of Experts (MoE) model with the code in applications/ColossalMoE, I hit an Out of Memory (OOM) error right at the start of training, while the model is being sharded.
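
For context, my script follows the ColossalMoE example fairly closely. The sketch below is a simplified reconstruction of the failing path, not my exact code; the plugin and launch argument names (e.g. `MoeHybridParallelPlugin`, `precision`) are written from memory of the 0.3.6 API and may differ slightly:

```python
# Simplified sketch of the failing path (reconstructed, not my exact script).
# The OOM is raised inside booster.boost(), while shardformer materializes the
# lazily-initialized MixtralSparseMoeBlock weights in order to replace them
# with EPMixtralSparseMoeBlock.
import colossalai
import torch
from colossalai.booster import Booster
from colossalai.booster.plugin.moe_hybrid_parallel_plugin import MoeHybridParallelPlugin
from colossalai.lazy import LazyInitContext
from transformers import MixtralForCausalLM

colossalai.launch_from_torch(config={})

plugin = MoeHybridParallelPlugin(
    tp_size=1,
    pp_size=1,
    ep_size=8,          # expert parallelism over 8 GPUs
    precision="bf16",   # assumed; my run uses bf16 weights
)
booster = Booster(plugin=plugin)

# Weights are created lazily so they can be sharded before materialization.
with LazyInitContext():
    model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# The RuntimeError below is raised here, during module replacement.
model, optimizer, *_ = booster.boost(model, optimizer)
```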

RuntimeError: Failed to replace block_sparse_moe of type MixtralSparseMoeBlock with EPMixtralSparseMoeBlock with the exception: CUDA out of memory. Tried to allocate 112.00 MiB. 

full trace:

Traceback (most recent call last):
  File ".local/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 197, in _replace_sub_module
    replace_layer = target_module.from_native_module(
  File "mixtral/mixtral_layer.py", line 40, in from_native_module
    LazyInitContext.materialize(module)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 600, in materialize
    return _apply_to_lazy_module(module, apply_fn, verbose)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 625, in _apply_to_lazy_module
    apply_fn(name, p)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 598, in apply_fn
    p.materialize()
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 215, in materialize
    target = self._materialize_data()
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 240, in _materialize_data
    init_val = func(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 1 has a total capacty of 79.32 GiB of which 43.56 MiB is free. Process 3466292 has 79.28 GiB memory in use. Of the allocated memory 78.10 GiB is allocated by PyTorch, and 127.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "debug.py", line 553, in <module>
    main(args, cfg)
  File "debug.py", line 174, in main
    model, optimizer, _, _, lr_scheduler = booster.boost(model, optimizer, lr_scheduler=lr_scheduler)
  File ".local/lib/python3.10/site-packages/colossalai/booster/booster.py", line 138, in boost
    model, optimizer, criterion, dataloader, lr_scheduler = self.plugin.configure(
  File ".local/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 355, in configure
    model = HybridParallelModule(
  File ".local/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 70, in __init__
    module, self.shared_params = shardformer.optimize(module, policy=custom_policy)
  File ".local/lib/python3.10/site-packages/colossalai/shardformer/shard/shardformer.py", line 54, in optimize
    shared_params = sharder.shard()
  File ".local/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 43, in shard
    self._replace_module(include=held_layers)
  File ".local/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 67, in _replace_module
    self._recursive_replace_layer(
  File ".local/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
    self._recursive_replace_layer(
  File ".local/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
    self._recursive_replace_layer(
  File ".local/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
    self._recursive_replace_layer(
  File ".local/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 112, in _recursive_replace_layer
    self._replace_sub_module(module, sub_module_replacement, include)
  File ".local/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 201, in _replace_sub_module
    raise RuntimeError(
RuntimeError: Failed to replace block_sparse_moe of type MixtralSparseMoeBlock with EPMixtralSparseMoeBlock with the exception: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 1 has a total capacty of 79.32 GiB of which 43.56 MiB is free. Process 3466292 has 79.28 GiB memory in use. Of the allocated memory 78.10 GiB is allocated by PyTorch, and 127.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF. Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.
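
The error message itself suggests tuning `PYTORCH_CUDA_ALLOC_CONF`. For completeness, that would be set before the first CUDA allocation roughly as below, though I doubt fragmentation is the real issue here, since ~78 GiB is already allocated by the time the MoE block is replaced:

```python
# Allocator hint from the error message (a mitigation sketch only). It must be
# set before the first CUDA allocation, e.g. at the very top of the training
# script or in the shell environment of every rank.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```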

Could someone help? Thanks a lot. cc @flybird11111 @ver217

Environment

  • ColossalAI main branch (0.3.6)
  • PyTorch 2.1.2
  • CUDA 11.8
  • GPUs: 8 x 8 A800 (80 GB)
  • Model: mistralai/Mixtral-8x7B-v0.1
  • tp_size = 1
  • ep_size = 8
  • pp_size = 1
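
With these settings, a rough back-of-envelope estimate (mine, assuming bf16 weights and the commonly cited ~46.7B total parameters for Mixtral-8x7B) says that one full, unsharded copy of the model does not fit on a single 80 GB card:

```python
# Back-of-envelope estimate (my own numbers, not measured): if the full
# Mixtral-8x7B checkpoint is materialized in bf16 on one GPU before the experts
# are sharded, the weights alone already exceed the 79.32 GiB total of GPU 1.
total_params = 46.7e9    # commonly cited total parameter count for Mixtral-8x7B
bytes_per_param = 2      # bf16
print(f"unsharded bf16 weights: ~{total_params * bytes_per_param / 1024**3:.0f} GiB")
# -> ~87 GiB, which would explain the ~78 GiB already allocated when the OOM hits
```

If that estimate is in the right ballpark, it looks like the full model is being materialized on a single device before the experts are actually sharded across ranks.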

ericxsun · Mar 14, 2024, 14:03