ColossalAI
[BUG]: ZERO DDP error: the synchronization of gradients doesn't exit properly when training Mixtral-8x7B-v0.1 with GeminiPlugin or HybridParallelPlugin
🐛 Describe the bug
When continuing to pretrain Mixtral-8x7B-v0.1 directly with GeminiPlugin or HybridParallelPlugin, I got the following RuntimeError:
ZERO DDP error: the synchronization of gradients doesn't exit properly. The most possible reason is that the model is not compatible with GeminiDDP. Reduction failed at followed parameters: model.layers.22.block_sparse_moe.experts.2.w1.weight, ... (full list in the stack trace below)
The full stacktrace:
File "pretrain.py", line 223, in main
booster.backward(loss, optimizer)
File ".local/lib/python3.10/site-packages/colossalai/booster/booster.py", line 167, in backward
optimizer.backward(loss)
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_optimizer.py", line 291, in backward
self.module.backward(loss)
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 331, in backward
self._post_backward()
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 314, in _post_backward
raise RuntimeError(
RuntimeError: (
"ZERO DDP error: the synchronization of gradients doesn't exit properly.",
'The most possible reason is that the model is not compatible with GeminiDDP.
', 'Reduction failed at followed parameters:
model.layers.22.block_sparse_moe.experts.2.w1.weight
model.layers.22.block_sparse_moe.experts.2.w2.weight
model.layers.22.block_sparse_moe.experts.2.w3.weight
model.layers.22.block_sparse_moe.experts.3.w1.weight
model.layers.22.block_sparse_moe.experts.3.w2.weight
model.layers.22.block_sparse_moe.experts.3.w3.weight
model.layers.22.block_sparse_moe.experts.4.w1.weight
model.layers.22.block_sparse_moe.experts.4.w2.weight
model.layers.22.block_sparse_moe.experts.4.w3.weight
model.layers.22.block_sparse_moe.experts.5.w1.weight
model.layers.22.block_sparse_moe.experts.5.w2.weight
model.layers.22.block_sparse_moe.experts.5.w3.weight
model.layers.22.block_sparse_moe.experts.6.w1.weight
model.layers.22.block_sparse_moe.experts.6.w2.weight
model.layers.22.block_sparse_moe.experts.6.w3.weight
model.layers.22.block_sparse_moe.experts.7.w1.weight
model.layers.22.block_sparse_moe.experts.7.w2.weight
model.layers.22.block_sparse_moe.experts.7.w3.weight
model.layers.23.block_sparse_moe.experts.2.w1.weight
model.layers.23.block_sparse_moe.experts.2.w2.weight
model.layers.23.block_sparse_moe.experts.2.w3.weight
model.layers.23.block_sparse_moe.experts.3.w1.weight
model.layers.23.block_sparse_moe.experts.3.w2.weight
model.layers.23.block_sparse_moe.experts.3.w3.weight
model.layers.23.block_sparse_moe.experts.4.w1.weight
model.layers.23.block_sparse_moe.experts.4.w2.weight
model.layers.23.block_sparse_moe.experts.4.w3.weight
model.layers.23.block_sparse_moe.experts.5.w1.weight
model.layers.23.block_sparse_moe.experts.5.w2.weight
model.layers.23.block_sparse_moe.experts.5.w3.weight
model.layers.23.block_sparse_moe.experts.6.w1.weight
model.layers.23.block_sparse_moe.experts.6.w2.weight
model.layers.23.block_sparse_moe.experts.6.w3.weight
model.layers.23.block_sparse_moe.experts.7.w1.weight
model.layers.23.block_sparse_moe.experts.7.w2.weight
model.layers.23.block_sparse_moe.experts.7.w3.weight
model.layers.24.block_sparse_moe.experts.6.w1.weight
model.layers.24.block_sparse_moe.experts.6.w2.weight
model.layers.24.block_sparse_moe.experts.6.w3.weight
model.layers.29.block_sparse_moe.experts.2.w1.weight
model.layers.29.block_sparse_moe.experts.2.w2.weight
model.layers.29.block_sparse_moe.experts.2.w3.weight
model.layers.29.block_sparse_moe.experts.3.w1.weight
model.layers.29.block_sparse_moe.experts.3.w2.weight
model.layers.29.block_sparse_moe.experts.3.w3.weight
model.layers.29.block_sparse_moe.experts.4.w1.weight
model.layers.29.block_sparse_moe.experts.4.w2.weight
model.layers.29.block_sparse_moe.experts.4.w3.weight
model.layers.29.block_sparse_moe.experts.5.w1.weight
model.layers.29.block_sparse_moe.experts.5.w2.weight
model.layers.29.block_sparse_moe.experts.5.w3.weight
model.layers.29.block_sparse_moe.experts.6.w1.weight
model.layers.29.block_sparse_moe.experts.6.w2.weight
model.layers.29.block_sparse_moe.experts.6.w3.weight
model.layers.29.block_sparse_moe.experts.7.w1.weight
model.layers.29.block_sparse_moe.experts.7.w2.weight
model.layers.29.block_sparse_moe.experts.7.w3.weight
model.layers.30.block_sparse_moe.experts.2.w1.weight
model.layers.30.block_sparse_moe.experts.2.w2.weight
model.layers.30.block_sparse_moe.experts.2.w3.weight
model.layers.30.block_sparse_moe.experts.3.w1.weight
model.layers.30.block_sparse_moe.experts.3.w2.weight
model.layers.30.block_sparse_moe.experts.3.w3.weight
model.layers.30.block_sparse_moe.experts.4.w1.weight
model.layers.30.block_sparse_moe.experts.4.w2.weight
model.layers.30.block_sparse_moe.experts.4.w3.weight
model.layers.30.block_sparse_moe.experts.5.w1.weight
model.layers.30.block_sparse_moe.experts.5.w2.weight
model.layers.30.block_sparse_moe.experts.5.w3.weight
model.layers.30.block_sparse_moe.experts.6.w1.weight
model.layers.30.block_sparse_moe.experts.6.w2.weight
model.layers.30.block_sparse_moe.experts.6.w3.weight
model.layers.30.block_sparse_moe.experts.7.w1.weight
model.layers.30.block_sparse_moe.experts.7.w2.weight
model.layers.30.block_sparse_moe.experts.7.w3.weight
model.layers.31.block_sparse_moe.experts.2.w1.weight
model.layers.31.block_sparse_moe.experts.2.w2.weight
model.layers.31.block_sparse_moe.experts.2.w3.weight
model.layers.31.block_sparse_moe.experts.3.w1.weight
model.layers.31.block_sparse_moe.experts.3.w2.weight
model.layers.31.block_sparse_moe.experts.3.w3.weight
model.layers.31.block_sparse_moe.experts.4.w1.weight
model.layers.31.block_sparse_moe.experts.4.w2.weight
model.layers.31.block_sparse_moe.experts.4.w3.weight
model.layers.31.block_sparse_moe.experts.5.w1.weight
model.layers.31.block_sparse_moe.experts.5.w2.weight
model.layers.31.block_sparse_moe.experts.5.w3.weight
model.layers.31.block_sparse_moe.experts.6.w1.weight
model.layers.31.block_sparse_moe.experts.6.w2.weight
model.layers.31.block_sparse_moe.experts.6.w3.weight
model.layers.31.block_sparse_moe.experts.7.w1.weight
model.layers.31.block_sparse_moe.experts.7.w2.weight
model.layers.31.block_sparse_moe.experts.7.w3.weight
')
Environment
- Mixtral: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
- PyTorch: 2.0.1
- ColossalAI: main branch
- Hugging Face Transformers: 4.36.0
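For context, a minimal sketch of the call chain the traceback above goes through. This is an assumption about the setup, since pretrain.py itself is not shown; the model, optimizer, and tokenizer choices here are placeholders, and the GeminiPlugin arguments mirror the ones used later in this thread:
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam
from transformers import AutoModelForCausalLM, AutoTokenizer

# Distributed init (torchrun-style launch assumed); config={} matches the
# ColossalAI version referenced in this thread.
colossalai.launch_from_torch(config={})

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16)
optimizer = HybridAdam(model.parameters(), lr=1e-5)

plugin = GeminiPlugin(precision="bf16", initial_scale=2 ** 16, max_norm=1.0)
booster = Booster(plugin=plugin)
# boost() returns (model, optimizer, criterion, dataloader, lr_scheduler).
model, optimizer, *_ = booster.boost(model, optimizer)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
batch = tokenizer("hello world", return_tensors="pt").to(torch.cuda.current_device())
loss = model(**batch, labels=batch["input_ids"]).loss
# The RuntimeError above is raised inside this call, in GeminiDDP._post_backward().
booster.backward(loss, optimizer)
optimizer.step()
optimizer.zero_grad()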
Any update on this?
Same problem here.
Hi, I'd like to know how to set up this MoE structure to pretrain my model; I haven't installed anything yet. Should I follow the steps in https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/openmoe, and is that enough?
Yes. The OpenMoE code has been updated; you can follow the main branch: https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalMoE/train.py
Hi, you can follow the latest code in the main branch.
Yes, following the steps in the openmoe example is enough.
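As a quick sanity check after that setup (my suggestion, not part of the example itself), the pieces used later in this thread should import cleanly:
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

# Print the installed version to confirm the environment matches the main branch.
print("colossalai", colossalai.__version__)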
In our environment, the GeminiPlugin can run training for Mixtral. Could you provide more detailed script information?
Using applications/ColossalMoE from the main branch, I added:
elif args.plugin == "gemini":
    plugin = GeminiPlugin(
        precision=args.precision,
        initial_scale=2 ** 16,
        max_norm=1.0,
        tp_size=1,
        extra_dp_size=1,
    )
I still encountered the error:
Finish init booster
Start finetuning
Epoch [1/1]: 0%| | 0/25 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
Epoch [1/1]:  84%|█████████▎| 21/25 [00:13<00:02, 1.67it/s, loss=9.32]
Traceback (most recent call last):
File "ColossalAI/applications/ColossalMoE/train.py", line 331, in <module>
main()
File "ColossalAI/applications/ColossalMoE/train.py", line 292, in main
booster.backward(loss, optimizer)
File ".local/lib/python3.10/site-packages/colossalai/booster/booster.py", line 167, in backward
optimizer.backward(loss)
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_optimizer.py", line 292, in backward
self.module.backward(loss)
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 332, in backward
self._post_backward()
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 315, in _post_backward
raise RuntimeError(
RuntimeError: ("ZERO DDP error: the synchronization of gradients doesn't exit properly.", 'The most possible reason is that the model is not compatible with GeminiDDP.\n', 'Reduction failed at followed parameters:\n\tmodel.layers.8.block_sparse_moe.experts.0.w1.weight\n\tmodel.layers.8.block_sparse_moe.experts.0.w2.weight\n\tmodel.layers.8.block_sparse_moe.experts.0.w3.weight')
[2024-03-03 16:06:40,895] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 6872 closing signal SIGTERM
[2024-03-03 16:06:41,209] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 6873) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File ".local/bin/torchrun", line 8, in <module>
sys.exit(main())
File ".local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File ".local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File ".local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File ".local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File ".local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
@flybird11111 could you take a look? Thanks a lot.
With MoeHybridParallelPlugin I ran into another problem (https://github.com/hpcaitech/ColossalAI/issues/5426):
grad = grad.to(master_moe_param.dtype).to(master_moe_param.device)
AttributeError: 'NoneType' object has no attribute 'to'
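One plausible reading of that error (my assumption, not confirmed in the thread): experts that receive no routed tokens in a step keep param.grad = None, so the dtype/device cast fails. A minimal defensive sketch, where cast_moe_grad is a hypothetical helper and not the library's actual fix:
import torch

def cast_moe_grad(master_moe_param: torch.nn.Parameter) -> torch.Tensor:
    # Guard against expert parameters whose gradient is still None this step.
    grad = master_moe_param.grad
    if grad is None:
        # Substitute a zero gradient so the cast and the optimizer step stay well-defined.
        grad = torch.zeros_like(master_moe_param)
    return grad.to(master_moe_param.dtype).to(master_moe_param.device)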
ok, I'll look into it.
Without EPMixtralSparseMoeBlock in mixtral_policy, I hit another problem: training hangs at the following point:
Thanks a lot. Did you find any clues? I can help with a fix or testing. @flybird11111
GeminiPlugin can run correctly now.