ColossalAI
[BUG]: ZERO DDP error: the synchronization of gradients doesn't exit properly when training Mixtral-8x7B-v0.1 with GeminiPlugin or HybridParallelPlugin
🐛 Describe the bug
When continuing to pretrain Mixtral-8x7B-v0.1 directly with GeminiPlugin or HybridParallelPlugin, I got the following RuntimeError:
ZERO DDP error: the synchronization of gradients doesn't exit properly. The most possible reason is that the model is not compatible with GeminiDDP. Reduction failed at followed parameters: model.layers.22.block_sparse_moe.experts.2.w1.weight, ... (full list in the stack trace below)
The full stacktrace:
File "pretrain.py", line 223, in main
booster.backward(loss, optimizer)
File ".local/lib/python3.10/site-packages/colossalai/booster/booster.py", line 167, in backward
optimizer.backward(loss)
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_optimizer.py", line 291, in backward
self.module.backward(loss)
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 331, in backward
self._post_backward()
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 314, in _post_backward
raise RuntimeError(
RuntimeError: (
"ZERO DDP error: the synchronization of gradients doesn't exit properly.",
'The most possible reason is that the model is not compatible with GeminiDDP.
', 'Reduction failed at followed parameters:
model.layers.22.block_sparse_moe.experts.2.w1.weight
model.layers.22.block_sparse_moe.experts.2.w2.weight
model.layers.22.block_sparse_moe.experts.2.w3.weight
model.layers.22.block_sparse_moe.experts.3.w1.weight
model.layers.22.block_sparse_moe.experts.3.w2.weight
model.layers.22.block_sparse_moe.experts.3.w3.weight
model.layers.22.block_sparse_moe.experts.4.w1.weight
model.layers.22.block_sparse_moe.experts.4.w2.weight
model.layers.22.block_sparse_moe.experts.4.w3.weight
model.layers.22.block_sparse_moe.experts.5.w1.weight
model.layers.22.block_sparse_moe.experts.5.w2.weight
model.layers.22.block_sparse_moe.experts.5.w3.weight
model.layers.22.block_sparse_moe.experts.6.w1.weight
model.layers.22.block_sparse_moe.experts.6.w2.weight
model.layers.22.block_sparse_moe.experts.6.w3.weight
model.layers.22.block_sparse_moe.experts.7.w1.weight
model.layers.22.block_sparse_moe.experts.7.w2.weight
model.layers.22.block_sparse_moe.experts.7.w3.weight
model.layers.23.block_sparse_moe.experts.2.w1.weight
model.layers.23.block_sparse_moe.experts.2.w2.weight
model.layers.23.block_sparse_moe.experts.2.w3.weight
model.layers.23.block_sparse_moe.experts.3.w1.weight
model.layers.23.block_sparse_moe.experts.3.w2.weight
model.layers.23.block_sparse_moe.experts.3.w3.weight
model.layers.23.block_sparse_moe.experts.4.w1.weight
model.layers.23.block_sparse_moe.experts.4.w2.weight
model.layers.23.block_sparse_moe.experts.4.w3.weight
model.layers.23.block_sparse_moe.experts.5.w1.weight
model.layers.23.block_sparse_moe.experts.5.w2.weight
model.layers.23.block_sparse_moe.experts.5.w3.weight
model.layers.23.block_sparse_moe.experts.6.w1.weight
model.layers.23.block_sparse_moe.experts.6.w2.weight
model.layers.23.block_sparse_moe.experts.6.w3.weight
model.layers.23.block_sparse_moe.experts.7.w1.weight
model.layers.23.block_sparse_moe.experts.7.w2.weight
model.layers.23.block_sparse_moe.experts.7.w3.weight
model.layers.24.block_sparse_moe.experts.6.w1.weight
model.layers.24.block_sparse_moe.experts.6.w2.weight
model.layers.24.block_sparse_moe.experts.6.w3.weight
model.layers.29.block_sparse_moe.experts.2.w1.weight
model.layers.29.block_sparse_moe.experts.2.w2.weight
model.layers.29.block_sparse_moe.experts.2.w3.weight
model.layers.29.block_sparse_moe.experts.3.w1.weight
model.layers.29.block_sparse_moe.experts.3.w2.weight
model.layers.29.block_sparse_moe.experts.3.w3.weight
model.layers.29.block_sparse_moe.experts.4.w1.weight
model.layers.29.block_sparse_moe.experts.4.w2.weight
model.layers.29.block_sparse_moe.experts.4.w3.weight
model.layers.29.block_sparse_moe.experts.5.w1.weight
model.layers.29.block_sparse_moe.experts.5.w2.weight
model.layers.29.block_sparse_moe.experts.5.w3.weight
model.layers.29.block_sparse_moe.experts.6.w1.weight
model.layers.29.block_sparse_moe.experts.6.w2.weight
model.layers.29.block_sparse_moe.experts.6.w3.weight
model.layers.29.block_sparse_moe.experts.7.w1.weight
model.layers.29.block_sparse_moe.experts.7.w2.weight
model.layers.29.block_sparse_moe.experts.7.w3.weight
model.layers.30.block_sparse_moe.experts.2.w1.weight
model.layers.30.block_sparse_moe.experts.2.w2.weight
model.layers.30.block_sparse_moe.experts.2.w3.weight
model.layers.30.block_sparse_moe.experts.3.w1.weight
model.layers.30.block_sparse_moe.experts.3.w2.weight
model.layers.30.block_sparse_moe.experts.3.w3.weight
model.layers.30.block_sparse_moe.experts.4.w1.weight
model.layers.30.block_sparse_moe.experts.4.w2.weight
model.layers.30.block_sparse_moe.experts.4.w3.weight
model.layers.30.block_sparse_moe.experts.5.w1.weight
model.layers.30.block_sparse_moe.experts.5.w2.weight
model.layers.30.block_sparse_moe.experts.5.w3.weight
model.layers.30.block_sparse_moe.experts.6.w1.weight
model.layers.30.block_sparse_moe.experts.6.w2.weight
model.layers.30.block_sparse_moe.experts.6.w3.weight
model.layers.30.block_sparse_moe.experts.7.w1.weight
model.layers.30.block_sparse_moe.experts.7.w2.weight
model.layers.30.block_sparse_moe.experts.7.w3.weight
model.layers.31.block_sparse_moe.experts.2.w1.weight
model.layers.31.block_sparse_moe.experts.2.w2.weight
model.layers.31.block_sparse_moe.experts.2.w3.weight
model.layers.31.block_sparse_moe.experts.3.w1.weight
model.layers.31.block_sparse_moe.experts.3.w2.weight
model.layers.31.block_sparse_moe.experts.3.w3.weight
model.layers.31.block_sparse_moe.experts.4.w1.weight
model.layers.31.block_sparse_moe.experts.4.w2.weight
model.layers.31.block_sparse_moe.experts.4.w3.weight
model.layers.31.block_sparse_moe.experts.5.w1.weight
model.layers.31.block_sparse_moe.experts.5.w2.weight
model.layers.31.block_sparse_moe.experts.5.w3.weight
model.layers.31.block_sparse_moe.experts.6.w1.weight
model.layers.31.block_sparse_moe.experts.6.w2.weight
model.layers.31.block_sparse_moe.experts.6.w3.weight
model.layers.31.block_sparse_moe.experts.7.w1.weight
model.layers.31.block_sparse_moe.experts.7.w2.weight
model.layers.31.block_sparse_moe.experts.7.w3.weight
')
Environment
- Mixtral: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
- PyTorch: 2.0.1
- ColossalAI: main branch
- Hugging Face Transformers: 4.36.0
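For context, a minimal sketch of the call chain the traceback above goes through. This is an assumption about the setup, since pretrain.py itself is not shown; the model, optimizer, and tokenizer choices here are placeholders, and the GeminiPlugin arguments mirror the ones used later in this thread:
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam
from transformers import AutoModelForCausalLM, AutoTokenizer

# Distributed init (torchrun-style launch assumed); config={} matches the
# ColossalAI version referenced in this thread.
colossalai.launch_from_torch(config={})

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16)
optimizer = HybridAdam(model.parameters(), lr=1e-5)

plugin = GeminiPlugin(precision="bf16", initial_scale=2 ** 16, max_norm=1.0)
booster = Booster(plugin=plugin)
# boost() returns (model, optimizer, criterion, dataloader, lr_scheduler).
model, optimizer, *_ = booster.boost(model, optimizer)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
batch = tokenizer("hello world", return_tensors="pt").to(torch.cuda.current_device())
loss = model(**batch, labels=batch["input_ids"]).loss
# The RuntimeError above is raised inside this call, in GeminiDDP._post_backward().
booster.backward(loss, optimizer)
optimizer.step()
optimizer.zero_grad()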
Any update on this?
Same problem here.
Hi, I'd like to know how to set up this MoE structure to pretrain my model; I haven't installed anything yet. Should I follow the steps in https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/openmoe, and is that enough?
Yes. The OpenMoE code has been updated; you can follow the main branch: https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalMoE/train.py
Hi, you can follow the latest code in the main branch.
Yes, following the steps in the openmoe example is enough.
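As a quick sanity check after that setup (my suggestion, not part of the example itself), the pieces used later in this thread should import cleanly:
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

# Print the installed version to confirm the environment matches the main branch.
print("colossalai", colossalai.__version__)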
In our environment, the GeminiPlugin can run training for Mixtral. Could you provide more detailed script information?
Using applications/ColossalMoE from the main branch, I added:
elif args.plugin == "gemini":
    plugin = GeminiPlugin(
        precision=args.precision,
        initial_scale=2 ** 16,
        max_norm=1.0,
        tp_size=1,
        extra_dp_size=1,
    )
I still encountered the error:
Finish init booster
Start finetuning
Epoch [1/1]: 0%| | 0/25 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
Epoch [1/1]:  84%|█████████▎| 21/25 [00:13<00:02, 1.67it/s, loss=9.32]
Traceback (most recent call last):
File "ColossalAI/applications/ColossalMoE/train.py", line 331, in <module>
main()
File "ColossalAI/applications/ColossalMoE/train.py", line 292, in main
booster.backward(loss, optimizer)
File ".local/lib/python3.10/site-packages/colossalai/booster/booster.py", line 167, in backward
optimizer.backward(loss)
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_optimizer.py", line 292, in backward
self.module.backward(loss)
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 332, in backward
self._post_backward()
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 315, in _post_backward
raise RuntimeError(
RuntimeError: ("ZERO DDP error: the synchronization of gradients doesn't exit properly.", 'The most possible reason is that the model is not compatible with GeminiDDP.\n', 'Reduction failed at followed parameters:\n\tmodel.layers.8.block_sparse_moe.experts.0.w1.weight\n\tmodel.layers.8.block_sparse_moe.experts.0.w2.weight\n\tmodel.layers.8.block_sparse_moe.experts.0.w3.weight')
[2024-03-03 16:06:40,895] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 6872 closing signal SIGTERM
[2024-03-03 16:06:41,209] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 6873) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File ".local/bin/torchrun", line 8, in <module>
sys.exit(main())
File ".local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File ".local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File ".local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File ".local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File ".local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
@flybird11111 could you take a look? Thanks a lot.
With MoeHybridParallelPlugin I ran into another problem (https://github.com/hpcaitech/ColossalAI/issues/5426):
grad = grad.to(master_moe_param.dtype).to(master_moe_param.device)
AttributeError: 'NoneType' object has no attribute 'to'
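One plausible reading of that error (my assumption, not confirmed in the thread): experts that receive no routed tokens in a step keep param.grad = None, so the dtype/device cast fails. A minimal defensive sketch, where cast_moe_grad is a hypothetical helper and not the library's actual fix:
import torch

def cast_moe_grad(master_moe_param: torch.nn.Parameter) -> torch.Tensor:
    # Guard against expert parameters whose gradient is still None this step.
    grad = master_moe_param.grad
    if grad is None:
        # Substitute a zero gradient so the cast and the optimizer step stay well-defined.
        grad = torch.zeros_like(master_moe_param)
    return grad.to(master_moe_param.dtype).to(master_moe_param.device)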
ok, I'll look into it.
Without EPMixtralSparseMoeBlock in mixtral_policy, I hit another problem: training hangs at the following point:
Thanks a lot. Did you find any clues? I can help with a fix or testing. @flybird11111
GeminiPlugin can run correctly now.