[BUG] Gradient reduction error when running the training script with the latest released model (2.6)
是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
- [X] 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
- [X] 我已经搜索过FAQ | I have searched FAQ
当前行为 | Current Behavior
Excellent work! I am attempting to run a training script using ZeRO Stage 2/3 optimization. However, I have encountered the following error:
Traceback (most recent call last):
  File "miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 939, in reduce_independent_p_g_buckets_and_remove_grads
    assert self.params_already_reduced[param_id] == False,
AssertionError: The parameter 657 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported
I'm using the conda environment config provided by the docs. Is there any fix for this issue? Thanks in advance!
期望行为 | Expected Behavior
Successfully launch the training script with ZeRO 2/3.
复现方法 | Steps To Reproduce
No response
运行环境 | Environment
- OS: Ubuntu 20
- Python: 3.10
- Transformers: as required by doc
- PyTorch: as required by doc
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1
备注 | Anything else?
No response
I'm having the same issue on Python 3.10, CUDA 12.1, and PyTorch 2.3.1. If I train without ZeRO 2/3 the issue goes away, but that limits me to training only on A100 GPUs, which is not very convenient. By the way, this issue does not occur with the Llama 3-based 2.5 version even with the latest changes to the repo, so it's likely specific to 2.6.
I have the same issue.
I have the same issue when setting --per_device_train_batch_size greater than 1.
This can be addressed by setting use_reentrant=False:

training_args.gradient_checkpointing_kwargs = {
    "use_reentrant": False  # switch to non-reentrant activation checkpointing
}
trainer = CPMTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    **data_module,
)
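For reference, the same setting can also be passed when the training arguments are constructed, rather than mutated afterwards. A minimal sketch, assuming transformers >= 4.35 (where TrainingArguments accepts gradient_checkpointing_kwargs); the output directory, batch size, and inline DeepSpeed config below are placeholders, not taken from this thread:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",                    # placeholder output path
    per_device_train_batch_size=2,          # values > 1 triggered the error above
    gradient_checkpointing=True,
    # Non-reentrant activation checkpointing; in this thread it avoids the
    # "Gradient computed twice for this partition" assertion under ZeRO 2/3.
    gradient_checkpointing_kwargs={"use_reentrant": False},
    deepspeed={                             # minimal ZeRO stage 2 config (placeholder)
        "zero_optimization": {"stage": 2},
        "train_micro_batch_size_per_gpu": "auto",
    },
)

In recent transformers versions the same flag can also be applied at the model level via model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False}).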
Thanks, it works!