[BUG] Gradient reduction error when running the training script with the latest released model (2.6)
是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
- [X] 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
- [X] 我已经搜索过FAQ | I have searched FAQ
当前行为 | Current Behavior
Excellent work! I am attempting to run a training script using ZeRO Stage 2/3 optimization. However, I have encountered the following error:
Traceback (most recent call last):
  File "miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 939, in reduce_independent_p_g_buckets_and_remove_grads
    assert self.params_already_reduced[param_id] == False,
AssertionError: The parameter 657 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported
I'm using the conda environment config provided by the docs. Is there any fix for this issue? Thanks in advance!
期望行为 | Expected Behavior
Successfully launch the training script with ZeRO 2/3.
复现方法 | Steps To Reproduce
No response
运行环境 | Environment
- OS: Ubuntu 20
- Python: 3.10
- Transformers: as required by doc
- PyTorch: as required by doc
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1
备注 | Anything else?
No response
I'm having the same issue on Python 3.10, CUDA 12.1, and PyTorch 2.3.1. If I train without ZeRO 2/3 the issue goes away, but that limits me to training only on A100 GPUs, which is not very convenient. By the way, this issue does not occur with the Llama 3-based 2.5 version even with the latest changes to the repo, so it's likely specific to 2.6.
I have the same issue.
I have the same issue when setting --per_device_train_batch_size greater than 1.
This can be addressed by setting use_reentrant=False:

training_args.gradient_checkpointing_kwargs = {
    "use_reentrant": False  # switch to non-reentrant activation checkpointing
}
trainer = CPMTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    **data_module,
)
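For reference, the same setting can also be passed when the training arguments are constructed, rather than mutated afterwards. A minimal sketch, assuming transformers >= 4.35 (where TrainingArguments accepts gradient_checkpointing_kwargs); the output directory, batch size, and inline DeepSpeed config below are placeholders, not taken from this thread:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",                    # placeholder output path
    per_device_train_batch_size=2,          # values > 1 triggered the error above
    gradient_checkpointing=True,
    # Non-reentrant activation checkpointing; in this thread it avoids the
    # "Gradient computed twice for this partition" assertion under ZeRO 2/3.
    gradient_checkpointing_kwargs={"use_reentrant": False},
    deepspeed={                             # minimal ZeRO stage 2 config (placeholder)
        "zero_optimization": {"stage": 2},
        "train_micro_batch_size_per_gpu": "auto",
    },
)

In recent transformers versions the same flag can also be applied at the model level via model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False}).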
Thanks, it works!