RuntimeError: value cannot be converted to type at::Half without overflow
Is there an existing issue / discussion for this?
- [X] I have searched the existing issues / discussions
Is there an existing answer for this in FAQ?
- [X] I have searched FAQ
Current Behavior
No response
Expected Behavior
No response
Steps To Reproduce
No response
Environment
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
Anything else?
```
/data/programs/python310/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Exception in thread Thread-9 (run_exp):
Traceback (most recent call last):
  File "/data/programs/python310/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/data/programs/python310/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/data/aigc/LLaMA/LLaMA-Factory-BAK/src/llmtuner/tuner/tune.py", line 26, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/data/aigc/LLaMA/LLaMA-Factory-BAK/src/llmtuner/tuner/sft/workflow.py", line 67, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/data/programs/python310/lib/python3.10/site-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/data/programs/python310/lib/python3.10/site-packages/transformers/trainer.py", line 1835, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/data/programs/python310/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in training_step
    loss = self.compute_loss(model, inputs)
  File "/data/programs/python310/lib/python3.10/site-packages/transformers/trainer.py", line 2704, in compute_loss
    outputs = model(**inputs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
    outputs = self.parallel_apply(replicas, inputs, module_kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 110, in parallel_apply
    output.reraise()
  File "/data/programs/python310/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.

Original Traceback (most recent call last):
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in _worker
    output = module(*input, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/peft/peft_model.py", line 918, in forward
    return self.base_model(
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward
    return self.model.forward(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-1_8B-Chat/modeling_qwen.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-1_8B-Chat/modeling_qwen.py", line 880, in forward
    outputs = torch.utils.checkpoint.checkpoint(
  File "/data/programs/python310/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 451, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/data/programs/python310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 230, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-1_8B-Chat/modeling_qwen.py", line 876, in custom_forward
    return module(*inputs, use_cache, output_attentions)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-1_8B-Chat/modeling_qwen.py", line 610, in forward
    attn_outputs = self.attn(
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-1_8B-Chat/modeling_qwen.py", line 525, in forward
    attention_mask = attention_mask.masked_fill(~causal_mask, torch.finfo(query.dtype).min)
RuntimeError: value cannot be converted to type at::Half without overflow
```
I changed line 525 of modeling_qwen.py, replacing `attention_mask.masked_fill(~causal_mask, torch.finfo(query.dtype).min)` with `attention_mask.masked_fill(~causal_mask, -1e4)`, and training then runs. `torch.finfo(query.dtype).min` is the most negative value representable in `query`'s dtype. The traceback shows the mask tensor is float16 (`at::Half`), so when `query` runs in a wider dtype such as float32 (whose minimum is about -3.4e38), the fill value cannot be represented in the mask and `masked_fill` raises the overflow error. A moderately large negative value such as -1e4 fits in float16 and avoids the overflow; -1e5 (= -100000, already outside float16's range) still raises the same error.
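A minimal standalone sketch of the failure mode and the patch (the tensor shapes here are invented for illustration and are not taken from the model):

```python
import torch

# A float16 attention mask, as in the traceback ("at::Half").
attention_mask = torch.zeros(1, 1, 4, 4, dtype=torch.float16)
causal_mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))

# If query runs in float32, finfo(...).min is about -3.4e38, which cannot
# be represented in float16, so masked_fill raises:
# "RuntimeError: value cannot be converted to type at::Half without overflow"
try:
    attention_mask.masked_fill(~causal_mask, torch.finfo(torch.float32).min)
except RuntimeError as e:
    print(e)

# The patched version fills with a value that fits in float16:
patched = attention_mask.masked_fill(~causal_mask, -1e4)
print(patched.dtype, patched.min().item())  # torch.float16 -10000.0
```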
Testing shows that -65504.0 is the most negative fill value that avoids the error; anything lower, e.g. -65505.0, raises it again. -65504.0 is exactly the lowest finite value of half-precision float16.
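This boundary matches PyTorch's reported float16 limits and can be checked with a standalone snippet:

```python
import torch

# The lowest finite float16 value is exactly the boundary found above.
print(torch.finfo(torch.float16).min)  # -65504.0

x = torch.zeros(3, dtype=torch.float16)
mask = torch.tensor([True, False, True])

print(x.masked_fill(mask, -65504.0))  # fine: fills with float16's minimum
x.masked_fill(mask, -65505.0)         # RuntimeError: ... at::Half without overflow
```

PyTorch rejects any scalar outside float16's finite range when filling a float16 tensor, which is why the cutoff sits exactly at -65504.0.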
In my case it was a GPU precision issue: I edited the configuration in config.json to enable bf16 (`"bf16": true`) and it worked, likely because the mask and query then share the bfloat16 dtype, so `torch.finfo(query.dtype).min` is representable in the mask tensor.
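For reference, the same switch can be made at load time instead of editing config.json by hand; Qwen's README documents a `bf16=True` argument to `from_pretrained`. A sketch (the Hub ID below stands in for whatever checkpoint you are using, and bfloat16 requires an Ampere-or-newer GPU):

```python
from transformers import AutoModelForCausalLM

# bf16=True is a Qwen-specific config flag (see the Qwen README); it makes
# the custom modeling code load and run the weights in bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat",   # or a local checkpoint directory
    trust_remote_code=True,  # required: Qwen ships its own modeling_qwen.py
    bf16=True,
    device_map="auto",
)
print(next(model.parameters()).dtype)  # torch.bfloat16
```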