Fine-tuning the model on Kaggle fails with ValueError: Type fp16 is not supported.

Open mohamed-em2m opened this issue 1 year ago • 0 comments

Is there an existing issue / discussion for this?

[X] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

[X] I have searched FAQ

Current Behavior

max_steps is given, it will override any value given in num_train_epochs
Traceback (most recent call last):
  File "/kaggle/working/MiniCPM-V/finetune/finetune.py", line 328, in <module>
    train()
  File "/kaggle/working/MiniCPM-V/finetune/finetune.py", line 318, in train
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2045, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1291, in prepare
    result = self._prepare_deepspeed(*args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1758, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 240, in __init__
    self._do_sanity_check()
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1032, in _do_sanity_check
    raise ValueError("Type fp16 is not supported.")
ValueError: Type fp16 is not supported.
[2024-06-19 00:17:44,932] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 244) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2024-06-19_00:17:44
  host      : d90c1cf96f39
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 244)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
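
The error is raised from DeepSpeed's `_do_sanity_check`, so it may help to record which GPU the Kaggle session actually provided. The report does not say which accelerator was in use, and whether the installed DeepSpeed release accepts fp16 on a given device is an assumption to verify, not something confirmed here. A minimal diagnostic sketch follows. The helper name `describe_cuda_devices` is hypothetical, and only standard `torch.cuda` queries are used:

```python
def describe_cuda_devices():
    """Return one dict per visible CUDA device.

    Returns an empty list when PyTorch is missing or no CUDA device is visible.
    """
    try:
        import torch  # imported lazily so the script still runs without PyTorch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    devices = []
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        devices.append(
            {
                "index": index,
                "name": torch.cuda.get_device_name(index),
                "compute_capability": f"{major}.{minor}",
            }
        )
    return devices


if __name__ == "__main__":
    found = describe_cuda_devices()
    if not found:
        print("No CUDA device visible (or PyTorch is not installed).")
    for device in found:
        print(f"GPU {device['index']}: {device['name']} "
              f"(compute capability {device['compute_capability']})")
```

Including this output in the report, together with the precision settings passed to the fine-tuning script, would make the failure easier to reproduce.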

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
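
The environment fields above were left blank. As a rough sketch, they could be gathered in one step with a script like the one below. The helper name `collect_environment` is hypothetical, and the script reports "not installed" instead of failing when a package cannot be imported:

```python
import importlib
import platform


def collect_environment():
    """Collect the values requested by the issue template's Environment list."""
    env = {
        "OS": platform.platform(),
        "Python": platform.python_version(),
    }
    # Look up package versions without failing if a package is missing or broken.
    for label, module_name in (("Transformers", "transformers"), ("PyTorch", "torch")):
        try:
            module = importlib.import_module(module_name)
            env[label] = getattr(module, "__version__", "unknown")
        except Exception:
            env[label] = "not installed"
    # Same query as the template's `torch.version.cuda` one-liner.
    try:
        import torch
        env["CUDA"] = str(torch.version.cuda)
    except Exception:
        env["CUDA"] = "not installed"
    return env


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"- {key}: {value}")
```

The printed lines match the list format of the template and can be pasted in directly.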

Anything else?

No response

Jun 19 '24 00:06 mohamed-em2m