ChatGLM2-6B

[BUG/Help] P-Tuning fine-tuning error

Open DC-Lin opened this issue 2 years ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

    ValueError: None is not in list
    [2023-07-06 06:21:42,568] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 27275) of binary: /data/miniconda3/envs/nlp_tf2x/bin/python
    Traceback (most recent call last):
      File "/data/miniconda3/envs/nlp_tf2x/bin/torchrun", line 8, in <module>
        sys.exit(main())
      File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
        return f(*args, **kwargs)
      File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in main
        run(args)
      File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/run.py", line 788, in run
        elastic_launch(
      File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    main.py FAILED
    Failures:
      <NO_OTHER_FAILURES>
    Root Cause (first observed failure):
    [0]:
      time      : 2023-07-06_06:21:42
      host      : ubuntu
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 27275)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
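
For readers unfamiliar with the first line of the log: `ValueError: None is not in list` is the message Python's built-in `list.index` produces when the value being searched for is `None`. The log above does not show which line of `main.py` triggered it, so the following is only a minimal sketch of how such a message can arise. The token ids, the `bos_token_id` name, and the `find_token` helper are all hypothetical and are not taken from the issue.

```python
# Hypothetical token ids; none of these values come from the issue.
input_ids = [64790, 64792, 30910, 13]

# Assumption for illustration: a special-token id that was never set.
bos_token_id = None


def find_token(ids, token_id):
    """Return the position of token_id in ids, or None if it is absent."""
    try:
        return ids.index(token_id)
    except ValueError as err:
        # list.index raises ValueError when the value is not present.
        print(f"lookup failed: {err}")
        return None


position = find_token(input_ids, bos_token_id)
```

If the failure in the real script has the same shape, printing the value being looked up just before the failing call would show whether it is `None`.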

Expected Behavior

No response

Steps To Reproduce

1. Ubuntu

Environment

- OS: Ubuntu
- Python: 3.10
- Transformers: 4.27
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

DC-Lin, Jul 06 '23 10:07

Please provide the complete error message, and make sure it is formatted properly.

duzx16, Jul 07 '23 04:07

Try using the trainer from transformers in trainer_seq2seq.py: `from transformers.trainer import Trainer`

lilongxian, Jul 07 '23 09:07

> Try using the trainer from transformers in trainer_seq2seq.py: `from transformers.trainer import Trainer`

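Thank you. After making that change, I also had to change the base class of Seq2SeqTrainer to Trainer and comment out the save_changed argument. I'm not sure whether the save_changed argument affects how the model is saved.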

DC-Lin, Jul 10 '23 03:07
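
The last reply notes that, once Seq2SeqTrainer inherits from the stock Trainer, the save_changed argument has to be commented out. A likely reason is that the stock trainer's constructor does not declare that keyword, so passing it raises a `TypeError`. The sketch below illustrates this with stand-in classes only. `StockTrainer`, its simplified signature, and the `supported_kwargs` helper are hypothetical; it does not import transformers and is not the repository's code.

```python
import inspect


class StockTrainer:
    """Stand-in for the stock trainer (hypothetical, simplified signature)."""

    def __init__(self, model=None, args=None):
        self.model = model
        self.args = args


class Seq2SeqTrainer(StockTrainer):
    """Mirrors the change described in the thread: the base class is swapped."""


def supported_kwargs(cls, kwargs):
    """Keep only the keyword arguments that cls.__init__ declares."""
    params = inspect.signature(cls.__init__).parameters
    return {k: v for k, v in kwargs.items() if k in params}


# The call site still passes the extra keyword from the custom trainer.
call_kwargs = {"model": "model-placeholder", "args": None, "save_changed": True}

try:
    Seq2SeqTrainer(**call_kwargs)
except TypeError as err:
    print(f"unfiltered call fails: {err}")

# Dropping the unsupported keyword lets construction succeed.
trainer = Seq2SeqTrainer(**supported_kwargs(Seq2SeqTrainer, call_kwargs))
```

Dropping the keyword also drops whatever behavior the custom trainer attached to it. The thread does not answer whether that changes what gets saved, so it is worth inspecting the checkpoint written after this change.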