ChatGLM2-6B
[BUG/Help] ptuning fine-tuning error
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
ValueError: None is not in list
[2023-07-06 06:21:42,568] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 27275) of binary: /data/miniconda3/envs/nlp_tf2x/bin/python
Traceback (most recent call last):
  File "/data/miniconda3/envs/nlp_tf2x/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
main.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2023-07-06_06:21:42
  host       : ubuntu
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 27275)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
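The torchrun summary above only reports the exit code; following the linked elastic error docs, wrapping the training entrypoint with the `record` decorator makes the child worker write its full traceback (including the `ValueError` above) to the error file. A minimal sketch, assuming the entrypoint is `main()` in ptuning/main.py:

```python
# Minimal sketch: decorate the training entrypoint so torchrun records the
# child process's full traceback instead of only the exit-code summary.
# Assumes the entrypoint is main() in ptuning/main.py.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # existing training logic from ptuning/main.py goes here

if __name__ == "__main__":
    main()
```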
Expected Behavior
No response
Steps To Reproduce
1. ubuntu
Environment
- OS: Ubuntu
- Python: 3.10
- Transformers: 4.27
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :
Anything else?
No response
Please provide the complete error message and make sure it is properly formatted.
Try using the Trainer from transformers in trainer_seq2seq.py: from transformers.trainer import Trainer
Thank you! After making that change, I also had to switch Seq2SeqTrainer's base class to Trainer and comment out the save_changed parameter. I'm not sure whether the save_changed parameter affects how the model is saved.
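A rough sketch of the change described above; the exact contents of ptuning/trainer_seq2seq.py and main.py may differ from this, and the save_changed keyword (and where it is passed) is taken from this thread rather than verified against the repo:

```python
# Sketch of a modified ptuning/trainer_seq2seq.py, assuming the rest of the
# file stays unchanged. The original class inherits from the repo's patched
# Trainer (from trainer import Trainer); here it inherits from the stock one.
from transformers.trainer import Trainer


class Seq2SeqTrainer(Trainer):
    """Same body as the repo's Seq2SeqTrainer, only the base class changes."""
    # ... keep the existing overrides (e.g. evaluate / prediction_step) here ...
    pass


# In ptuning/main.py, where the trainer is constructed, drop the extra keyword
# that the stock Trainer does not accept (argument name as reported above):
#
#     trainer = Seq2SeqTrainer(
#         model=model,
#         args=training_args,
#         ...,
#         # save_changed=...,   # removed: stock Trainer rejects unknown kwargs
#     )
```

Note that whatever behavior save_changed enabled in the repo's patched trainer is lost with this switch, so it is worth checking what the saved checkpoints contain before relying on them.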