[BUG] Multi-GPU fine-tuning error
Is there an existing issue / discussion for this?
- [X] I have searched the existing issues / discussions
Is there an existing answer for this in FAQ?
- [X] I have searched FAQ
Current Behavior
Multi-GPU fine-tuning fails when saving a checkpoint. It looks like the tmp directory has already been renamed by os.rename in trainer.py by the time the error occurs. Watching the output directory confirms this: tmp-checkpoint-2 is created, renamed to checkpoint-2, and then the error is raised. Related issue: https://github.com/OpenBMB/MiniCPM-V/issues/85#issuecomment-2121567801
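The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the Trainer code itself: it assumes that more than one process calls os.rename on the same staging directory, and simulates the two "ranks" sequentially in a single process. The function name simulate_double_rename is hypothetical.

```python
import os
import tempfile


def simulate_double_rename(root: str, step: int = 2) -> list[str]:
    """Two simulated ranks try to rename the same staging directory in turn."""
    staging = os.path.join(root, f"tmp-checkpoint-{step}")
    final = os.path.join(root, f"checkpoint-{step}")
    os.makedirs(staging)

    outcomes = []
    for rank in range(2):
        try:
            # The first caller succeeds; afterwards the staging directory is gone.
            os.rename(staging, final)
            outcomes.append(f"rank {rank}: renamed")
        except FileNotFoundError:
            # Same exception type as in the traceback in this issue.
            outcomes.append(f"rank {rank}: FileNotFoundError")
    return outcomes


with tempfile.TemporaryDirectory() as root:
    outcomes = simulate_double_rename(root)
    print(outcomes)
```

Whichever caller arrives second finds no tmp-checkpoint directory left, which matches the directory listing observed during the run.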
FileNotFoundError at os.rename:

  File "/home/kas/.conda/envs/MiniCPMV/lib/python3.10/site-packages/transformers/trainer.py", line 2274, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/kas/.conda/envs/MiniCPMV/lib/python3.10/site-packages/transformers/trainer.py", line 2383, in _save_checkpoint
    os.rename(staging_output_dir, output_dir)
FileNotFoundError: [Errno 2] No such file or directory: '/home/kas/lx/code/MiniCPM-V/finetune/output/output_minicpmv2/tmp-checkpoint-2' -> '/home/kas/lx/code/MiniCPM-V/finetune/output/output_minicpmv2/checkpoint-2'

RuntimeError in _save_rng_state:

Traceback (most recent call last):
  File "/home/kas/lx/code/MiniCPM-V/finetune/finetune.py", line 210, in <module>
    train()
  File "/home/kas/lx/code/MiniCPM-V/finetune/finetune.py", line 205, in train
    trainer.train()
  File "/home/kas/.conda/envs/MiniCPMV/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/kas/.conda/envs/MiniCPMV/lib/python3.10/site-packages/transformers/trainer.py", line 1914, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/kas/.conda/envs/MiniCPMV/lib/python3.10/site-packages/transformers/trainer.py", line 2274, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/kas/.conda/envs/MiniCPMV/lib/python3.10/site-packages/transformers/trainer.py", line 2356, in _save_checkpoint
    self._save_rng_state(staging_output_dir)
  File "/home/kas/.conda/envs/MiniCPMV/lib/python3.10/site-packages/transformers/trainer.py", line 2419, in _save_rng_state
    torch.save(rng_states, os.path.join(output_dir, f"rng_state_{self.args.process_index}.pth"))
  File "/home/kas/.conda/envs/MiniCPMV/lib/python3.10/site-packages/torch/serialization.py", line 618, in save
    with _open_zipfile_writer(f) as opened_zipfile:
  File "/home/kas/.conda/envs/MiniCPMV/lib/python3.10/site-packages/torch/serialization.py", line 492, in _open_zipfile_writer
    return container(name_or_buffer)
  File "/home/kas/.conda/envs/MiniCPMV/lib/python3.10/site-packages/torch/serialization.py", line 463, in __init__
    super().__init__(torch._C.PyTorchFileWriter(self.name))
RuntimeError: Parent directory /home/kas/lx/code/MiniCPM-V/finetune/output/output_minicpmv2/tmp-checkpoint-2 does not exist.
Expected Behavior
No response
Steps To Reproduce
No response
Environment
- OS:
- Python: 3.10
- Transformers: 4.36.0
- PyTorch: 2.1.2
- CUDA: 12.1
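
Since the OS field above is empty, a short script can fill in the environment list automatically. This is only a sketch: collect_versions is a hypothetical helper, the package names passed to it are assumptions about what is installed under those distribution names, and it does not report the CUDA version.

```python
import platform
from importlib import metadata


def collect_versions(packages=("transformers", "torch")) -> dict:
    """Gather OS, Python, and installed package versions for a bug report."""
    info = {"OS": platform.platform(), "Python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            info[name] = "not installed"
    return info


versions = collect_versions()
for key, value in versions.items():
    print(f"- {key}: {value}")
```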
Anything else?
No response
@qyc-98
After increasing the amount of data and raising save_steps to 40, the same error appeared, this time reporting that tmp-checkpoint-40 does not exist:
  File "/home/kas/.conda/envs/MiniCPMV/lib/python3.10/site-packages/transformers/trainer.py", line 2383, in _save_checkpoint
    os.rename(staging_output_dir, output_dir)
FileNotFoundError: [Errno 2] No such file or directory: '/home/kas/lx/code/MiniCPM-V/finetune/output/3d-100/output_minicpmv2/tmp-checkpoint-40' -> '/home/kas/lx/code/MiniCPM-V/finetune/output/3d-100/output_minicpmv2/checkpoint-40'
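
One possible way to make the rename step tolerant of this situation is sketched below. It is an illustration only, not the code in transformers and not a confirmed fix: rename_staging_if_present is a hypothetical helper, and it only covers the os.rename failure. It does not help ranks that are still writing files into the staging directory (the RuntimeError case), which would need the processes to be synchronized so that the rename happens once, after every rank has finished writing.

```python
import os
import tempfile


def rename_staging_if_present(staging_dir: str, final_dir: str) -> bool:
    """Rename staging_dir to final_dir.

    Returns True if this call did the rename, False if the staging directory
    was already gone and the final directory exists (another process won).
    """
    try:
        os.rename(staging_dir, final_dir)
        return True
    except FileNotFoundError:
        if os.path.isdir(final_dir):
            # Someone else already promoted the checkpoint; nothing left to do.
            return False
        # Neither directory exists: a real error, so surface it.
        raise


with tempfile.TemporaryDirectory() as root:
    staging = os.path.join(root, "tmp-checkpoint-40")
    final = os.path.join(root, "checkpoint-40")
    os.makedirs(staging)
    first = rename_staging_if_present(staging, final)
    second = rename_staging_if_present(staging, final)
    print(first, second)
```

The second call returns False instead of raising, because checkpoint-40 already exists by then.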
@Zmeo Hi, I ran into the same problem. How did you solve it in the end?
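+1, I hit this error with multi-node fine-tuning as well. In addition, the gradients on the other server cannot be saved to the main host.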