ChatGLM-6B

[BUG] PytorchStreamWriter failed writing file data/3: file write failed

Open MDGBDGMG opened this issue 1 year ago • 4 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Current Behavior

[2023-06-02 00:34:14,470] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: ./output/deepspeed/checkpoint-1428/global_step1428/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-06-02 00:34:14,470] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./output/deepspeed/checkpoint-1428/global_step1428/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-06-02 00:34:14,493] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./output/deepspeed/checkpoint-1428/global_step1428/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-06-02 00:34:14,497] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./output/deepspeed/checkpoint-1428/global_step1428/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
Traceback (most recent call last):
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 668, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/3: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    main()
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1996, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2242, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2303, in _save_checkpoint
    self.deepspeed.save_checkpoint(output_dir)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2894, in save_checkpoint
    self._save_zero_checkpoint(save_dir, tag)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3223, in _save_zero_checkpoint
    self.checkpoint_engine.save(zero_sd, zero_checkpoint_name)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
    torch.save(state_dict, path)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 442, in save
    return
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 291, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 10181728960 vs 10181728856

Traceback (most recent call last):
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 668, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/3: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    main()
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1996, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2242, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2303, in _save_checkpoint
    self.deepspeed.save_checkpoint(output_dir)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2894, in save_checkpoint
    self._save_zero_checkpoint(save_dir, tag)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3223, in _save_zero_checkpoint
    self.checkpoint_engine.save(zero_sd, zero_checkpoint_name)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
    torch.save(state_dict, path)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 442, in save
    return
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 291, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 10181728960 vs 10181728856

Traceback (most recent call last):
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 668, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/3: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    main()
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1996, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2242, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2303, in _save_checkpoint
    self.deepspeed.save_checkpoint(output_dir)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2894, in save_checkpoint
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
    self._save_zero_checkpoint(save_dir, tag)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3223, in _save_zero_checkpoint
    self.checkpoint_engine.save(zero_sd, zero_checkpoint_name)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
    torch.save(state_dict, path)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 442, in save
    return
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 291, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 10181728960 vs 10181728856

Traceback (most recent call last):
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 668, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/4: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    main()
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/featurize/work/xxxx/git/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1996, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2242, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2303, in _save_checkpoint
    self.deepspeed.save_checkpoint(output_dir)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2894, in save_checkpoint
    self._save_zero_checkpoint(save_dir, tag)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3223, in _save_zero_checkpoint
    self.checkpoint_engine.save(zero_sd, zero_checkpoint_name)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
    torch.save(state_dict, path)
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 442, in save
    return
  File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 291, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 12346575680 vs 12346575576
[2023-06-02 00:40:34,113] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 20636
[2023-06-02 00:40:40,691] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 20637
[2023-06-02 00:40:40,692] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 20638
[2023-06-02 00:40:40,711] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 20639

Expected Behavior

The model checkpoint should be saved successfully.

Steps To Reproduce

Full-parameter fine-tuning of the model with DeepSpeed. Environment details are shown in the attached screenshots.


Custom dataset of 7,000 samples. Three checkpoint saves were triggered during training; the first two succeeded, and the third failed with this error even though about 20 GB of memory was still free (see the attached screenshot, and the standalone write test sketched below).

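To separate a filesystem problem (free space, quota, slow network storage) from a bug in the training code, one option is a minimal standalone write test with plain torch.save against the same volume. This is just a sketch, assuming the checkpoints land under ./output/deepspeed as in the log above; adjust the path and size to your setup.

```python
import os
import torch

# Hypothetical standalone write test (not part of the original scripts):
# write a few GB to the same volume that DeepSpeed uses for checkpoints.
# If this also fails with "PytorchStreamWriter failed writing file ...",
# the problem is the filesystem, not the training code.
out_path = "./output/deepspeed/write_test.pt"  # same volume as the checkpoints
os.makedirs(os.path.dirname(out_path), exist_ok=True)

# ~4 GB of float32 data; grow it toward the size of the failing
# optimizer-state file (~10 GB in the log above) if this succeeds.
blob = torch.empty(1_000_000_000, dtype=torch.float32)

try:
    torch.save({"blob": blob}, out_path)
    print("write test succeeded")
except RuntimeError as err:
    print(f"write test failed: {err}")
finally:
    if os.path.exists(out_path):
        os.remove(out_path)  # clean up the test file
```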

Environment

OS: Ubuntu 20.04
Python: 3.9
Transformers: 4.26.1
PyTorch: 2.0.0
CUDA Support: True

Anything else?

No response

MDGBDGMG avatar Jun 01 '23 17:06 MDGBDGMG

Has this been solved? TvT

RyAkagiC avatar Jun 08 '23 09:06 RyAkagiC

Same problem.

dijkstra-mose avatar Jun 15 '23 07:06 dijkstra-mose

Same problem.

yangxingrui avatar Jun 15 '23 08:06 yangxingrui

same problem but in another project

ahnaf-al avatar Jun 19 '23 04:06 ahnaf-al

For me, the first checkpoint save succeeded and the second one failed with this error.

Xzaohui avatar Jun 30 '23 07:06 Xzaohui

Same problem. I haven't solved it yet, but I suspect the cause is insufficient disk space. I will reallocate more space to HOME and try again.

MegatronZhang avatar Jul 03 '23 09:07 MegatronZhang

Same problem. When I tried it again, I got:

  File "/root/.local/lib/python3.10/site-packages/tensorboardX/record_writer.py", line 193, in flush
2023-07-12 14:13:23.433 [ERROR] xxxxxxxxxxx:     self._writer.flush()
2023-07-12 14:13:23.434 [ERROR] xxxxxxxxxxx: OSError: [Errno 122] Disk quota exceeded

Following the hint from @MegatronZhang, I checked my disk space, and now it works.

gLinxi avatar Jul 12 '23 06:07 gLinxi

> Same problem. I haven't solved it yet, but I suspect the cause is insufficient disk space. I will reallocate more space to HOME and try again.

Have you solved it yet? When I ran into this problem, I used the free -h command and saw 180G free, which seems like plenty of disk space. Why am I still getting this error?

YinSonglin1997 avatar Jul 27 '23 02:07 YinSonglin1997
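Note that free -h reports RAM, not disk space; what matters for this error is the free space (and any per-user quota) on the volume that holds the checkpoints, e.g. via df -h. A small sketch for checking this from Python follows; the checkpoint directory path is an assumption taken from the log above.

```python
import shutil
import subprocess

# Check free space on the volume that actually holds the checkpoints.
# Note: `free -h` reports memory, not disk usage.
ckpt_dir = "./output/deepspeed"  # adjust to your --output_dir (must exist)

total, used, free = shutil.disk_usage(ckpt_dir)
print(f"checkpoint volume: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")

# Per-user quotas can reject writes even when the volume shows plenty of
# space (see the "Disk quota exceeded" error above). The `quota` tool is
# not installed everywhere, so treat it as optional.
try:
    subprocess.run(["quota", "-s"], check=False)
except FileNotFoundError:
    print("`quota` not installed; skipping quota check")
```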

Has this been solved, buddy?

uct8086 avatar Aug 21 '23 11:08 uct8086

I was using a cloud virtual machine with two disks: a network disk and a local disk. I was writing to the network disk, and its poor I/O performance caused this problem. Switching to the local disk solved it: load the model from the local disk, train, and write the results to the local disk. Then everything is fine.

MDGBDGMG avatar Oct 11 '23 03:10 MDGBDGMG
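If the checkpoints ultimately need to end up on the network disk, one workaround consistent with the fix above is to let the Trainer/DeepSpeed write to the fast local disk and copy each finished checkpoint over afterwards. The sketch below uses hypothetical paths and a hypothetical helper name, not anything from the original scripts.

```python
import shutil
from pathlib import Path

# Hypothetical paths: train with --output_dir pointing at the local disk,
# then mirror each finished checkpoint to the network disk afterwards.
LOCAL_OUT = Path("/local_disk/chatglm-6b-output")
REMOTE_OUT = Path("/network_disk/chatglm-6b-output")

def mirror_checkpoint(step: int) -> None:
    """Copy one finished checkpoint-<step> directory to the network disk."""
    src = LOCAL_OUT / f"checkpoint-{step}"
    dst = REMOTE_OUT / f"checkpoint-{step}"
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dst, dirs_exist_ok=True)

if __name__ == "__main__":
    # e.g. the checkpoint from the log above
    mirror_checkpoint(1428)
```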