Is there an existing issue for this?
- [x] I have searched the existing issues
Current Behavior
[2023-06-02 00:34:14,470] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: ./output/deepspeed/checkpoint-1428/global_step1428/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-06-02 00:34:14,470] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./output/deepspeed/checkpoint-1428/global_step1428/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-06-02 00:34:14,493] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./output/deepspeed/checkpoint-1428/global_step1428/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-06-02 00:34:14,497] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./output/deepspeed/checkpoint-1428/global_step1428/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
Traceback (most recent call last):
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 668, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/3: file write failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 431, in
main()
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 370, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
return inner_training_loop(
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1996, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2242, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2303, in _save_checkpoint
self.deepspeed.save_checkpoint(output_dir)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2894, in save_checkpoint
self._save_zero_checkpoint(save_dir, tag)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3223, in _save_zero_checkpoint
self.checkpoint_engine.save(zero_sd, zero_checkpoint_name)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
torch.save(state_dict, path)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 442, in save
return
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 291, in exit
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 10181728960 vs 10181728856
Traceback (most recent call last):
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 668, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/3: file write failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 431, in
main()
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 370, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
return inner_training_loop(
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1996, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2242, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2303, in _save_checkpoint
self.deepspeed.save_checkpoint(output_dir)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2894, in save_checkpoint
self._save_zero_checkpoint(save_dir, tag)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3223, in _save_zero_checkpoint
self.checkpoint_engine.save(zero_sd, zero_checkpoint_name)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
torch.save(state_dict, path)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 442, in save
return
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 291, in exit
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 10181728960 vs 10181728856
Traceback (most recent call last):
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 668, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/3: file write failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 431, in
main()
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 370, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
return inner_training_loop(
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1996, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2242, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2303, in _save_checkpoint
self.deepspeed.save_checkpoint(output_dir)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2894, in save_checkpoint
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
self._save_zero_checkpoint(save_dir, tag)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3223, in _save_zero_checkpoint
self.checkpoint_engine.save(zero_sd, zero_checkpoint_name)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
torch.save(state_dict, path)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 442, in save
return
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 291, in exit
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 10181728960 vs 10181728856
Traceback (most recent call last):
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 668, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/4: file write failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 431, in
main()
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/main.py", line 370, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/featurize/work/xxxx/git/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
return inner_training_loop(
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 1996, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2242, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/featurize/work/xxx/git/ChatGLM-6B/ptuning/trainer.py", line 2303, in _save_checkpoint
self.deepspeed.save_checkpoint(output_dir)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2894, in save_checkpoint
self._save_zero_checkpoint(save_dir, tag)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3223, in _save_zero_checkpoint
self.checkpoint_engine.save(zero_sd, zero_checkpoint_name)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
torch.save(state_dict, path)
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 442, in save
return
File "/environment/miniconda3/envs/py39/lib/python3.9/site-packages/torch/serialization.py", line 291, in exit
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 12346575680 vs 12346575576
[2023-06-02 00:40:34,113] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 20636
[2023-06-02 00:40:40,691] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 20637
[2023-06-02 00:40:40,692] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 20638
[2023-06-02 00:40:40,711] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 20639
Expected Behavior
The model checkpoint should be saved normally.
Steps To Reproduce
Run full-parameter fine-tuning of the large model with DeepSpeed.
Custom dataset with 7,000 samples in total.
Checkpoints were saved three times in total; the first two succeeded, but the third failed with the error above, while 20 GB of memory was still free.

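For reference, the failure is in the plain torch.save call inside DeepSpeed's TorchCheckpointEngine (see the traceback above). Below is a rough standalone sketch of that same save path, not my actual training code: the tensor size is only an approximation of the ~10 GB bf16 ZeRO optimizer shard (estimated from the byte offsets in the error), and the output path is just an example.

```python
import os
import torch

out_dir = "./output/deepspeed/checkpoint-test"  # example path only
os.makedirs(out_dir, exist_ok=True)

# Approximate stand-in for the ZeRO optimizer shard that fails to serialize:
# ~10 GB of float32, roughly matching the "unexpected pos 10181728960" offset
# in the traceback (allocating it needs ~10 GB of RAM).
fake_shard = torch.empty(2_545_000_000, dtype=torch.float32)

# Same torch.save call that deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py
# makes in save(). If the underlying write fails (for example, the target filesystem
# cannot hold another ~10 GB file), it raises the same
# "PytorchStreamWriter failed writing file ...: file write failed" error seen above.
torch.save({"optimizer_state_dict": fake_shard},
           os.path.join(out_dir, "bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt"))
```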
Environment
- OS: Ubuntu 20.04
- Python: 3.9
- Transformers: 4.26.1
- PyTorch: 2.0.0
- CUDA Support: True
Anything else?
No response