DeepSpeedExamples
Bug in model save with ZeRO stage 3
  File "main.py", line 334, in main
    save_hf_format(model, tokenizer, args)
  File ".../applications/DeepSpeed-Chat/training/utils/utils.py", line 51, in save_hf_format
    os.makedirs(output_dir)
  File "/usr/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '.../output/actor/'
The crash happens when the trained model is saved: https://github.com/microsoft/DeepSpeedExamples/blob/dcf67c001702811bfea7aec715844882bb44ee77/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L323
Rank 0 checks that the output folder does not exist: https://github.com/microsoft/DeepSpeedExamples/blob/dcf67c001702811bfea7aec715844882bb44ee77/applications/DeepSpeed-Chat/training/utils/utils.py#L50
The folder is then created on this line by a rank != 0, before rank 0 gets to create it: https://github.com/microsoft/DeepSpeedExamples/blob/dcf67c001702811bfea7aec715844882bb44ee77/applications/DeepSpeed-Chat/training/utils/utils.py#L132
On this line, the program crashes because there is no exist_ok=True:
https://github.com/microsoft/DeepSpeedExamples/blob/dcf67c001702811bfea7aec715844882bb44ee77/applications/DeepSpeed-Chat/training/utils/utils.py#L51
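The sequence described above can be reproduced in a single process. This is only a minimal sketch, not the DeepSpeed-Chat code: `racy_makedirs`, its `between` hook, and `reproduce` are hypothetical names. The hook simulates another rank creating the folder between rank 0's existence check and its `os.makedirs` call.

```python
import os
import tempfile


def racy_makedirs(output_dir, between=None):
    """Check-then-create, mirroring the pattern described in this issue.

    `between` is a hypothetical hook that runs after the existence check
    and before the creation, standing in for what another rank does.
    """
    if not os.path.exists(output_dir):
        if between is not None:
            between()
        os.makedirs(output_dir)  # no exist_ok=True, so this can raise


def reproduce():
    """Return the name of the exception raised by the racy pattern, if any."""
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "output", "actor")
        try:
            racy_makedirs(
                target,
                # Simulated "other rank": creates the folder in the gap.
                between=lambda: os.makedirs(target, exist_ok=True),
            )
        except FileExistsError as err:
            return type(err).__name__
    return None


if __name__ == "__main__":
    print(reproduce())
```

The existence check passes, the simulated other rank creates the folder, and the unguarded `os.makedirs` then fails in the same way as in the traceback.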
The check at https://github.com/microsoft/DeepSpeedExamples/blob/dcf67c001702811bfea7aec715844882bb44ee77/applications/DeepSpeed-Chat/training/utils/utils.py#L50 should probably be replaced. Change
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
to
os.makedirs(output_dir, exist_ok=True)
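To illustrate why `exist_ok=True` removes the crash, here is a small sketch in which several workers create the same folder at the same moment. Threads stand in for ranks purely as an assumption for the demo (real ranks are separate processes), and `create_output_dir` and `run_concurrently` are hypothetical helper names, not part of DeepSpeed-Chat.

```python
import os
import tempfile
import threading


def create_output_dir(output_dir):
    # With exist_ok=True the call succeeds whether or not the directory
    # already exists, so no separate existence check is needed.
    os.makedirs(output_dir, exist_ok=True)


def run_concurrently(num_workers=8):
    """Have several workers create the same directory at once.

    Returns (directory_exists, list_of_errors).
    """
    errors = []
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "output", "actor")
        # The barrier releases all workers together to maximize contention.
        barrier = threading.Barrier(num_workers)

        def worker():
            barrier.wait()
            try:
                create_output_dir(target)
            except OSError as err:
                errors.append(err)

        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        created = os.path.isdir(target)
    return created, errors


if __name__ == "__main__":
    created, errors = run_concurrently()
    print(created, len(errors))
```

Because no worker depends on a stale existence check, the order in which they reach `os.makedirs` no longer matters.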
I ran into the same problem. Thank you very much, @s-isaev!
@s-isaev thank you for identifying the root cause. Would you like to create a PR to fix the problem?
@yaozhewei Waiting for approval: https://github.com/microsoft/DeepSpeedExamples/pull/415
Thanks @s-isaev.