Bug in model save with Zero stage 3

Open s-isaev opened this issue 2 years ago • 1 comments

  File "main.py", line 334, in main
    save_hf_format(model, tokenizer, args)
  File ".../applications/DeepSpeed-Chat/training/utils/utils.py", line 51, in save_hf_format
    os.makedirs(output_dir)
  File "/usr/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '.../output/actor/'

The crash happens when the trained model is saved: https://github.com/microsoft/DeepSpeedExamples/blob/dcf67c001702811bfea7aec715844882bb44ee77/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L323

Rank 0 first checks that the output folder does not exist: https://github.com/microsoft/DeepSpeedExamples/blob/dcf67c001702811bfea7aec715844882bb44ee77/applications/DeepSpeed-Chat/training/utils/utils.py#L50

Before rank 0 can create it, the folder is created on this line by another rank (rank != 0): https://github.com/microsoft/DeepSpeedExamples/blob/dcf67c001702811bfea7aec715844882bb44ee77/applications/DeepSpeed-Chat/training/utils/utils.py#L132

Rank 0 then crashes on this line, because os.makedirs is called without exist_ok=True: https://github.com/microsoft/DeepSpeedExamples/blob/dcf67c001702811bfea7aec715844882bb44ee77/applications/DeepSpeed-Chat/training/utils/utils.py#L51
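The sequence above is a check-then-create race. A minimal single-process sketch of it follows. It is not the DeepSpeed-Chat code: `racy_makedirs`, `simulate_race`, and the `on_after_check` hook are hypothetical names, and the hook only stands in for another rank creating the folder between the check and the `os.makedirs` call.

```python
import os
import tempfile


def racy_makedirs(output_dir, on_after_check=None):
    """Check-then-create, mirroring the pattern described in this issue."""
    if not os.path.exists(output_dir):
        # Window in which another process can create the directory.
        if on_after_check is not None:
            on_after_check()
        os.makedirs(output_dir)  # no exist_ok=True, so this can raise


def simulate_race():
    """Return True if FileExistsError is raised when the race is lost."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "output", "actor")

        def other_rank():
            # Hypothetical stand-in for a different rank creating the folder.
            os.makedirs(target, exist_ok=True)

        try:
            racy_makedirs(target, on_after_check=other_rank)
        except FileExistsError:
            return True
        return False


if __name__ == "__main__":
    print("FileExistsError raised:", simulate_race())
```

In a real multi-GPU run the window is only a few instructions wide, so the failure would be intermittent rather than guaranteed as it is in this sketch.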

The check at https://github.com/microsoft/DeepSpeedExamples/blob/dcf67c001702811bfea7aec715844882bb44ee77/applications/DeepSpeed-Chat/training/utils/utils.py#L50 probably needs to be replaced. Change

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

to

os.makedirs(output_dir, exist_ok=True)
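As a rough check that `exist_ok=True` tolerates several callers creating the same folder at once, here is a small sketch that uses threads in place of ranks. `ensure_dir` and `create_concurrently` are hypothetical helpers for illustration, not part of DeepSpeed-Chat, and threads are only an approximation of separate training processes.

```python
import os
import tempfile
import threading


def ensure_dir(path):
    """Create path if needed; do not fail if it already exists as a directory."""
    os.makedirs(path, exist_ok=True)
    return path


def create_concurrently(path, workers=8):
    """Call ensure_dir(path) from several threads at once; return any errors."""
    errors = []
    barrier = threading.Barrier(workers)

    def worker():
        barrier.wait()  # release all workers together to provoke contention
        try:
            ensure_dir(path)
        except OSError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "output", "actor")
        print("errors:", create_concurrently(target))
        print("is directory:", os.path.isdir(target))
```

Note that `exist_ok=True` still raises `FileExistsError` if the path already exists as a regular file, so it only suppresses the error for the directory case.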

s-isaev, Apr 20 '23 09:04

I ran into the same problem. Thank you very much! @s-isaev

LuciusMos, Apr 20 '23 13:04

@s-isaev thank you for identifying the root cause. Would you like to create a PR to fix the problem?

yaozhewei, Apr 24 '23 04:04

@yaozhewei The PR is waiting for approval: https://github.com/microsoft/DeepSpeedExamples/pull/415

s-isaev, Apr 24 '23 15:04

Thanks @s-isaev.

tjruwase, Apr 24 '23 16:04