
Bug! Checkpoint resume failure - DeepSpeed different DP size. Is there a quick checkpoint converter anywhere?

Open · tjoymeed opened this issue 8 months ago · 1 comment

Hi all,

I am using the latest MS-SWIFT GRPO LoRA training and I run the training on 4x8 = 32 GPUs.

Now I need to resume training on 2x8 = 16 GPUs.

But simply adding --resume_from_checkpoint doesn't work: DeepSpeed complains about a different DP size.

I also tried the DeepSpeed universal checkpoint converter, but it gave the errors below. How can I fix these?
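
For context, my understanding of the intended flow is: first convert the sharded ZeRO checkpoint with ds_to_universal.py, then resume with a DeepSpeed config that enables universal-checkpoint loading. A minimal sketch of the config change as I understand it (the `checkpoint.load_universal_checkpoint` key and the file names are my assumptions, not something from the MS-SWIFT docs):

```python
import json

# Assumption: "zero3.json" is a placeholder for the ZeRO config used in the
# original 32-GPU run.
with open("zero3.json") as f:
    ds_config = json.load(f)

# My understanding is that DeepSpeed looks for this switch in the "checkpoint"
# section of the config when resuming from a universal checkpoint.
ds_config["checkpoint"] = {"load_universal_checkpoint": True}

with open("zero3_universal.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# The new file would then be passed to training via --deepspeed zero3_universal.json
# (untested on my side).
```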

Also, are the checkpoints saved by MS-SWIFT GRPO LoRA training full checkpoints, or only the LoRA part without the base model (i.e., not yet merged)?
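
(For reference, the quickest way I know to tell whether a checkpoint directory is adapter-only or a merged full model is just to list its files; a minimal sketch, with the path as a placeholder:)

```python
import os

ckpt = "/path/to/checkpoint-400"  # placeholder for the actual checkpoint dir

for name in sorted(os.listdir(ckpt)):
    print(name)

# Rough heuristic: an adapter-only checkpoint contains adapter_config.json and
# adapter_model.safetensors (plus the DeepSpeed global_stepN folder), while a
# merged/full checkpoint contains config.json and model-*.safetensors shards.
print("adapter-only:", os.path.exists(os.path.join(ckpt, "adapter_config.json")))
```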


If I run the following command:

python /myprojects/venv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py --input_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400 --output_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal

I got the following errors:

[2025-04-24 12:59:09,183] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-04-24 12:59:09,193] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cpu (auto detect)
args = Namespace(input_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400', output_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal', num_extract_workers=4, num_merge_workers=2, keep_temp_folder=False, strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400 to Universal checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal
Traceback (most recent call last):
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 549, in <module>
    main(args)
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 474, in main
    optim_files = _get_optim_files(args.input_folder)
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 432, in _get_optim_files
    return _get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 443, in _get_checkpoint_files
    raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
FileNotFoundError: can't find *_optim_states.pt files in directory '/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400'
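
From the traceback, the per-rank *_optim_states.pt shards are apparently not expected directly under checkpoint-400 but live in its global_step400 subfolder. A minimal sketch to count the shards (placeholder path; the glob pattern is the one from the error message):

```python
import glob
import os

step_dir = "/path/to/checkpoint-400/global_step400"  # placeholder

optim_shards = sorted(glob.glob(os.path.join(step_dir, "*_optim_states.pt")))
print(f"{len(optim_shards)} optimizer shard(s)")
for path in optim_shards:
    print(" ", os.path.basename(path))

# With ZeRO, optimizer state is partitioned per data-parallel rank, so a 32-GPU
# run should leave 32 shards here -- which is presumably why a plain
# --resume_from_checkpoint on 16 GPUs complains about a different DP size.
```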

If I run the following command:

python /myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py --input_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/ --output_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal

I got the following errors:

[2025-04-24 13:02:05,001] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-04-24 13:02:05,014] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cpu (auto detect)
args = Namespace(input_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/', output_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal', num_extract_workers=4, num_merge_workers=2, keep_temp_folder=False, strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/ to Universal checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal
Traceback (most recent call last):
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 549, in <module>
    main(args)
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 482, in main
    _check_for_required_state(ds_checkpoint)
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 466, in _check_for_required_state
    assert universal_checkpoint_info is not None, f'Required {UNIVERSAL_CHECKPOINT_INFO} state is missing in checkpoint. Verify that client creates this state.'
AssertionError: Required universal_checkpoint_info state is missing in checkpoint. Verify that client creates this state.
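
So the converter expects the training client to have written a universal_checkpoint_info block into the checkpoint, which the MS-SWIFT/transformers trainer apparently does not do. The converter's own --inject_missing_state flag (visible in the printed args above) looks like it targets exactly this case, though I have not confirmed that it is sufficient here. A minimal sketch to check whether the key exists at all (the key name comes from the assertion message; the file pattern and torch.load options are my assumptions):

```python
import glob
import os

import torch

step_dir = "/path/to/checkpoint-400/global_step400"  # placeholder

# DeepSpeed keeps module/client state in the *model_states.pt files.
model_states = sorted(glob.glob(os.path.join(step_dir, "*model_states.pt")))
state = torch.load(model_states[0], map_location="cpu", weights_only=False)

print("universal_checkpoint_info present:", "universal_checkpoint_info" in state)
```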

tjoymeed · Apr 24 '25 20:04

I am now trying to convert just the LoRA adapter part.

After converting the LoRA adapter, I used the "--peft-path" parameter in the training shell script to set the path to the new adapter, but that parameter doesn't seem to work.

I then used the "--adapters" parameter in the training shell script and got the following error message:

raise ValueError(f'Please set --model <model_id_or_path>`, model: {self.model}')

ValueError: Please set --model <model_id_or_path>`, model: None

But I have already set it.

My shell script config looks like the following:

--model Qwen/Qwen2.5-7B-Instruct \
--adapters /myprojects/ms-swift/output/Qwen2.5-7B-Instruct/v3-20250423-132415/checkpoint-400-converted-lora-adapter \
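
Independent of the argument problem, here is a minimal sketch (plain transformers + peft, placeholder path) of how I would sanity-check that the converted adapter directory is itself a valid LoRA adapter for the base model:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_dir = "/path/to/checkpoint-400-converted-lora-adapter"  # placeholder

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")

# If this loads, the adapter_config.json / adapter weights in adapter_dir are
# intact and compatible with the base model, and the problem is on the argument
# side (--model vs --adapters vs --resume_from_checkpoint), not the conversion.
model = PeftModel.from_pretrained(base, adapter_dir)
print(model.peft_config)
```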

What's wrong?

tjoymeed · Apr 24 '25 23:04

Feel free to reopen if you have any issues.

hjh0119 · Jun 26 '25 12:06