Bug: checkpoint resume failure with DeepSpeed at a different DP size. Is there a quick checkpoint converter anywhere?
Hi all,
I am using the latest MS-SWIFT GRPO LoRA training, running on 4x8 = 32 GPUs.
Now I need to resume training on 2x8 = 16 GPUs.
But simply adding --resume_from_checkpoint doesn't work: DeepSpeed complains about the different DP size.
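(From the DeepSpeed universal-checkpointing docs, my understanding is that after converting a checkpoint, resume is supposed to be enabled via a flag in the DeepSpeed config like the one below. The key name is my reading of the docs, so please correct me if I have it wrong:)

```json
{
  "checkpoint": {
    "load_universal": true
  }
}
```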
I also tried DeepSpeed's universal checkpoint converter, but it raised the errors below. How can I fix them?
Also, are the checkpoints saved by MS-SWIFT GRPO LoRA training full model checkpoints, or only the LoRA adapter without the base model (i.e., not yet merged)?
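For context, this is how I have been checking locally whether a checkpoint directory is adapter-only. The file names follow PEFT's standard layout; the helper itself is just my own sketch, not anything from MS-SWIFT:

```python
from pathlib import Path

def classify_checkpoint(ckpt_dir: str) -> str:
    """Guess whether a checkpoint dir holds only a LoRA adapter or full weights."""
    d = Path(ckpt_dir)
    # PEFT writes adapter_config.json (+ adapter weights) for a LoRA adapter;
    # a merged/full checkpoint ships the base weights instead.
    has_adapter = (d / "adapter_config.json").exists()
    has_full = any(d.glob("model*.safetensors")) or any(d.glob("pytorch_model*.bin"))
    if has_adapter and not has_full:
        return "adapter-only"
    if has_full:
        return "full-model"
    return "unknown"
```

If the checkpoint-400 folder only has adapter files, I assume I would need to merge before any full-model conversion applies.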
If I run the following command: python /myprojects/venv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py --input_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400 --output_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal
I got the following errors:
[2025-04-24 12:59:09,183] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-04-24 12:59:09,193] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cpu (auto detect)
args = Namespace(input_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400', output_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal', num_extract_workers=4, num_merge_workers=2, keep_temp_folder=False, strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400 to Universal checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal
Traceback (most recent call last):
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 549, in
If I instead point the converter at the global_step400 subfolder: python /myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py --input_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/ --output_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal
I got the following errors:
[2025-04-24 13:02:05,001] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-04-24 13:02:05,014] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cpu (auto detect)
args = Namespace(input_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/', output_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal', num_extract_workers=4, num_merge_workers=2, keep_temp_folder=False, strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/ to Universal checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal
Traceback (most recent call last):
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 549, in
I am now trying to convert only the LoRA adapter part.
After converting it, I used the --peft-path parameter in the training shell script to point to the new LoRA adapter, but the parameter doesn't seem to work.
I then tried the --adapters parameter instead and got the following error message:
raise ValueError(f'Please set --model <model_id_or_path>`, model: {self.model}')
ValueError: Please set --model <model_id_or_path>`, model: None
But I have already set --model.
My shell script config looks like the following:
--model Qwen/Qwen2.5-7B-Instruct \
--adapters /myprojects/ms-swift/output/Qwen2.5-7B-Instruct/v3-20250423-132415/checkpoint-400-converted-lora-adapter \
What's wrong?
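One mismatch I tried to rule out while debugging: the base model recorded inside the converted adapter's adapter_config.json differing from the --model value. The helper below is just my own sketch over PEFT's standard base_model_name_or_path field (the function name is mine):

```python
import json
from pathlib import Path

def adapter_base_matches(adapter_dir: str, base_model: str) -> bool:
    """Check that the adapter was trained against the given base model id."""
    cfg = json.loads((Path(adapter_dir) / "adapter_config.json").read_text())
    return cfg.get("base_model_name_or_path") == base_model
```

In my case this returns True, so the adapter and --model at least agree on the base model.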
Feel free to reopen if you have any issues.