
Bug! After resuming training, the info line doesn't show details anymore ...

tjoymeed opened this issue 8 months ago · 3 comments

As you can see, after resuming training, the [INFO:swift] line no longer shows any details, only dashes "--------------------".

All I did was add this line to the training script:

--resume_from_checkpoint /myprojects/ms-swift/output/Qwen2.5-7B-32GPUs/v3-20250423-132415/checkpoint-400

Nothing else.

What's wrong?

Could anybody please help me?

Thanks a lot!


[INFO:swift] --------------------------------------
INFO 04-24 22:03:27 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:27 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:28 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:28 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache

tjoymeed · Apr 25 '25 05:04

--load_args true
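
For example, a minimal sketch of where the flag goes (the checkpoint path is the one from your script; the model, dataset, and output_dir are placeholders, so keep the rest of your original arguments unchanged):

CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset 'ServiceNow-AI/R1-Distill-SFT:v1#2000' \
    --resume_from_checkpoint /myprojects/ms-swift/output/Qwen2.5-7B-32GPUs/v3-20250423-132415/checkpoint-400 \
    --load_args true \
    --output_dir output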

slin000111 · Apr 27 '25 09:04

The problem is that when I use --resume_from_checkpoint, all the parameters in the training .sh script are kept exactly the same as in the previous training run; the only change is the added --resume_from_checkpoint.

So why do I need to set load_args explicitly?

Also, the manual says load_args should be false during training.

Thanks a lot!

tjoymeed · Apr 27 '25 16:04

If you only add the resume_from_checkpoint parameter and do not delete the other parameters, there is no need to add load_args. In addition, the latest main branch does not reproduce the problem described above (the [INFO:swift] line showing only dashes "--------------------").
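
If you want to verify this on your side, you can install ms-swift from source (the usual source-install steps; adjust the environment and extras as needed):

git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .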

The training script is as follows; only the resume_from_checkpoint parameter was added to the original training script.

CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset 'ServiceNow-AI/R1-Distill-SFT:v1#2000' \
    --resume_from_checkpoint '/mnt/workspace/output/v4-20250428-102132/checkpoint-100' \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 1 \
    --gradient_checkpointing_kwargs '{"use_reentrant": false}'

slin000111 · Apr 28 '25 03:04

This issue has been inactive for over 3 months and will be automatically closed in 7 days. If this issue is still relevant, please reply to this message.

github-actions[bot] · Jul 28 '25 00:07

This issue has been automatically closed due to inactivity. If needed, it can be reopened.

github-actions[bot] · Aug 04 '25 00:08