Bug! After resuming training, the info line doesn't show details anymore ...
As you can see, after resuming training, the [INFO:swift] line no longer shows any details, only dashes: "--------------------".
All I did was add this line to the training script:
--resume_from_checkpoint /myprojects/ms-swift/output/Qwen2.5-7B-32GPUs/v3-20250423-132415/checkpoint-400
Nothing else.
What's wrong?
Could anybody please help me?
Thanks a lot!
[INFO:swift] --------------------------------------
INFO 04-24 22:03:27 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:27 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:28 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:28 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
--load_args true
The problem is that when I use --resume_from_checkpoint, all the parameters in the training .sh script are kept exactly the same as in the previous run; the only change is the added "--resume_from_checkpoint".
So why do I need to set "load_args" explicitly?
Also, the manual says it should be "False" during training.
Thanks a lot!
If you only add the resume_from_checkpoint parameter and do not delete the other parameters, there is no need to add load_args.
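For contrast, load_args is meant for the case where you do not repeat the original parameters: it reads the args.json saved next to the checkpoint. A minimal sketch of that style (the exact restore behavior is an assumption here, and as noted above the manual says load_args defaults to false during training):
# Hypothetical minimal resume relying on load_args to restore the original
# training arguments from the checkpoint's args.json (assumption).
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--resume_from_checkpoint '/mnt/workspace/output/v4-20250428-102132/checkpoint-100' \
--load_args true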
In addition, the latest main branch does not reproduce the problem described above (the [INFO:swift] line showing only dashes "--------------------").
The training script is as follows; only the resume_from_checkpoint parameter was added to the original training script.
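# Single-GPU LoRA SFT resumed from checkpoint-100; all other arguments are
# unchanged from the original run.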
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model Qwen/Qwen2.5-7B-Instruct \
--dataset 'ServiceNow-AI/R1-Distill-SFT:v1#2000' \
--resume_from_checkpoint '/mnt/workspace/output/v4-20250428-102132/checkpoint-100' \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-4 \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--gradient_accumulation_steps 16 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 5 \
--max_length 2048 \
--output_dir output \
--warmup_ratio 0.05 \
--dataloader_num_workers 1 \
--gradient_checkpointing_kwargs '{"use_reentrant": false}'
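To verify that the run actually resumed from the intended step, you can read the global_step recorded in trainer_state.json, which the underlying transformers Trainer writes into every checkpoint directory:
# Print the step the checkpoint was saved at (trainer_state.json comes from
# the transformers Trainer that ms-swift builds on).
python -c "import json; print(json.load(open('/mnt/workspace/output/v4-20250428-102132/checkpoint-100/trainer_state.json'))['global_step'])"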