
Bug! Help! MS-SWIFT GRPO + LoRA training hung/stuck after training 1 step from full merged model merged from lora adapter

Open tjoymeed opened this issue 8 months ago • 1 comments

Hi all,

I am running GRPO RL training with LoRA using ModelScope MS-SWIFT.

I cannot resume directly from the checkpoint because the number of GPUs available to me was reduced (ref: https://github.com/modelscope/ms-swift/issues/3989), so I converted the checkpoint to a merged full model and started training from scratch from that merged model.

In the training script, I supply the path to the merged full model:

swift rlhf \
--rlhf_type grpo \
--model /myprojects/ms-swift/output/Qwen2.5-7B-32GPUs/v3-20250423-132415/checkpoint-400-mergedfull \
--model_type qwen2_5 \
--train_type lora \

Surprisingly, training hung after 1 step.

The whole program froze...

What's wrong?

Could anybody help?

Thanks!

tjoymeed avatar Apr 24 '25 23:04 tjoymeed

pip install py-spy
py-spy dump --pid <pid>
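
A distributed run usually starts one Python process per GPU, so a dump from a single pid may miss the rank that is actually stuck. The following is a minimal sketch, not part of ms-swift, that runs the same `py-spy dump --pid` command for several pids. The helper names `pyspy_dump_command` and `dump_all` are hypothetical, and how you collect the pids is left to you.

```python
import subprocess


def pyspy_dump_command(pid):
    """Build the argument list for dumping one process's Python stacks."""
    return ["py-spy", "dump", "--pid", str(pid)]


def dump_all(pids, runner=subprocess.run):
    """Run py-spy against every pid and return {pid: captured stdout}.

    `runner` is injectable so the function can be exercised without
    py-spy installed; by default it is subprocess.run.
    """
    dumps = {}
    for pid in pids:
        result = runner(
            pyspy_dump_command(pid),
            capture_output=True,
            text=True,
            check=False,  # a dead or inaccessible pid should not abort the loop
        )
        dumps[pid] = result.stdout
    return dumps
```

Comparing the top frames of the main thread across ranks can show whether every rank is waiting in the same place or whether one rank has diverged. Attaching to another process may require elevated privileges, depending on the system's ptrace settings.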

slin000111 avatar Apr 27 '25 09:04 slin000111

pip install py-spy
py-spy dump --pid <pid>

For my case, the py-spy result is:

Process 3250260: /home/user/miniconda3/envs/swift/bin/python3.11 -u /home/user/Desktop/GRPO/grpo_swift/ms-swift/swift/cli/rlhf.py --rlhf_type grpo --model /home/user/Desktop/GRPO/grpo_swift/output/sft/v4-20250530-192816/checkpoint-2319-merged --reward_funcs external_r1v_acc format --reward_weights 1 0.5 --train_type lora --lora_rank 8 --lora_alpha 16 --target_modules all-linear --torch_dtype bfloat16 --dataset open-r1/OpenThoughts-114k-math --external_plugins /home/yyq/Desktop/GRPO/grpo_swift/ms-swift/examples/train/grpo/plugin/plugin.py --max_completion_length 4096 --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 1e-5 --gradient_accumulation_steps 4 --eval_steps 100 --save_steps 100 --save_total_limit 2 --logging_steps 5 --max_length 8192 --output_dir output/grpo --warmup_ratio 0.05 --dataloader_num_workers 64 --dataset_num_proc 4 --num_generations 4 --temperature 1. --top_p 0.99 --top_k 50 --system /home/yyq/Desktop/GRPO/grpo_swift/multi_turn_grpo/tool_system.txt --deepspeed zero3 --log_completions true --report_to swanlab --swanlab_project GRPO --use_vllm true --vllm_mode colocate --offload_model true --offload_optimize true --vllm_gpu_memory_utilization 0.5 --vllm_tensor_parallel 4
Python v3.11.11 (/home/user/miniconda3/envs/swift/bin/python3.11)
Thread 3250260 (idle): "MainThread"
    __call__ (torch/_ops.py:1158)
    forward (vllm/v1/attention/backends/flash_attn.py:577)
    unified_attention_with_output (vllm/attention/layer.py:425)
    __call__ (torch/_ops.py:1158)
    forward (<eval_with_key>.12:5)
    _call_impl (torch/nn/modules/module.py:1762)
    _wrapped_call_impl (torch/nn/modules/module.py:1751)
    __call__ (torch/fx/graph_module.py:393)
    call_wrapped (torch/fx/graph_module.py:830)
    forward (<eval_with_key>.74:339)
    _call_impl (torch/nn/modules/module.py:1762)
    _wrapped_call_impl (torch/nn/modules/module.py:1751)
    __call__ (torch/fx/graph_module.py:393)
    call_wrapped (torch/fx/graph_module.py:830)
    _fn (torch/_dynamo/eval_frame.py:838)
    _call_impl (torch/nn/modules/module.py:1762)
    _wrapped_call_impl (torch/nn/modules/module.py:1751)
    forward (vllm/model_executor/models/qwen2.py:340)
    __call__ (vllm/compilation/decorators.py:245)
    forward (vllm/model_executor/models/qwen3.py:300)
    _call_impl (torch/nn/modules/module.py:1762)
    _wrapped_call_impl (torch/nn/modules/module.py:1751)
    execute_model (vllm/v1/worker/gpu_model_runner.py:1196)
    decorate_context (torch/utils/_contextlib.py:116)
    execute_model (vllm/v1/worker/gpu_worker.py:276)
    decorate_context (torch/utils/_contextlib.py:116)
    run_method (vllm/utils.py:2605)
    collective_rpc (vllm/executor/uniproc_executor.py:56)
    execute_model (vllm/v1/executor/abstract.py:86)
    execute_model (vllm/v1/engine/core.py:207)
    step (vllm/v1/engine/core.py:226)
    get_output (vllm/v1/engine/core_client.py:209)
    step (vllm/v1/engine/llm_engine.py:231)
    infer (swift/llm/infer/infer_engine/vllm_engine.py:475)
    _engine_infer (swift/trainers/rlhf_trainer/grpo_trainer.py:1318)
    _infer (swift/trainers/rlhf_trainer/grpo_trainer.py:584)
    _infer_single_or_multi_turn (swift/trainers/rlhf_trainer/grpo_trainer.py:623)
    _fast_infer (swift/trainers/rlhf_trainer/grpo_trainer.py:768)
    _generate_completions (swift/trainers/rlhf_trainer/grpo_trainer.py:792)
    _generate_and_score_completions (swift/trainers/rlhf_trainer/grpo_trainer.py:817)
    _prepare_inputs (swift/trainers/rlhf_trainer/grpo_trainer.py:321)
    wrapper (trl/extras/profiling.py:96)
    training_step (transformers/trainer.py:3739)
    training_step (swift/trainers/rlhf_trainer/grpo_trainer.py:1305)
    _inner_training_loop (transformers/trainer.py:2555)
    train (transformers/trainer.py:2240)
    train (swift/trainers/mixin.py:369)
    train (swift/llm/train/sft.py:182)
    run (swift/llm/train/sft.py:122)
    main (swift/llm/base.py:49)
    rlhf_main (swift/llm/train/rlhf.py:169)
    <module> (swift/cli/rlhf.py:5)
Thread 3250481 (idle): "Thread-1 (_read_thread)"
    _recv_msg (torch/_inductor/compile_worker/subproc_pool.py:55)
    _read_thread (torch/_inductor/compile_worker/subproc_pool.py:191)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3250951 (idle): "Thread-2"
    wait (threading.py:331)
    wait (threading.py:629)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3252597 (idle): "Thread-3 (_report_usage_worker)"
    _report_continuous_usage (vllm/usage/usage_lib.py:229)
    _report_usage_worker (vllm/usage/usage_lib.py:164)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3252825 (idle): "Thread-4"
    wait (threading.py:331)
    wait (threading.py:629)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3254651 (idle): "MsgUploader"
    new_task (swanlab/data/cloud/start_thread.py:120)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255360 (idle): "Thread-13 (_pin_memory_loop)"
    select (selectors.py:415)
    wait (multiprocessing/connection.py:948)
    _poll (multiprocessing/connection.py:440)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    do_one_step (torch/utils/data/_utils/pin_memory.py:37)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:61)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255361 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255362 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255363 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255364 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255365 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255366 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255367 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255368 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255369 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255370 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255371 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255372 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255373 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255375 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255376 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255380 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255382 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255391 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255392 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255395 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255396 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255397 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255399 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255400 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255402 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255404 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255406 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255408 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255410 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255412 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255415 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255416 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255418 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255421 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255423 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255425 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255426 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255427 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255429 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255432 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255434 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255435 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255438 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255440 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255441 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255443 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255446 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255448 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255450 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255453 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255455 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255458 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255459 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255462 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255464 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255465 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255467 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255469 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255471 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255473 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255475 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255476 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255479 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3255481 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 3257098 (idle)
Thread 3257101 (idle)
Thread 3257104 (idle)
Thread 3257108 (idle)
Thread 3335904 (idle): "Thread-73"
    wait (threading.py:331)
    wait (threading.py:629)
    run (threading.py:1399)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)

I think all cards hang because of the communication between the GPUs, as seen in nvtop:

[Image: nvtop screenshot]
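
Most of the dump above is the same `QueueFeederThread` stack repeated, which makes the interesting part (where `MainThread` is stuck) easy to miss. Here is a rough sketch of a summarizer for `py-spy dump` text. It assumes the layout shown above (a `Thread <id> (<status>): "<name>"` header followed by indented frames, innermost frame first); the function name `summarize_dump` is hypothetical.

```python
import re
from collections import Counter

# Header lines look like: Thread 3250260 (idle): "MainThread"
# Some threads have no name: Thread 3257098 (idle)
THREAD_RE = re.compile(r'^Thread (\S+) \(([^)]*)\)(?:: "(.*)")?\s*$')


def summarize_dump(text, top=3):
    """Return (thread-name counts, first `top` frames of MainThread)."""
    threads = []
    current = None
    for line in text.splitlines():
        match = THREAD_RE.match(line)
        if match:
            current = {
                "id": match.group(1),
                "status": match.group(2),
                "name": match.group(3) or "<unnamed>",
                "frames": [],
            }
            threads.append(current)
        elif current is not None and line.startswith(" ") and line.strip():
            # Indented lines under a header are stack frames.
            current["frames"].append(line.strip())
    counts = Counter(t["name"] for t in threads)
    main = next((t for t in threads if t["name"] == "MainThread"), None)
    return counts, (main["frames"][:top] if main else [])
```

Applied to the dump above, the frames returned for `MainThread` would be the ones inside vLLM's flash attention forward pass, reached from `_generate_completions` in the GRPO trainer.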

shepardyan avatar Jun 03 '25 03:06 shepardyan

Same issue

ShuoSIr7 avatar Jun 04 '25 01:06 ShuoSIr7

Please try the latest main code; we fixed this bug yesterday.

tastelikefeet avatar Jun 04 '25 02:06 tastelikefeet

Please try the latest main code; we fixed this bug yesterday.

Hi, thanks for the reply. I just saw that you fixed the seed when vllm_tensor_parallel_size > 1, but this script doesn't explicitly set vllm_tensor_parallel_size, so it should default to 1. Does this commit also fix the hang in that case?

ShuoSIr7 avatar Jun 04 '25 06:06 ShuoSIr7

any repro script?

hjh0119 avatar Jun 04 '25 06:06 hjh0119

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
--rlhf_type grpo \
--model $base_model \
--dataset $train_data \
--output_dir $out_dir \
--num_generations 4 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--per_device_eval_batch_size 2 \
--temperature 1.0 \
--train_type lora \
--learning_rate 1e-5 \
--lora_rank 8 \
--loss_type grpo \
--gradient_checkpointing_kwargs '{"use_reentrant": false}' \
--max_completion_length 1024 \
--num_train_epochs 1 \
--save_steps 0.1 \
--save_total_limit 1 \
--dataloader_num_workers 4 \
--dataset_num_proc 4 \
--logging_steps 1 2>&1 \
--reward_funcs format \
--reward_weights 1 \
--epsilon_high 0.25 \
--max_resample_times 2 \
--overlong_filter true \
--dynamic_sample true \
--sleep_level 1 \
--use_vllm true \
--vllm_mode colocate \
--vllm_max_model_len 15000 \
--vllm_gpu_memory_utilization 0.5 \
--repetition_penalty 1.05 \
--report_to wandb \
--log_completions true

base_model: Qwen3-8B
train_data: my own data, with prompt lengths ranging from 5000 to 15000

I tried to reproduce the hang with the repo's demo data but could not; the problem seems to occur only with long sequences.
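
One thing that may be worth checking with long prompts is how much room each prompt leaves for generation. The command above sets `--vllm_max_model_len 15000` and `--max_completion_length 1024`, so a prompt near the upper end of the stated range leaves little or no space for a completion. The sketch below only does that arithmetic; it assumes the lengths are token counts, the function name `prompts_without_room` is hypothetical, and nothing here confirms that this is the cause of the hang.

```python
def prompts_without_room(prompt_lengths, max_completion_length, max_model_len):
    """Return indices of prompts whose length plus the completion budget
    exceeds the model's context window."""
    return [
        i
        for i, length in enumerate(prompt_lengths)
        if length + max_completion_length > max_model_len
    ]


# Illustrative prompt lengths (made up), with the two limits taken
# from the command above.
flagged = prompts_without_room(
    [5000, 14000, 15000],
    max_completion_length=1024,
    max_model_len=15000,
)
print(flagged)  # [1, 2]
```

If many samples are flagged, filtering or truncating them before training is one way to test whether the hang is tied to prompt length.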

ShuoSIr7 avatar Jun 04 '25 06:06 ShuoSIr7

For long sequences, maybe you can try sequence parallel.

hjh0119 avatar Jun 04 '25 10:06 hjh0119

For long sequences, maybe you can try sequence parallel.

ok, thanks

ShuoSIr7 avatar Jun 04 '25 10:06 ShuoSIr7