Bug: MS-SWIFT GRPO + LoRA training hangs after 1 step when started from a full model merged from a LoRA adapter
Hi all,
I am doing ModelScope MS-SWIFT GRPO RL training with LoRA.
I cannot resume directly from the checkpoint because the number of GPUs available to me was reduced (ref: https://github.com/modelscope/ms-swift/issues/3989), so I have to convert the checkpoint into a merged full model and then start training from scratch from that merged model.
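For readers unfamiliar with what "merged full model" means here, below is a minimal sketch of the standard LoRA merge arithmetic, W_merged = W + (lora_alpha / r) * B A. It is not ms-swift's implementation; the function name `merge_lora` and the toy matrices are hypothetical, and plain Python lists are used so the example has no dependencies.

```python
def merge_lora(W, A, B, lora_alpha, r):
    """Fold a LoRA update into a base weight matrix.

    W: base weight, d_out x d_in (list of rows)
    B: LoRA "up" matrix, d_out x r
    A: LoRA "down" matrix, r x d_in
    Returns a new matrix W + (lora_alpha / r) * (B @ A).
    """
    scale = lora_alpha / r
    d_out, d_in = len(W), len(W[0])
    merged = [row[:] for row in W]  # copy so the base weights stay untouched
    for i in range(d_out):
        for j in range(d_in):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged


# Toy example with rank r = 1 (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[1.0], [0.0]]
print(merge_lora(W, A, B, lora_alpha=2, r=1))
```

After such a merge the adapter no longer exists as a separate object, so a later run with `--train_type lora` trains a fresh adapter on top of the merged weights.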
Then, in the training script, I supply the path to the merged full model:
swift rlhf \
--rlhf_type grpo \
--model /myprojects/ms-swift/output/Qwen2.5-7B-32GPUs/v3-20250423-132415/checkpoint-400-mergedfull \
--model_type qwen2_5 \
--train_type lora \
Surprisingly, it hung after 1 step of training.
The whole program froze...
What's wrong?
Could anybody help?
Thanks!
pip install py-spy
py-spy dump --pid <pid>
In my case, the py-spy result is:
Process 3250260: /home/user/miniconda3/envs/swift/bin/python3.11 -u /home/user/Desktop/GRPO/grpo_swift/ms-swift/swift/cli/rlhf.py --rlhf_type grpo --model /home/user/Desktop/GRPO/grpo_swift/output/sft/v4-20250530-192816/checkpoint-2319-merged --reward_funcs external_r1v_acc format --reward_weights 1 0.5 --train_type lora --lora_rank 8 --lora_alpha 16 --target_modules all-linear --torch_dtype bfloat16 --dataset open-r1/OpenThoughts-114k-math --external_plugins /home/yyq/Desktop/GRPO/grpo_swift/ms-swift/examples/train/grpo/plugin/plugin.py --max_completion_length 4096 --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 1e-5 --gradient_accumulation_steps 4 --eval_steps 100 --save_steps 100 --save_total_limit 2 --logging_steps 5 --max_length 8192 --output_dir output/grpo --warmup_ratio 0.05 --dataloader_num_workers 64 --dataset_num_proc 4 --num_generations 4 --temperature 1. --top_p 0.99 --top_k 50 --system /home/yyq/Desktop/GRPO/grpo_swift/multi_turn_grpo/tool_system.txt --deepspeed zero3 --log_completions true --report_to swanlab --swanlab_project GRPO --use_vllm true --vllm_mode colocate --offload_model true --offload_optimize true --vllm_gpu_memory_utilization 0.5 --vllm_tensor_parallel 4
Python v3.11.11 (/home/user/miniconda3/envs/swift/bin/python3.11)
Thread 3250260 (idle): "MainThread"
__call__ (torch/_ops.py:1158)
forward (vllm/v1/attention/backends/flash_attn.py:577)
unified_attention_with_output (vllm/attention/layer.py:425)
__call__ (torch/_ops.py:1158)
forward (<eval_with_key>.12:5)
_call_impl (torch/nn/modules/module.py:1762)
_wrapped_call_impl (torch/nn/modules/module.py:1751)
__call__ (torch/fx/graph_module.py:393)
call_wrapped (torch/fx/graph_module.py:830)
forward (<eval_with_key>.74:339)
_call_impl (torch/nn/modules/module.py:1762)
_wrapped_call_impl (torch/nn/modules/module.py:1751)
__call__ (torch/fx/graph_module.py:393)
call_wrapped (torch/fx/graph_module.py:830)
_fn (torch/_dynamo/eval_frame.py:838)
_call_impl (torch/nn/modules/module.py:1762)
_wrapped_call_impl (torch/nn/modules/module.py:1751)
forward (vllm/model_executor/models/qwen2.py:340)
__call__ (vllm/compilation/decorators.py:245)
forward (vllm/model_executor/models/qwen3.py:300)
_call_impl (torch/nn/modules/module.py:1762)
_wrapped_call_impl (torch/nn/modules/module.py:1751)
execute_model (vllm/v1/worker/gpu_model_runner.py:1196)
decorate_context (torch/utils/_contextlib.py:116)
execute_model (vllm/v1/worker/gpu_worker.py:276)
decorate_context (torch/utils/_contextlib.py:116)
run_method (vllm/utils.py:2605)
collective_rpc (vllm/executor/uniproc_executor.py:56)
execute_model (vllm/v1/executor/abstract.py:86)
execute_model (vllm/v1/engine/core.py:207)
step (vllm/v1/engine/core.py:226)
get_output (vllm/v1/engine/core_client.py:209)
step (vllm/v1/engine/llm_engine.py:231)
infer (swift/llm/infer/infer_engine/vllm_engine.py:475)
_engine_infer (swift/trainers/rlhf_trainer/grpo_trainer.py:1318)
_infer (swift/trainers/rlhf_trainer/grpo_trainer.py:584)
_infer_single_or_multi_turn (swift/trainers/rlhf_trainer/grpo_trainer.py:623)
_fast_infer (swift/trainers/rlhf_trainer/grpo_trainer.py:768)
_generate_completions (swift/trainers/rlhf_trainer/grpo_trainer.py:792)
_generate_and_score_completions (swift/trainers/rlhf_trainer/grpo_trainer.py:817)
_prepare_inputs (swift/trainers/rlhf_trainer/grpo_trainer.py:321)
wrapper (trl/extras/profiling.py:96)
training_step (transformers/trainer.py:3739)
training_step (swift/trainers/rlhf_trainer/grpo_trainer.py:1305)
_inner_training_loop (transformers/trainer.py:2555)
train (transformers/trainer.py:2240)
train (swift/trainers/mixin.py:369)
train (swift/llm/train/sft.py:182)
run (swift/llm/train/sft.py:122)
main (swift/llm/base.py:49)
rlhf_main (swift/llm/train/rlhf.py:169)
<module> (swift/cli/rlhf.py:5)
Thread 3250481 (idle): "Thread-1 (_read_thread)"
_recv_msg (torch/_inductor/compile_worker/subproc_pool.py:55)
_read_thread (torch/_inductor/compile_worker/subproc_pool.py:191)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3250951 (idle): "Thread-2"
wait (threading.py:331)
wait (threading.py:629)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3252597 (idle): "Thread-3 (_report_usage_worker)"
_report_continuous_usage (vllm/usage/usage_lib.py:229)
_report_usage_worker (vllm/usage/usage_lib.py:164)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3252825 (idle): "Thread-4"
wait (threading.py:331)
wait (threading.py:629)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3254651 (idle): "MsgUploader"
new_task (swanlab/data/cloud/start_thread.py:120)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255360 (idle): "Thread-13 (_pin_memory_loop)"
select (selectors.py:415)
wait (multiprocessing/connection.py:948)
_poll (multiprocessing/connection.py:440)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
do_one_step (torch/utils/data/_utils/pin_memory.py:37)
_pin_memory_loop (torch/utils/data/_utils/pin_memory.py:61)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255361 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255362 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255363 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255364 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255365 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255366 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255367 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255368 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255369 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255370 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255371 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255372 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255373 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255375 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255376 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255380 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255382 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255391 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255392 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255395 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255396 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255397 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255399 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255400 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255402 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255404 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255406 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255408 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255410 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255412 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255415 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255416 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255418 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255421 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255423 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255425 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255426 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255427 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255429 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255432 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255434 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255435 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255438 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255440 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255441 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255443 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255446 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255448 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255450 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255453 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255455 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255458 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255459 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255462 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255464 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255465 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255467 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255469 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255471 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255473 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255475 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255476 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255479 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3255481 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 3257098 (idle)
Thread 3257101 (idle)
Thread 3257104 (idle)
Thread 3257108 (idle)
Thread 3335904 (idle): "Thread-73"
wait (threading.py:331)
wait (threading.py:629)
run (threading.py:1399)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
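A dump like the one above is dominated by near-identical idle threads, which makes the interesting stack easy to miss. The following is a small sketch, not part of py-spy itself, that groups threads by name and innermost frame. The function name `summarize_dump` is hypothetical, and it assumes the text layout seen above: a `Thread <id> (<state>): "<name>"` header followed by frames listed innermost first.

```python
import re
from collections import Counter

# Header such as: Thread 3250260 (idle): "MainThread"   (the name part is optional)
THREAD_RE = re.compile(r'^Thread (\d+) \(([^)]*)\)(?:: "(.*)")?$')
# Frame such as: forward (vllm/v1/attention/backends/flash_attn.py:577)
FRAME_RE = re.compile(r'^(\S+) \((.+):(\d+)\)$')


def summarize_dump(text):
    """Count threads per (thread name, innermost frame) pair."""
    counts = Counter()
    name = None          # name of the thread currently being read
    seen_frame = False   # whether its first (innermost) frame was recorded
    for raw in text.splitlines():
        line = raw.strip()
        header = THREAD_RE.match(line)
        if header:
            if name is not None and not seen_frame:
                counts[(name, None)] += 1  # previous thread had no frames
            name = header.group(3) or "<unnamed>"
            seen_frame = False
            continue
        frame = FRAME_RE.match(line)
        if frame and name is not None and not seen_frame:
            counts[(name, line)] += 1
            seen_frame = True
    if name is not None and not seen_frame:
        counts[(name, None)] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage (assumed): py-spy dump --pid <pid> | python summarize.py
    for (thread, top), n in summarize_dump(sys.stdin.read()).most_common():
        print(f"{n:4d}  {thread}: {top}")
```

Applied to the dump in this post, the many `QueueFeederThread` entries would collapse into one row, leaving `MainThread` inside the vLLM flash-attention `forward` call as the frame to look at.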
I think all the cards hang because of communication between the GPUs, as seen in nvtop.
Same issue
Please try the latest main code; we fixed this bug yesterday.
Hi, thanks for the reply. I see that you fixed the seed for the case where vllm_tensor_parallel_size > 1, but this script doesn't explicitly set vllm_tensor_parallel_size, so it should default to 1. Does this commit also fix the hang in that case?
any repro script?
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
--rlhf_type grpo \
--model $base_model \
--dataset $train_data \
--output_dir $out_dir \
--num_generations 4 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--per_device_eval_batch_size 2 \
--temperature 1.0 \
--train_type lora \
--learning_rate 1e-5 \
--lora_rank 8 \
--loss_type grpo \
--gradient_checkpointing_kwargs '{"use_reentrant": false}' \
--max_completion_length 1024 \
--num_train_epochs 1 \
--save_steps 0.1 \
--save_total_limit 1 \
--dataloader_num_workers 4 \
--dataset_num_proc 4 \
--logging_steps 1 2>&1 \
--reward_funcs format \
--reward_weights 1 \
--epsilon_high 0.25 \
--max_resample_times 2 \
--overlong_filter true \
--dynamic_sample true \
--sleep_level 1 \
--use_vllm true \
--vllm_mode colocate \
--vllm_max_model_len 15000 \
--vllm_gpu_memory_utilization 0.5 \
--repetition_penalty 1.05 \
--report_to wandb \
--log_completions true
base_model: Qwen3-8B
train_data: my own data with prompt length range from 5000 to 15000
I tried to reproduce it with the repo's demo data but could not; the problem seems to occur only with long sequences.
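Since the hang seems tied to long inputs, one simple thing to check is how many prompts leave less room than `--max_completion_length` inside `--vllm_max_model_len`, which bounds prompt plus generated tokens. The sketch below only does that arithmetic. The helper name `prompts_over_budget` is hypothetical, it assumes the lengths are token counts (the thread does not say whether 5000 to 15000 means tokens or characters), and nothing here establishes that this is the cause of the hang.

```python
def prompts_over_budget(prompt_token_lens, max_completion_length, max_model_len):
    """Return (index, prompt_len) for prompts that cannot fit a full completion.

    A prompt is flagged when prompt_len + max_completion_length > max_model_len.
    """
    limit = max_model_len - max_completion_length
    return [(i, n) for i, n in enumerate(prompt_token_lens) if n > limit]


# Values taken from the repro script: 1024 completion tokens, 15000 context.
# The prompt lengths themselves are made-up examples.
print(prompts_over_budget([5000, 13976, 13977, 15000], 1024, 15000))
```

With the repro settings, any prompt longer than 13976 tokens would be flagged; how ms-swift and vLLM treat such prompts is not stated in this thread.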
For long sequences, maybe you can try sequence parallel.
ok, thanks