ms-swift

Training qwen2.5-7b-instruct with GRPO: output turns into "!!!!"

Open • zhangansen opened this issue 7 months ago • 1 comment

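Why is my model's output normal for the first completion at the very start of GRPO training, and then everything after that turns into this? (screenshot not preserved)

My dataset format is as follows: (screenshot not preserved)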

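Here is the reward function I wrote:

```python
from typing import List

from swift.plugin import ORM, orms
from swift.utils import get_logger

logger = get_logger()


class CorrectnessORM(ORM):

    def __call__(self, completions, **kwargs) -> List[float]:
        rewards = []
        messages = kwargs.get('messages', [])
        solution = kwargs.get('solution', [])

        for idx, completion in enumerate(completions):
            # First message of the conversation (the system prompt, if the dataset has one)
            user_input = messages[idx][0]['content']
            model_response = completion
            gt = solution[idx]

            # Exact string match: 2.0 if the completion equals the ground truth, else 0.0
            reward = 2.0 if model_response == gt else 0.0
            rewards.append(reward)

            logger.info(f"User Input: {user_input}\nModel Response: {model_response}\nGround Truth: {gt}\nReward: {reward}\n-----------------")

        return rewards


# Assumed registration; the key must match --reward_funcs external_correctness
orms['external_correctness'] = CorrectnessORM
```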
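To check the reward logic without launching a training run, here is a minimal standalone sketch. `CorrectnessReward` is a hypothetical stand-in that does not import ms-swift; it reproduces only the exact-match rule of the function above, and the sample batch assumes the trainer passes one `solution` entry per completion as a keyword argument (the same way the original code reads it).

```python
from typing import Any, Dict, List


class CorrectnessReward:
    """Hypothetical stand-in for CorrectnessORM, for local testing only."""

    def __call__(self, completions: List[str], **kwargs: Any) -> List[float]:
        solution = kwargs.get('solution', [])
        rewards = []
        for idx, completion in enumerate(completions):
            # Same exact-match rule as the original reward function
            rewards.append(2.0 if completion == solution[idx] else 0.0)
        return rewards


# Hypothetical batch: dataset columns arrive as keyword arguments,
# aligned index-by-index with the completions.
batch: Dict[str, Any] = {
    'messages': [[{'role': 'user', 'content': 'What is 6 * 7?'}]] * 3,
    'solution': ['42', '42', '42'],
}
completions = ['42', '42\n', 'The answer is 42.']

reward_fn = CorrectnessReward()
print(reward_fn(completions, **batch))  # [2.0, 0.0, 0.0]
```

Because the comparison is exact, a trailing newline or any surrounding text earns 0.0; whether to normalize first (for example with `str.strip()`) is a design choice the original function does not make.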

The launch script is as follows:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
swift rlhf \
    --rlhf_type grpo \
    --model /mnt/workspace/workgroup/base_model/Qwen2.5-7B-Instruct \
    --external_plugins /mnt/workspace/workgroup/ansen/ms-swift/grpo-self/plugin/plugin.py \
    --reward_funcs external_correctness \
    --use_vllm true \
    --output_dir './output_5_1' \
    --template qwen \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.6 \
    --vllm_max_model_len 4096 \
    --num_infer_workers 4 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --dataset '/mnt/workspace/workgroup/ansen/trl_grpo/processed_train.json' \
    --max_completion_length 4096 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 16 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --max_length 4096 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 2 \
    --dataset_num_proc 2 \
    --num_generations 2 \
    --temperature 0 \
    --deepspeed zero2 \
    --log_completions true \
    --report_to wandb \
    --acc_strategy seq \
    --eval_datasets '/mnt/workspace/workgroup/ansen/trl_grpo/processed_eval.json'
```

zhangansen, May 01 '25 10:05
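Separately from the `!!!!` symptom, `--temperature 0` combined with `--num_generations 2` also removes GRPO's learning signal. GRPO scores each completion relative to the other completions generated for the same prompt; under greedy decoding the two completions for a prompt are normally identical, so they receive identical rewards and their group-normalized advantage is zero. The sketch below assumes the common form (reward minus group mean, divided by group standard deviation plus a small epsilon); `group_advantages` is a hypothetical helper, not ms-swift's implementation.

```python
import numpy as np


def group_advantages(rewards: np.ndarray, num_generations: int, eps: float = 1e-4) -> np.ndarray:
    """Hypothetical helper: GRPO-style group-normalized advantages.

    `rewards` is flat, with `num_generations` consecutive entries per prompt.
    Each reward is centered on its group mean and divided by the group's
    sample standard deviation plus `eps` (the eps value is an assumption).
    """
    grouped = rewards.reshape(-1, num_generations)
    mean = grouped.mean(axis=1, keepdims=True)
    std = grouped.std(axis=1, ddof=1, keepdims=True)
    return ((grouped - mean) / (std + eps)).reshape(-1)


# Greedy decoding: both completions of each prompt are identical, so their
# rewards match and every advantage is zero (no learning signal).
print(group_advantages(np.array([2.0, 2.0, 0.0, 0.0]), num_generations=2))

# Sampling (temperature > 0) can yield different completions and rewards,
# giving advantages of opposite sign within the group.
print(group_advantages(np.array([2.0, 0.0]), num_generations=2))
```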

Setting the temperature to 0 is what causes this; with any other temperature the problem goes away.

zhangansen, May 06 '25 09:05
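One plausible mechanism, offered as an assumption rather than a confirmed diagnosis: if the trainer divides logits by the temperature when recomputing per-token log-probabilities, a temperature of 0 turns them into inf and then NaN. Once NaNs reach the loss and the LoRA weights, every logit becomes NaN, and in NumPy an argmax over an all-NaN vector returns index 0; if the inference side behaves similarly, the model would emit token id 0 on every step, which in the Qwen2.5 vocabulary is `!`. The sketch reproduces only this arithmetic with NumPy; `tempered_log_probs` is a hypothetical helper and nothing here calls ms-swift or vLLM.

```python
import numpy as np


def log_softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over a 1-D vector of logits."""
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))


def tempered_log_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Hypothetical helper: scale logits by 1/temperature, then log-softmax.

    Assumes the trainer applies temperature this way; NumPy warnings are
    silenced so the raw inf/NaN results are visible.
    """
    with np.errstate(divide='ignore', invalid='ignore'):
        return log_softmax(logits / temperature)


logits = np.array([2.0, -1.0, 0.5])

print(tempered_log_probs(logits, 1.0))  # finite log-probabilities
print(tempered_log_probs(logits, 0.0))  # [nan nan nan]

# If NaNs reach the weights, every logit is NaN; NumPy's argmax then
# returns the first index, i.e. token id 0.
print(int(np.argmax(np.full(5, np.nan))))  # 0
```

Under this assumption, any positive temperature keeps the division finite, which matches the observation above that changing the temperature fixes the output.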