LLaMA-Factory

Epoch rolls back during Reward Model training

Open huyufeng0407 opened this issue 3 months ago • 0 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

Launch command:

torchrun --nproc_per_node $NPROC_PER_NODE \
    --nnodes $NNODES \
    --node_rank $RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    ../../src/train_bash.py \
    --deepspeed ds_z1_config.json \
    --stage rm \
    --do_train \
    --model_name_or_path /root/model/CodeLlama-7b-hf/ \
    --create_new_adapter \
    --dataset codesftpreferv1 \
    --dataset_dir ${DATA_DIR} \
    --template default \
    --finetuning_type full \
    --output_dir ../../saves/LLaMA2-7B/full/sft4krm6w8kep5 \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 250 \
    --eval_steps 500 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 5.0 \
    --max_samples 400000000 \
    --val_size 0.001 \
    --plot_loss \
    --bf16 \
    --save_total_limit 10 \
    --report_to tensorboard

When training the reward model on 8 A800 GPUs, the following occurs: 1cae4321258759fd15e8b00ad

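The same environment was previously used for SFT training (4 nodes with 8 GPUs each, and a single node with 8 GPUs), and both runs were normal.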
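
As a rough aid for reading the step and epoch values in the logs, the sketch below works out how many samples one optimizer step consumes under the flags in the launch command. It assumes a single node with 8 processes (the 8 A800 GPUs mentioned above); the actual values of $NPROC_PER_NODE and $NNODES are not shown in this issue. The function name is hypothetical and not part of LLaMA-Factory.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, world_size: int) -> int:
    """Samples consumed per optimizer step across all processes (illustrative helper)."""
    return per_device_batch * grad_accum_steps * world_size


# Values taken from the launch command; world_size=8 is an assumption (1 node x 8 GPUs).
batch = effective_batch_size(per_device_batch=2, grad_accum_steps=2, world_size=8)

samples_per_log = batch * 10          # --logging_steps 10
samples_per_checkpoint = batch * 250  # --save_steps 250

print(batch)                   # 32
print(samples_per_log)         # 320
print(samples_per_checkpoint)  # 8000
```

Dividing the size of the training split by the effective batch size gives an approximate number of optimizer steps per epoch, which can be compared against the epoch values reported in the training log.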

Expected behavior

No response

System Info

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • transformers version: 4.38.1
  • Platform: Linux-5.10.0-1.0.0.28-x86_64-with-glibc2.27
  • Python version: 3.9.18
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Others

No response

huyufeng0407 · Mar 12 '24 08:03