
Loss spikes sharply after the first epoch, and eval does not take effect correctly

Open · Moon-404 opened this issue 9 months ago · 1 comment

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.15.0-127-generic-x86_64-with-glibc2.35
  • Python version: 3.10.15
  • PyTorch version: 2.5.1 (GPU)
  • Transformers version: 4.48.2
  • Datasets version: 2.19.1
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 4090
  • GPU number: 4
  • GPU memory: 23.64GB
  • DeepSpeed version: 0.15.4

Reproduction

Training script

export CUDA_VISIBLE_DEVICES="1,2"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml

YAML config file (*** denotes redacted private content)

### model
model_name_or_path: /models/Qwen2.5-14B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_dropout: 0.1
lora_rank: 8
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: ***_train
eval_dataset: ***_eval
template: qwen
cutoff_len: 8192
max_samples: 65535
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /home/***/models/Qwen2.5-14B-sft-8
logging_steps: 10
save_steps: 609
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-4
num_train_epochs: 6.0
lr_scheduler_type: cosine
warmup_ratio: 0
bf16: true
ddp_timeout: 180000000

### eval
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 60
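
For reference, the step counts that appear in the logs below follow directly from this config. Here is a minimal sanity check in Python, assuming 2 visible GPUs (per CUDA_VISIBLE_DEVICES above) and roughly 9,744 training samples (a hypothetical figure; the real dataset size is redacted and was only inferred from the logged steps):

# Derive the expected optimizer step counts from the config above.
# The sample count is an assumption chosen to reproduce 609 steps/epoch.
num_gpus = 2                  # CUDA_VISIBLE_DEVICES="1,2"
per_device_batch = 1          # per_device_train_batch_size
grad_accum = 8                # gradient_accumulation_steps
num_epochs = 6                # num_train_epochs
train_samples = 9_744         # hypothetical, inferred from the logs

effective_batch = per_device_batch * grad_accum * num_gpus  # 16
steps_per_epoch = train_samples // effective_batch          # 609
total_steps = steps_per_epoch * num_epochs                  # 3654
print(steps_per_epoch, total_steps)

Under these assumptions this reproduces the 609 steps per epoch and 3654 total steps reported in the logs below.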

Others

A month ago I trained a model the same way, and the loss curves at that time looked like this:

[Images: loss curves from the previous run]

Here is the loss curve from this run:

[Image: loss curve from the current run]

First, between the first and second epochs (one epoch is 609 steps), the loss rises sharply, which never happened before.

The log around the spike looks like this:

{"current_steps": 890, "total_steps": 3654, "loss": 0.0748, "lr": 0.00043031206494748253, "epoch": 1.4614121510673235, "percentage": 24.36, "elapsed_time": "5:52:52", "remaining_time": "18:15:54"}
{"current_steps": 900, "total_steps": 3654, "loss": 0.0715, "lr": 0.0004288165657508376, "epoch": 1.477832512315271, "percentage": 24.63, "elapsed_time": "5:56:13", "remaining_time": "18:10:02"}
{"current_steps": 900, "total_steps": 3654, "epoch": 1.477832512315271, "percentage": 24.63, "elapsed_time": "6:00:11", "remaining_time": "18:22:11"}
{"current_steps": 910, "total_steps": 3654, "loss": 2.1873, "lr": 0.0004273078484937361, "epoch": 1.4942528735632183, "percentage": 24.9, "elapsed_time": "6:03:32", "remaining_time": "18:16:12"}
{"current_steps": 920, "total_steps": 3654, "loss": 4.8323, "lr": 0.0004257860247000508, "epoch": 1.5106732348111658, "percentage": 25.18, "elapsed_time": "6:06:52", "remaining_time": "18:10:16"}

Second, this run does not actually perform eval. A normal eval log looks like this:

{"current_steps": 60, "total_steps": 3654, "loss": 0.5847, "lr": 0.0004996674332229131, "epoch": 0.09852216748768473, "percentage": 1.64, "elapsed_time": "0:22:57", "remaining_time": "22:54:51"}
{"current_steps": 60, "total_steps": 3654, "eval_loss": 0.5544958114624023, "epoch": 0.09852216748768473, "percentage": 1.64, "elapsed_time": "0:27:01", "remaining_time": "1 day, 2:59:01"}

Whereas the eval record from this run looks like this (note the missing eval_loss):

{"current_steps": 60, "total_steps": 3654, "loss": 0.4497, "lr": 0.0004996674332229131, "epoch": 0.09852216748768473, "percentage": 1.64, "elapsed_time": "0:20:10", "remaining_time": "20:08:52"}
{"current_steps": 60, "total_steps": 3654, "epoch": 0.09852216748768473, "percentage": 1.64, "elapsed_time": "0:24:08", "remaining_time": "1 day, 0:06:21"}

So the lesson is: if the code runs, don't touch it. I pulled the latest code before this training run, and sure enough it broke 😭

Moon-404 · Mar 03 '25

I ran into the same problem doing CPT with qwen2.5-7b. Have you resolved it?

xiadingZ · Mar 12 '25