LLaMA-Factory

Training log records are incomplete

Open zhangfan-algo opened this issue 2 months ago • 3 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

torchrun --nproc_per_node ${num_gpu_per_node} --master_port $MASTER_PORT --master_addr $MASTER_ADDR --node_rank $RANK --nnodes $WORLD_SIZE src/train.py \
    --stage sft \
    --model_name_or_path /mnt/cluster/zhangfan/models/01-ai/Yi-1.5-9B-Chat \
    --do_train \
    --do_eval \
    --dataset user_sft_prompt_0516_train_classfiy \
    --template yi \
    --finetuning_type full \
    --output_dir /mnt/cluster/test \
    --preprocessing_num_workers 60 \
    --dataloader_num_workers 60 \
    --val_size 0.03 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --gradient_checkpointing true \
    --cutoff_len 17000 \
    --max_new_tokens 1000 \
    --max_length 16000 \
    --deepspeed examples/deepspeed/ds_z2_config.json \
    --logging_strategy steps \
    --logging_first_step \
    --logging_steps 1 \
    --save_strategy epoch \
    --evaluation_strategy steps \
    --eval_steps 50 \
    --num_train_epochs 10 \
    --lr_scheduler_type cosine \
    --learning_rate 1e-5 \
    --flash_attn auto \
    --plot_loss \
    --bf16 \
    --rope_scaling linear \
    --save_on_each_node false \
    --neftune_noise_alpha 5
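
For reference, here is a minimal sketch (not part of the original run) for checking which steps actually made it into the log file. It assumes LLaMA-Factory's log callback writes one JSON record per line to <output_dir>/trainer_log.jsonl; adjust the path if your setup differs.

# inspect_trainer_log.py (hypothetical helper, assumptions noted above)
import json
from pathlib import Path

log_path = Path("/mnt/cluster/test/trainer_log.jsonl")  # output_dir from the command above

steps = []
with log_path.open() as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "current_steps" in record:
            steps.append(record["current_steps"])

print(f"records: {len(steps)}, first step: {min(steps)}, last step: {max(steps)}")

# Report gaps larger than the logging interval (logging_steps=1 in this run).
prev = None
for s in sorted(set(steps)):
    if prev is not None and s - prev > 1:
        print(f"gap: no log records between step {prev} and step {s}")
    prev = s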

Expected behavior

The training log output is incomplete, and only the first 900 steps were printed.

System Info

{"current_steps": 1, "total_steps": 850, "loss": 3.9585, "learning_rate": 9.999965849158597e-06, "epoch": 0.005871559633027523, "percentage": 0.12, "elapsed_time": "0:03:07", "remaining_time": "1 day, 20:18:20"} {"current_steps": 10, "total_steps": 850, "loss": 0.7958, "learning_rate": 9.996585300715117e-06, "epoch": 0.05871559633027523, "percentage": 1.18, "elapsed_time": "0:28:37", "remaining_time": "1 day, 16:04:13"} {"current_steps": 20, "total_steps": 850, "loss": 0.2891, "learning_rate": 9.98634586692894e-06, "epoch": 0.11743119266055047, "percentage": 2.35, "elapsed_time": "0:59:04", "remaining_time": "1 day, 16:51:17"} {"current_steps": 30, "total_steps": 850, "loss": 0.2419, "learning_rate": 9.96929568447637e-06, "epoch": 0.1761467889908257, "percentage": 3.53, "elapsed_time": "1:27:57", "remaining_time": "1 day, 16:03:59"} {"current_steps": 40, "total_steps": 850, "loss": 0.2254, "learning_rate": 9.945458041855732e-06, "epoch": 0.23486238532110093, "percentage": 4.71, "elapsed_time": "1:57:32", "remaining_time": "1 day, 15:40:04"} {"current_steps": 50, "total_steps": 850, "loss": 0.213, "learning_rate": 9.91486549841951e-06, "epoch": 0.29357798165137616, "percentage": 5.88, "elapsed_time": "2:26:05", "remaining_time": "1 day, 14:57:27"} {"current_steps": 50, "total_steps": 850, "eval_loss": 0.20849353075027466, "epoch": 0.29357798165137616, "percentage": 5.88, "elapsed_time": "2:29:16", "remaining_time": "1 day, 15:48:20"} {"current_steps": 860, "total_steps": 1700, "loss": 0.1414, "learning_rate": 4.907605475204352e-06, "epoch": 5.058715596330275, "percentage": 50.59, "elapsed_time": "0:28:50", "remaining_time": "0:28:10"} {"current_steps": 870, "total_steps": 1700, "loss": 0.144, "learning_rate": 4.815242503054277e-06, "epoch": 5.1174311926605505, "percentage": 51.18, "elapsed_time": "0:57:42", "remaining_time": "0:55:02"} {"current_steps": 880, "total_steps": 1700, "loss": 0.1466, "learning_rate": 4.7229426254201504e-06, "epoch": 5.176146788990826, "percentage": 51.76, "elapsed_time": "1:26:22", "remaining_time": "1:20:28"} {"current_steps": 890, "total_steps": 1700, "loss": 0.1455, "learning_rate": 4.630737362625631e-06, "epoch": 5.234862385321101, "percentage": 52.35, "elapsed_time": "1:54:46", "remaining_time": "1:44:27"} {"current_steps": 900, "total_steps": 1700, "loss": 0.1445, "learning_rate": 4.53865820268349e-06, "epoch": 5.293577981651376, "percentage": 52.94, "elapsed_time": "2:23:34", "remaining_time": "2:07:37"} {"current_steps": 900, "total_steps": 1700, "eval_loss": 0.14977410435676575, "epoch": 5.293577981651376, "percentage": 52.94, "elapsed_time": "2:26:45", "remaining_time": "2:10:27"}

Others

No response

zhangfan-algo · May 20 '24 08:05