LLM-Tuning
Why the epoch in the log is different from the progress bar
Thanks for your work. I want to ask why the epoch reported in the log is different from the progress bar.
I used the following command to run LoRA tuning with 8 GPUs:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python chatglm2_lora_tuning.py \
--tokenized_dataset resultcombine \
--lora_rank 8 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--max_steps 50753 \
--save_steps 10000 \
--save_total_limit 2 \
--learning_rate 1e-4 \
--fp16 \
--remove_unused_columns false \
--logging_steps 1000 \
--output_dir weights/resultcombine
My dataset size is 162410. The effective batch size should be per_device_train_batch_size * devices = 4 * 8 = 32, so one pass over the dataset takes 162410 / 32 = 5075.3125 steps, and I set max_steps to 50753 to get 10 epochs. But although the progress bar is nearly finished (50711/50753), the logged epoch still shows only 1.26:
{'loss': 0.0807, 'learning_rate': 1.1368786081610939e-05, 'epoch': 1.13}
{'loss': 0.079, 'learning_rate': 9.398459204382007e-06, 'epoch': 1.15}
{'loss': 0.08, 'learning_rate': 7.428132327153076e-06, 'epoch': 1.18}
{'loss': 0.0806, 'learning_rate': 5.459775776801371e-06, 'epoch': 1.21}
{'loss': 0.0805, 'learning_rate': 3.4894488995724393e-06, 'epoch': 1.23}
{'loss': 0.079, 'learning_rate': 1.5210923492207357e-06, 'epoch': 1.26}
100%|██████████████████████████████████████████████████| 50711/50753 [36:56:34<01:53, 2.69s/it]
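To make the arithmetic concrete, this is the calculation I did (plain Python, not code from the training script). If I back out the numbers from the log, the epoch counter seems to be using only the per-device batch size:

# Sanity-check of the numbers above (plain arithmetic, not repo code).
dataset_size = 162410
per_device_bs = 4
n_gpus = 8

expected_bs = per_device_bs * n_gpus            # 32
steps_per_epoch = dataset_size / expected_bs    # 5075.3125
print(50753 / steps_per_epoch)                  # ~10.0 epochs expected

# Backing out what the log implies: epoch 1.26 at step ~50711
implied_steps_per_epoch = 50711 / 1.26          # ~40247
print(dataset_size / implied_steps_per_epoch)   # ~4.0, i.e. the per-device batch size only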
So I want to ask: is this normal? Does the epoch counter only reflect one card, or is something wrong? (I've also pasted below the quick check I'd run on the effective batch size.) Thanks for your response.
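For reference, this is the check I mean (assuming a recent transformers version; if I read the source right, TrainingArguments.train_batch_size is per_device_train_batch_size * max(1, n_gpu)):

# Quick check of the per-step batch size the HF Trainer will actually count.
# Assumes a recent transformers version; not code from this repo.
import torch
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="weights/resultcombine",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
)
print("visible GPUs:", torch.cuda.device_count())
print("effective batch size:", args.train_batch_size)  # per_device * max(1, n_gpu)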