
Why epoch in log is different from progress

Open · jimmy-walker opened this issue 1 year ago · 0 comments

Thanks for your work. I want to ask why the epoch shown in the log is different from the progress bar.

I used the following command to run LoRA tuning with 8 GPUs.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python chatglm2_lora_tuning.py \
    --tokenized_dataset resultcombine \
    --lora_rank 8 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --max_steps 50753 \
    --save_steps 10000 \
    --save_total_limit 2 \
    --learning_rate 1e-4 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 1000 \
    --output_dir weights/resultcombine 

My dataset has 162,410 examples. The effective batch size should be per_device_train_batch_size * devices = 4 * 8 = 32, so one pass over the dataset is 162410 / 32 = 5075.3125 steps, and I set max_steps to 50753 to train for 10 epochs (see the short arithmetic sketch after the log excerpt below). However, although the progress bar is nearly finished (50711/50753), the logged epoch is still only 1.26:

{'loss': 0.0807, 'learning_rate': 1.1368786081610939e-05, 'epoch': 1.13}
{'loss': 0.079, 'learning_rate': 9.398459204382007e-06, 'epoch': 1.15}
{'loss': 0.08, 'learning_rate': 7.428132327153076e-06, 'epoch': 1.18}
{'loss': 0.0806, 'learning_rate': 5.459775776801371e-06, 'epoch': 1.21}
{'loss': 0.0805, 'learning_rate': 3.4894488995724393e-06, 'epoch': 1.23}
{'loss': 0.079, 'learning_rate': 1.5210923492207357e-06, 'epoch': 1.26}
100%|██████████████████████████████████████████████▉| 50711/50753 [36:56:34<01:53,  2.69s/it]
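
For reference, here is the arithmetic behind my expectation as a small Python sketch. The numbers are simply copied from the command and log above, not read from the Trainer itself, so the "implied" figures at the end are only my own back-of-the-envelope guess.

# Expected steps per epoch if all 8 GPUs contribute to the effective batch.
dataset_size = 162410
per_device_train_batch_size = 4
devices = 8
gradient_accumulation_steps = 1

effective_batch = per_device_train_batch_size * devices * gradient_accumulation_steps  # 32
expected_steps_per_epoch = dataset_size / effective_batch                              # ~5075.3
max_steps = 50753
print(max_steps / expected_steps_per_epoch)    # ~10.0 epochs expected

# What the log seems to imply at step 50711 with epoch 1.26.
implied_steps_per_epoch = 50711 / 1.26         # ~40247
print(dataset_size / implied_steps_per_epoch)  # ~4.0, i.e. roughly the single-card batch size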

So I want to ask: is this normal? Does the logged epoch only reflect a single card, or is something wrong? Thanks for your response.

jimmy-walker · Dec 20 '23, 03:12