nanoGPT
fix: estimate_mfu dt ZeroDivisionError
The previous estimate_mfu function can raise a ZeroDivisionError.
At line 301 of model.py, flops_achieved = flops_per_iter * (1.0/dt) raises a ZeroDivisionError when dt is zero, i.e. when the time interval between two consecutive calls to time.time() is so small that the clock measures it as 0 under floating-point precision.
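For context, dt is the difference between two consecutive time.time() readings taken around a training iteration. The tiny standalone script below only illustrates that such a pair of calls can yield exactly 0.0 on a coarse-resolution clock; it is a demonstration, not code from the repository.

import time

# On platforms where time.time() has coarse resolution, two back-to-back calls
# can return the same timestamp, so their difference is exactly 0.0, which is
# the same situation that makes dt zero in the training loop.
t0 = time.time()
t1 = time.time()
dt = t1 - t0
if dt == 0.0:
    print("dt is exactly zero; this would crash estimate_mfu")
else:
    print(f"dt = {dt:.9f} s")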
To replicate the problem:
iter 800: loss 1.4306, time 20.79ms, mfu 18.58%
iter 810: loss 1.4020, time 31.59ms, mfu 17.90%
iter 820: loss 1.4028, time 15.12ms, mfu 18.58%
iter 830: loss 1.3907, time 17.64ms, mfu 18.83%
Traceback (most recent call last):
  File "D:\Coding\AILearning\LLM\LLM_Learning\nanoGPT\train.py", line 325, in <module>
    mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Coding\AILearning\LLM\LLM_Learning\nanoGPT\model.py", line 302, in estimate_mfu
    flops_achieved = flops_per_iter * (1.0/dt) # per second
                                       ~~~^~~
ZeroDivisionError: float division by zero
I am training on a single 4090, and the ZeroDivisionError occurs every time I start the training code.
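One possible fix is to guard the division and skip the measurement when dt is zero. The sketch below is a standalone version of that idea; the function name estimate_mfu_from_flops and the -1.0 sentinel are illustrative choices, and 312e12 is the A100 bfloat16 peak FLOPS constant already used in model.py. This is not necessarily the exact patch applied upstream.

def estimate_mfu_from_flops(flops_per_iter: float, dt: float) -> float:
    """Return model FLOPs utilization for one iteration, or -1.0 if dt is unusable."""
    if dt <= 0.0:
        # the timer reported no elapsed time, so a per-second rate cannot be
        # computed; return a sentinel instead of raising ZeroDivisionError
        return -1.0
    flops_achieved = flops_per_iter * (1.0 / dt)  # FLOPs per second
    flops_promised = 312e12                       # A100 bfloat16 peak FLOPS
    return flops_achieved / flops_promised

Another option is to time the loop with time.perf_counter(), which uses the highest-resolution clock available and makes a zero dt far less likely, though the explicit guard remains the safer choice.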