nanoGPT
fix: estimate_mfu dt ZeroDivisionError
The previous estimate_mfu function can raise a ZeroDivisionError.
At line 301 of model.py, flops_achieved = flops_per_iter * (1.0/dt) raises a ZeroDivisionError when dt is zero, i.e. when the time interval between two consecutive calls to time.time() is so small that the clock measures it as 0 under floating-point precision.
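For context, dt is the difference between two consecutive time.time() readings taken around a training iteration. The tiny standalone script below only illustrates that such a pair of calls can yield exactly 0.0 on a coarse-resolution clock; it is a demonstration, not code from the repository.

import time

# On platforms where time.time() has coarse resolution, two back-to-back calls
# can return the same timestamp, so their difference is exactly 0.0, which is
# the same situation that makes dt zero in the training loop.
t0 = time.time()
t1 = time.time()
dt = t1 - t0
if dt == 0.0:
    print("dt is exactly zero; this would crash estimate_mfu")
else:
    print(f"dt = {dt:.9f} s")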
To replicate the problem:
iter 800: loss 1.4306, time 20.79ms, mfu 18.58%
iter 810: loss 1.4020, time 31.59ms, mfu 17.90%
iter 820: loss 1.4028, time 15.12ms, mfu 18.58%
iter 830: loss 1.3907, time 17.64ms, mfu 18.83%
Traceback (most recent call last):
  File "D:\Coding\AILearning\LLM\LLM_Learning\nanoGPT\train.py", line 325, in <module>
    mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Coding\AILearning\LLM\LLM_Learning\nanoGPT\model.py", line 302, in estimate_mfu
    flops_achieved = flops_per_iter * (1.0/dt) # per second
                                       ~~~^~~
ZeroDivisionError: float division by zero
I am training on a single 4090, and the ZeroDivisionError occurs every time I start the training code.
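One possible fix is to guard the division and skip the measurement when dt is zero. The sketch below is a standalone version of that idea; the function name estimate_mfu_from_flops and the -1.0 sentinel are illustrative choices, and 312e12 is the A100 bfloat16 peak FLOPS constant already used in model.py. This is not necessarily the exact patch applied upstream.

def estimate_mfu_from_flops(flops_per_iter: float, dt: float) -> float:
    """Return model FLOPs utilization for one iteration, or -1.0 if dt is unusable."""
    if dt <= 0.0:
        # the timer reported no elapsed time, so a per-second rate cannot be
        # computed; return a sentinel instead of raising ZeroDivisionError
        return -1.0
    flops_achieved = flops_per_iter * (1.0 / dt)  # FLOPs per second
    flops_promised = 312e12                       # A100 bfloat16 peak FLOPS
    return flops_achieved / flops_promised

Another option is to time the loop with time.perf_counter(), which uses the highest-resolution clock available and makes a zero dt far less likely, though the explicit guard remains the safer choice.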