
Incorrect time per step estimation when fitting model

Open koalive opened this issue 3 years ago • 4 comments

Hi there,

System information. I observed the behavior both in a Colab notebook (TF v2.8.0-0-g3f878cff5b6 2.8.0) and in a custom Docker image (Ubuntu 20.04, TF v2.5.1-97-g957590ea15c 2.5.2). The issue can easily be reproduced by using validation sets of different sizes, and I made an example notebook (see below).

Describe the problem. The time/step reported when calling fit is not the time per training step. The phrasing can be misleading, for instance when trying to design your training scheme based on time constraints.

Describe the current behavior. Currently, the time for a full epoch (including validation) divided by the number of training steps is reported. If validation takes a significant amount of time, the actual time per training step may be much smaller than reported.
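To make that concrete (illustrative numbers, not measurements):

# Hypothetical numbers: cheap training steps, expensive validation pass.
train_step_time = 0.02   # seconds actually spent per training batch
n_train_steps = 5
validation_time = 2.0    # seconds spent evaluating the validation set each epoch

epoch_time = n_train_steps * train_step_time + validation_time  # ~2.1 s
reported = epoch_time / n_train_steps  # ~0.42 s/step, roughly 20x the true per-step cost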

Describe the expected behavior. I feel that reporting the time per training step (excluding validation) would be more informative. For instance, given the following output (from the Colab notebook linked below):

Epoch 1/3
5/5 [==============================] - 2s 613ms/step - loss: 0.3212 - val_loss: 0.1224

I would expect that using a single step instead of 5 would make each epoch take around 613ms. However, each epoch would still take about 2s, since most of that time is spent on the validation set.

Standalone code to reproduce the issue.

See this colab notebook: https://colab.research.google.com/drive/1YGWstYcbFwkPY4ezZ2C-krDPahS36nnm?usp=sharing
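In case the notebook is not accessible, a minimal sketch along the same lines (a tiny training set and a much larger validation set, so validation dominates the epoch time; sizes and layer widths are arbitrary) should reproduce the effect:

import numpy as np
from tensorflow import keras

# Tiny training set, much larger validation set, so validation dominates the epoch.
x_train = np.random.rand(160, 32).astype("float32")
y_train = np.random.rand(160, 1).astype("float32")
x_val = np.random.rand(50000, 32).astype("float32")
y_val = np.random.rand(50000, 1).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(32,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# batch_size=32 gives 5 training steps per epoch; the reported ms/step also
# absorbs the time spent evaluating the 50k validation samples.
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=32, epochs=3)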

Cheers.

koalive avatar Mar 22 '22 19:03 koalive

I understand what you're implying: report a simple yet accurate per-step figure for the training itself, rather than folding the validation pass into that measurement.

fig666 avatar Mar 23 '22 09:03 fig666

This is a rather unlikely scenario in the general use case, and the current time/step, which averages the time taken per step across both training and validation data, makes sense.

sachinprasadhs avatar Mar 23 '22 23:03 sachinprasadhs

The time per epoch (including validation) is also reported, so other use cases might already be covered by that value (the 2s in my previous quote). I guess it's a matter of opinion; when I read "time/step" I expect it to represent the time taken by each training step, but the community at large might have other expectations.

koalive avatar Mar 24 '22 16:03 koalive

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] avatar Mar 31 '22 16:03 google-ml-butler[bot]

@koalive, I tried to execute the mentioned code on the latest Keras 3.0 and observed that the per-step time estimation behaves the same way. Kindly find the gist of it here. Thank you!

tilakrayal avatar Apr 01 '25 06:04 tilakrayal

@tilakrayal I just ran your example with Keras 3.8.0 and, indeed, the behavior is unchanged. It still feels wrong to me that the reported "time/step" changes when the time taken per step does not actually change (model and input data are the same) and only the total runtime differs (due to the validation data).
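For what it's worth, a rough way to measure only the training steps today is a custom callback along these lines (a minimal sketch; per-batch timing adds a little overhead):

import time
from tensorflow import keras

class TrainStepTimer(keras.callbacks.Callback):
    """Averages the wall-clock time of training batches only, validation excluded."""

    def on_epoch_begin(self, epoch, logs=None):
        self._batch_times = []

    def on_train_batch_begin(self, batch, logs=None):
        self._batch_start = time.perf_counter()

    def on_train_batch_end(self, batch, logs=None):
        self._batch_times.append(time.perf_counter() - self._batch_start)

    def on_epoch_end(self, epoch, logs=None):
        mean_ms = 1000 * sum(self._batch_times) / max(len(self._batch_times), 1)
        print(f"\nepoch {epoch + 1}: {mean_ms:.1f} ms per training step "
              f"(validation excluded, {len(self._batch_times)} steps)")

# usage: model.fit(..., callbacks=[TrainStepTimer()])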

koalive avatar Apr 14 '25 15:04 koalive

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] avatar Apr 29 '25 02:04 github-actions[bot]

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.

github-actions[bot] avatar May 15 '25 02:05 github-actions[bot]