
Tensorboard epoch summaries dependent on divisibility by update_freq

Open atyshka opened this issue 3 years ago • 4 comments

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 2.4.1
  • Python version: 3.8
  • CUDA/cuDNN version: 11.0/8.0
  • GPU model and memory: 2x RTX A6000

Describe the current behavior

I am training on a large dataset. For TensorBoard, I would like to log certain metrics, such as loss, at a high frequency, while logging other outputs, such as test images, less often, e.g. at the end of each epoch. To get frequent loss updates, I set update_freq on the Keras TensorBoard callback to a reasonably small number. Then, to log my images, I set up a LambdaCallback on epoch end. However, I was confused to find that the images were never written. I eventually discovered that the on_epoch_end lambda callback cannot log anything via tf.summary unless the dataset length in steps is divisible by update_freq. For example, with update_freq=5 and a dataset 10 steps long, the epoch-end callback can write summaries; but with a dataset 11 steps long it cannot, since 11 is not a multiple of 5. Since my lambda callback and the TensorBoard callback are separate objects, it is not intuitive that the update_freq of the TensorBoard callback affects the lambda callback. The simple workaround is to truncate my dataset to a length divisible by update_freq, but that is both unintuitive and discards valuable data.

Describe the expected behavior

I would expect lambda callbacks to be able to write summaries in on_epoch_end regardless of update_freq, or at the very least that this behavior be documented, to save others time if they encounter this issue.

Standalone code to reproduce the issue

tboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logs, update_freq=5)
file_writer = tf.summary.create_file_writer('logs/images')

def evaluate_images(epoch, logs):
    print("Writing images")
    with file_writer.as_default():
        # Prints a False tensor when the image is not written because the
        # epoch length is not divisible by update_freq
        print(tf.summary.image("Prediction", tf.zeros([1, 100, 100, 1])))

model.fit(dataset, epochs=1,
          callbacks=[tboard_callback,
                     tf.keras.callbacks.LambdaCallback(on_epoch_end=evaluate_images)])
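A possible workaround sketch, assuming the blocked writes stem from a recording condition left active by the TensorBoard callback: wrap the epoch-end write in tf.summary.record_if(True) to force recording. The log directory, dummy image, and the choice of step=epoch below are placeholders for illustration, not part of the original report.

```python
import tensorflow as tf

# Sketch of a possible workaround: wrap the epoch-end summary write in
# tf.summary.record_if(True) so the write happens regardless of any
# recording condition left active by the TensorBoard callback.
# The log directory and dummy image are placeholders.
file_writer = tf.summary.create_file_writer("logs/images")

def evaluate_images(epoch, logs):
    with file_writer.as_default(), tf.summary.record_if(True):
        # Pass step explicitly; outside model.fit no default step is set.
        written = tf.summary.image(
            "Prediction", tf.zeros([1, 100, 100, 1]), step=epoch)
    return bool(written)
```

If this works, the same wrapper can be applied inside the LambdaCallback from the snippet above without changing anything else.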


atyshka avatar May 11 '21 15:05 atyshka

@atyshka ,

In order to reproduce the issue reported here, could you please provide the complete code and the dataset you are using? Thanks!

tilakrayal avatar May 12 '21 10:05 tilakrayal

Thanks for the report and the clear steps. I'm not able to reproduce this yet. Taking the example Colab from https://colab.sandbox.google.com/github/tensorflow/datasets/blob/master/docs/keras_example.ipynb, adding a TensorBoard callback with update_freq=5, and running for 11 epochs:

tboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logs, update_freq=5)
file_writer = tf.summary.create_file_writer(logs)

def evaluate_images(epoch, logs):
    print("Writing images")
    with file_writer.as_default():
        print(tf.summary.image("Prediction", tf.zeros([1, 100, 100, 1])))

model.fit(
    ds_train,
    epochs=11,
    validation_data=ds_test,
    callbacks=[tboard_callback, tf.keras.callbacks.LambdaCallback(on_epoch_end=evaluate_images)]
)

I'm still able to see images written. [screenshot]

For this case of epochs=11, update_freq=5, maybe I'm missing something?

However, when the number of epochs is less than update_freq, e.g. epochs=1 with update_freq=5, I can confirm that no images are written. For that case, we can update the docs to make this clearer.

psybuzz avatar May 19 '21 22:05 psybuzz

@psybuzz I think you misinterpreted what I meant by divisibility: not 11 epochs, but 11 steps. Using that same Colab notebook, change the epochs to 1. Because the dataset is 469 steps long, which is not divisible by 5, it will not write images. The reason yours worked is that every 5 epochs the cumulative step count becomes divisible by 5, so images are written during the 5th and 10th epochs but not during any others. See the example output here (notice False everywhere other than epochs 5 and 10):

Epoch 1/11
469/469 [==============================] - 3s 5ms/step - loss: 0.6077 - sparse_categorical_accuracy: 0.8307 - val_loss: 0.1964 - val_sparse_categorical_accuracy: 0.9458
Writing images
tf.Tensor(False, shape=(), dtype=bool)
Epoch 2/11
469/469 [==============================] - 1s 3ms/step - loss: 0.1810 - sparse_categorical_accuracy: 0.9503 - val_loss: 0.1398 - val_sparse_categorical_accuracy: 0.9580
Writing images
tf.Tensor(False, shape=(), dtype=bool)
Epoch 3/11
469/469 [==============================] - 1s 3ms/step - loss: 0.1253 - sparse_categorical_accuracy: 0.9642 - val_loss: 0.1174 - val_sparse_categorical_accuracy: 0.9661
Writing images
tf.Tensor(False, shape=(), dtype=bool)
Epoch 4/11
469/469 [==============================] - 1s 3ms/step - loss: 0.0935 - sparse_categorical_accuracy: 0.9726 - val_loss: 0.0937 - val_sparse_categorical_accuracy: 0.9703
Writing images
tf.Tensor(False, shape=(), dtype=bool)
Epoch 5/11
469/469 [==============================] - 1s 3ms/step - loss: 0.0747 - sparse_categorical_accuracy: 0.9792 - val_loss: 0.0954 - val_sparse_categorical_accuracy: 0.9723
Writing images
tf.Tensor(True, shape=(), dtype=bool)
Epoch 6/11
469/469 [==============================] - 1s 3ms/step - loss: 0.0604 - sparse_categorical_accuracy: 0.9831 - val_loss: 0.0863 - val_sparse_categorical_accuracy: 0.9732
Writing images
tf.Tensor(False, shape=(), dtype=bool)
Epoch 7/11
469/469 [==============================] - 1s 3ms/step - loss: 0.0523 - sparse_categorical_accuracy: 0.9847 - val_loss: 0.0764 - val_sparse_categorical_accuracy: 0.9769
Writing images
tf.Tensor(False, shape=(), dtype=bool)
Epoch 8/11
469/469 [==============================] - 1s 3ms/step - loss: 0.0403 - sparse_categorical_accuracy: 0.9890 - val_loss: 0.0813 - val_sparse_categorical_accuracy: 0.9750
Writing images
tf.Tensor(False, shape=(), dtype=bool)
Epoch 9/11
469/469 [==============================] - 1s 3ms/step - loss: 0.0349 - sparse_categorical_accuracy: 0.9904 - val_loss: 0.0744 - val_sparse_categorical_accuracy: 0.9767
Writing images
tf.Tensor(False, shape=(), dtype=bool)
Epoch 10/11
469/469 [==============================] - 1s 3ms/step - loss: 0.0289 - sparse_categorical_accuracy: 0.9921 - val_loss: 0.0724 - val_sparse_categorical_accuracy: 0.9781
Writing images
tf.Tensor(True, shape=(), dtype=bool)
Epoch 11/11
469/469 [==============================] - 1s 3ms/step - loss: 0.0240 - sparse_categorical_accuracy: 0.9941 - val_loss: 0.0801 - val_sparse_categorical_accuracy: 0.9755
Writing images
tf.Tensor(False, shape=(), dtype=bool)
<tensorflow.python.keras.callbacks.History at 0x7f760ad61790>
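The divisibility pattern in the log above can be checked with a short, framework-free simulation. The 469 steps per epoch and update_freq=5 are taken from the log; the behavior of a single global batch counter carrying across epochs is an assumption about the callback's bookkeeping, not taken from the Keras source.

```python
# Simulate a global batch counter advancing 469 steps per epoch and check,
# at each epoch boundary, whether the cumulative step count is divisible
# by update_freq. Only epochs whose boundary lands on a multiple of 5
# should be able to write summaries.
steps_per_epoch = 469
update_freq = 5

writable_epochs = [
    epoch for epoch in range(1, 12)
    if (epoch * steps_per_epoch) % update_freq == 0
]
print(writable_epochs)  # [5, 10]
```

This reproduces exactly the True results at epochs 5 and 10 seen in the log, supporting the step-count (rather than epoch-count) reading of the divisibility condition.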

atyshka avatar May 24 '21 02:05 atyshka

I stumbled across exactly the same issue as described by @atyshka using TensorFlow version 2.14.0.

parzivalN avatar Dec 07 '23 15:12 parzivalN