
Triangular2/Exponential cyclical learning rates do not work when logging with Tensorboard

Open varun-parthasarathy opened this issue 3 years ago • 4 comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux CentOS 8/Windows 11
  • TensorFlow version and how it was installed (source or binary): 2.6.0 binary
  • TensorFlow-Addons version and how it was installed (source or binary): 0.14.0 binary
  • Python version: 3.8.6 (Linux)/3.9.7 (Windows)
  • Is GPU used? (yes/no): yes

Describe the bug

When training a model with the Triangular2 cyclical learning rate policy, with scale_mode set to 'iterations' and step_size equal to the number of steps in an epoch, training stops after 1 epoch with the following error if a Tensorboard callback is included -

Traceback (most recent call last):
  File "F:\error_example.py", line 49, in <module>
    model.fit(ds_train, epochs=6, validation_data=ds_test, callbacks=[tensorboard_callback])
  File "C:\Users\varun\mlenv\lib\site-packages\keras\engine\training.py", line 1230, in fit
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "C:\Users\varun\mlenv\lib\site-packages\keras\callbacks.py", line 413, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "C:\Users\varun\mlenv\lib\site-packages\keras\callbacks.py", line 2444, in on_epoch_end
    self._log_epoch_metrics(epoch, logs)
  File "C:\Users\varun\mlenv\lib\site-packages\keras\callbacks.py", line 2492, in _log_epoch_metrics
    train_logs = self._collect_learning_rate(train_logs)
  File "C:\Users\varun\mlenv\lib\site-packages\keras\callbacks.py", line 2471, in _collect_learning_rate
    logs['learning_rate'] = lr_schedule(self.model.optimizer.iterations)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow_addons\optimizers\cyclical_learning_rate.py", line 102, in __call__
    ) * tf.maximum(tf.cast(0, dtype), (1 - x)) * self.scale_fn(mode_step)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow_addons\optimizers\cyclical_learning_rate.py", line 238, in <lambda>
    scale_fn=lambda x: 1 / (2.0 ** (x - 1)),
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1399, in r_binary_op_wrapper
    y, x = maybe_promote_tensors(y, x)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1335, in maybe_promote_tensors
    ops.convert_to_tensor(tensor, dtype, name="x"))
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\profiler\trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\ops.py", line 1566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 271, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 283, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 308, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 106, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
TypeError: Cannot convert 2.0 to EagerTensor of dtype int64

This error occurs regardless of whether mixed precision is used. I was able to reproduce it in both Linux and Windows environments. I tried the nightly version of tensorflow-addons, with no luck. I also tried defining my own lambda function for the scale_fn together with the base CyclicalLearningRate class (roughly as sketched below), but the same error occurred.
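For reference, the custom scale_fn attempt was roughly along these lines (a minimal sketch; the constants are illustrative and simply mirror the built-in Triangular2 scaling):

import tensorflow as tf
import tensorflow_addons as tfa

# Same shape as the built-in Triangular2 policy, but with a user-supplied scale_fn.
# This still hits the same TypeError when the TensorBoard callback calls the schedule,
# apparently because the callback passes the optimizer's raw int64 `iterations` tensor
# and the float literal in scale_fn cannot be promoted to int64.
lr = tfa.optimizers.CyclicalLearningRate(
    initial_learning_rate=0.001,
    maximal_learning_rate=0.1,
    step_size=200,
    scale_fn=lambda x: 1 / (2.0 ** (x - 1)),
    scale_mode='iterations',
)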

This error is also confirmed to occur with the ExponentialCyclicalLearningRate policy. Only the TriangularCyclicalLearningRate policy runs without errors of any sort.

The error only occurs when attempting to log training; if the Tensorboard callback is not included, training proceeds without any issues. The problem also occurs when write_graph is set to True.

Code to reproduce the issue

This code reproduces the error -

import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow_datasets as tfds
from tensorflow.keras import mixed_precision

#policy = mixed_precision.Policy('mixed_float16')
#mixed_precision.set_global_policy(policy)

(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)
def normalize_img(image, label):
  return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)
ds_test = ds_test.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

lr = tfa.optimizers.Triangular2CyclicalLearningRate(initial_learning_rate=0.001,
                                                    maximal_learning_rate=0.1,
                                                    step_size=200,
                                                    scale_mode='iterations')
log_dir = './logs/log_now'
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, update_freq=100,
                                                      write_graph=False)

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, dtype=tf.float32)
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(lr),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
)

model.fit(ds_train, epochs=6, validation_data=ds_test, callbacks=[tensorboard_callback])

This is a very weird bug; it would be great if there were a workaround of some kind that does not prevent generation of logs.
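The only direction I can think of (untested, and purely a guess from the traceback, which shows the callback passing the raw int64 optimizer.iterations to the schedule) would be to wrap the schedule so the step is cast to float before scale_fn sees it. The FloatStepSchedule name below is hypothetical:

import tensorflow as tf

class FloatStepSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Sketch of a wrapper that delegates to an inner schedule after casting the step."""

    def __init__(self, schedule):
        self.schedule = schedule

    def __call__(self, step):
        # The TensorBoard callback calls the schedule with optimizer.iterations (int64);
        # casting here would avoid the int64/float promotion error in scale_fn, while
        # being a no-op during training, where the step is already a float.
        return self.schedule(tf.cast(step, tf.float32))

    def get_config(self):
        # Minimal config; enough for this sketch, not for full serialization.
        return {"schedule": self.schedule}

# Hypothetical usage: wrap the cyclical schedule before handing it to the optimizer.
# lr = FloatStepSchedule(tfa.optimizers.Triangular2CyclicalLearningRate(...))
# model.compile(optimizer=tf.keras.optimizers.SGD(lr), ...)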

varun-parthasarathy avatar Oct 31 '21 06:10 varun-parthasarathy

Okay, I can reproduce the bug. I will look at it.

vulkomilev avatar Nov 30 '21 16:11 vulkomilev

Okay, I have fixed it. I just need to merge the solution.

vulkomilev avatar Nov 30 '21 16:11 vulkomilev

@vulkomilev can I ask - why was this occurring, and how did you fix it?

varun-parthasarathy avatar Dec 20 '21 17:12 varun-parthasarathy

I have this problem as well with ExponentialCyclicalLearningRate; will there be a fix? This is incredibly off-putting when running several experiments to try out optimiser and learning rate details...

romanovzky avatar Sep 21 '22 08:09 romanovzky