Triangular2/Exponential cyclical learning rates do not work when logging with Tensorboard
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux CentOS 8/Windows 11
- TensorFlow version and how it was installed (source or binary): 2.6.0 binary
- TensorFlow-Addons version and how it was installed (source or binary): 0.14.0 binary
- Python version: 3.8.6 (Linux)/3.9.7 (Windows)
- Is GPU used? (yes/no): yes
Describe the bug
When training a model with the Triangular2 cyclical learning rate policy, with scale_mode set to 'iterations' and step_size equal to the number of steps in an epoch, training stops after 1 epoch with the following error if a TensorBoard callback is included -
Traceback (most recent call last):
File "F:\error_example.py", line 49, in <module>
model.fit(ds_train, epochs=6, validation_data=ds_test, callbacks=[tensorboard_callback])
File "C:\Users\varun\mlenv\lib\site-packages\keras\engine\training.py", line 1230, in fit
callbacks.on_epoch_end(epoch, epoch_logs)
File "C:\Users\varun\mlenv\lib\site-packages\keras\callbacks.py", line 413, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "C:\Users\varun\mlenv\lib\site-packages\keras\callbacks.py", line 2444, in on_epoch_end
self._log_epoch_metrics(epoch, logs)
File "C:\Users\varun\mlenv\lib\site-packages\keras\callbacks.py", line 2492, in _log_epoch_metrics
train_logs = self._collect_learning_rate(train_logs)
File "C:\Users\varun\mlenv\lib\site-packages\keras\callbacks.py", line 2471, in _collect_learning_rate
logs['learning_rate'] = lr_schedule(self.model.optimizer.iterations)
File "C:\Users\varun\mlenv\lib\site-packages\tensorflow_addons\optimizers\cyclical_learning_rate.py", line 102, in __call__
) * tf.maximum(tf.cast(0, dtype), (1 - x)) * self.scale_fn(mode_step)
File "C:\Users\varun\mlenv\lib\site-packages\tensorflow_addons\optimizers\cyclical_learning_rate.py", line 238, in <lambda>
scale_fn=lambda x: 1 / (2.0 ** (x - 1)),
File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1399, in r_binary_op_wrapper
y, x = maybe_promote_tensors(y, x)
File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1335, in maybe_promote_tensors
ops.convert_to_tensor(tensor, dtype, name="x"))
File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\profiler\trace.py", line 163, in wrapped
return func(*args, **kwargs)
File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\ops.py", line 1566, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\tensor_conversion_registry.py", line 52, in _default_conversion_function
return constant_op.constant(value, dtype, name=name)
File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 271, in constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 283, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 308, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "C:\Users\varun\mlenv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 106, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
TypeError: Cannot convert 2.0 to EagerTensor of dtype int64
This error occurs irrespective of whether mixed precision is used. I was able to reproduce it in both Linux and Windows environments. I also tried the nightly version of tensorflow-addons, with no luck, and tried defining my own lambda function for scale_fn with the base CyclicalLearningRate class, but the same error occurred.
This error is also confirmed to occur with the ExponentialCyclicalLearningRate policy. Only the TriangularCyclicalLearningRate policy runs without errors of any sort.
The error only occurs when logging training: if the TensorBoard callback is not included, training proceeds without any issues. The problem occurs even if write_graph is set to True.
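Judging from the last frames of the traceback, the TensorBoard callback calls the schedule directly with self.model.optimizer.iterations, an int64 counter, and the Triangular2 scale_fn (lambda x: 1 / (2.0 ** (x - 1))) then evaluates 2.0 ** (x - 1) on an int64 tensor. A minimal sketch, only to illustrate my reading of the traceback, that reproduces the same TypeError in isolation:

import tensorflow as tf

# optimizer.iterations is an int64 counter; the TensorBoard callback passes it
# straight into the learning-rate schedule, so scale_fn receives an int64 tensor.
step = tf.constant(5, dtype=tf.int64)

try:
    # Same expression as the Triangular2 scale_fn: lambda x: 1 / (2.0 ** (x - 1))
    _ = 1 / (2.0 ** (step - 1))
except TypeError as err:
    print(err)  # Cannot convert 2.0 to EagerTensor of dtype int64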
Code to reproduce the issue
The following code reproduces the error -
import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow_datasets as tfds
from tensorflow.keras import mixed_precision

#policy = mixed_precision.Policy('mixed_float16')
#mixed_precision.set_global_policy(policy)

(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

def normalize_img(image, label):
    return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

ds_test = ds_test.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

lr = tfa.optimizers.Triangular2CyclicalLearningRate(initial_learning_rate=0.001,
                                                    maximal_learning_rate=0.1,
                                                    step_size=200,
                                                    scale_mode='iterations')

log_dir = './logs/log_now'
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, update_freq=100,
                                                      write_graph=False)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, dtype=tf.float32)
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(lr),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
)

model.fit(ds_train, epochs=6, validation_data=ds_test, callbacks=[tensorboard_callback])
This is a very weird bug; it would be great if there were a workaround of some kind that does not prevent the logs from being generated.
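A possible workaround sketch, assuming the base tfa.optimizers.CyclicalLearningRate class accepts an explicit scale_fn alongside the same remaining arguments as the Triangular2 variant: cast the step to a float inside the scale_fn so the int64 iterations counter passed by the TensorBoard callback no longer collides with the float constant 2.0. I have not verified this; the custom lambda mentioned above may have failed simply because it lacked this cast.

import tensorflow as tf
import tensorflow_addons as tfa

# Same schedule as Triangular2CyclicalLearningRate, except that scale_fn casts
# its argument to float32, so it also accepts the int64 optimizer.iterations
# that the TensorBoard callback passes in. Untested sketch.
lr = tfa.optimizers.CyclicalLearningRate(
    initial_learning_rate=0.001,
    maximal_learning_rate=0.1,
    step_size=200,
    scale_mode='iterations',
    scale_fn=lambda x: 1.0 / (2.0 ** (tf.cast(x, tf.float32) - 1.0)),
)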
Okay, I can reproduce the bug. I will look at it.
Okay, I have fixed it. I just need to merge the solution.
@vulkomilev can I ask - why was this occurring, and how did you fix it?
I have this problem as well with ExponentialCyclicalLearningRate
Will there be a fix? This is incredibly off-putting when running several experiments trying out optimiser and learning rate details...