clearml Restarting training and reusing last task creates gaps in iteration axis in scalars section

Describe the bug

I'm training a model with the tf.estimator API. I then abort training and restart it while reusing the last task. The only code I added is:

task = Task.init(project_name='OCR/CRNN',
                 task_type='training',
                 task_name='CRNN from scratch',
                 reuse_last_task_id=True,
                 continue_last_task=True)

After restarting training, huge gaps appear on the iteration axis (see the screenshot).

To reproduce

Use the following sample script:

import os
import dataclasses as dc
import numpy as np
import tensorflow as tf
import tensorflow.keras as tfk
from clearml import Task


(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)


@dc.dataclass
class Config:
    batch_size = 32
    learning_rate = 1e-4
    model_directory = '/path/to/mnist_estimator'


#%%
task = Task.init(project_name='tf.estimator/DNNClassifier-MNIST',
                 task_type='training',
                 task_name='DNNClassifier',
                 reuse_last_task_id=True,
                 continue_last_task=True)


os.environ['CUDA_VISIBLE_DEVICES'] = ''
config = Config()
feature_columns = [tf.feature_column.numeric_column("x", shape=[28, 28])]

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[256, 32],
    optimizer=tfk.optimizers.Adam(learning_rate=config.learning_rate),
    n_classes=10,
    dropout=0.1,
    config=tf.estimator.RunConfig(
        save_summary_steps=x_train.shape[0] // config.batch_size,
        save_checkpoints_secs=10,
        session_config=tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True)),
        log_step_count_steps=1000,
    ),
    model_dir=config.model_directory
)

train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_train},
    y=y_train,
    num_epochs=None,
    batch_size=config.batch_size,
    shuffle=True,
)

test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_test},
    y=y_test,
    num_epochs=1,
    shuffle=False
)

tf.estimator.train_and_evaluate(
    classifier,
    tf.estimator.TrainSpec(
        input_fn=train_input_fn
    ),
    tf.estimator.EvalSpec(
        input_fn=test_input_fn,
        steps=None,
        throttle_secs=5
    )
)

1. Train the model with the tf.estimator API.
2. Abort training.
3. Continue training by running the same script again.
4. Huge gaps appear on the iteration axis.

Expected behaviour

The graphs contain no large gaps, and the iteration number (the global step in this case) is reported correctly.
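
This expectation can be phrased as a check on the iteration numbers of a scalar series: consecutive points should be no further apart than the summary interval. The sketch below is a plain-Python illustration, not part of the ClearML SDK. `find_gaps` and the sample iteration numbers are hypothetical; the stride of 1875 comes from 60000 MNIST training images divided by the batch size of 32.

```python
def find_gaps(iterations, expected_stride):
    """Return (previous, current) pairs spaced further apart than expected_stride."""
    ordered = sorted(set(iterations))
    return [(prev, cur) for prev, cur in zip(ordered, ordered[1:])
            if cur - prev > expected_stride]


# Hypothetical iteration numbers: three points from a first run,
# then two points reported after a restart.
reported = [1875, 3750, 5625, 13126, 15001]
gaps = find_gaps(reported, expected_stride=1875)
print(gaps)
```

An empty result would mean the axis is continuous at the chosen stride.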

Environment

Server type: self hosted
ClearML SDK Version: 1.6.4
ClearML Server Version: WebApp: 1.6.0-213 • Server: 1.6.0-213 • API: 2.20
Python Version: 3.9.12
OS: Linux

Aug 31 '22 08:08 antonlukyanov

Hi @antonlukyanov,

I tried to reproduce your scenario with a simple script and couldn't. This is what I used:

from clearml import Task, Logger
from time import sleep
import random

t = Task.init(project_name='tests', task_name='continue test',
              reuse_last_task_id=True, continue_last_task=True)
l = t.get_logger()

print('initial iteration {} last iteration {}'.format(t.get_initial_iteration(), t.get_last_iteration()))

for i in range(1, 1000000):
    print(i)
    l.report_scalar(title='my_title', series='my_series', value=i + random.randrange(0, 5), iteration=i)
    sleep(0.001)

print('initial iteration {} last iteration {}'.format(t.get_initial_iteration(), t.get_last_iteration()))

Can you also try to add this print after Task.init and see if iterations make sense when resuming?

Lastly, I looked for example code for tf estimators and found only a linear regression one. Is there a simple example I can use to reproduce this?

Sep 01 '22 09:09 erezalg

Hi @erezalg, thanks for the reply. I noticed this behaviour with estimators, which your code doesn't use. It also happens when training is aborted and resumed by running the same script again, not when the script merely sleeps. I'll come up with sample code a bit later.
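
One possible explanation for why estimators behave differently, stated here as an assumption rather than a confirmed diagnosis: with continue_last_task=True the SDK offsets newly reported iterations by the task's last iteration plus one, while the estimator restores its global step from the checkpoint in the model directory and keeps reporting absolute step numbers. If both happen, each restart shifts the axis by roughly the step count reached so far. The sketch below models only that arithmetic in plain Python; `reported_iterations` and all the numbers are hypothetical.

```python
def reported_iterations(initial_offset, global_steps):
    """Iteration numbers that would be recorded if a fixed offset were
    added to every absolute global step (assumed behaviour)."""
    return [initial_offset + step for step in global_steps]


# First run: fresh task, no offset, one summary every 1875 steps.
first_run = reported_iterations(0, [1875, 3750, 5625])

# Restart: the estimator resumes from the checkpoint at step 5625,
# and the offset is assumed to be the last iteration + 1.
offset = first_run[-1] + 1
second_run = reported_iterations(offset, [7500, 9375])

# The next point should land at 7500 but is recorded far to the right.
gap = second_run[0] - first_run[-1]
print(first_run, second_run, gap)
```

Under these assumptions a manual loop that restarts its counter from 1 on every run would not show the gap, because the offset is exactly what keeps its axis continuous.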

Sep 02 '22 14:09 antonlukyanov

@erezalg Here's the script to train DNNClassifier on MNIST data which reproduces the bug. TensorFlow version is 2.9.

import os
import dataclasses as dc
import numpy as np
import tensorflow as tf
import tensorflow.keras as tfk
from clearml import Task


(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)


@dc.dataclass
class Config:
    batch_size = 32
    learning_rate = 1e-4
    model_directory = '/path/to/mnist_estimator'


#%%
task = Task.init(project_name='tf.estimator/DNNClassifier-MNIST',
                 task_type='training',
                 task_name='DNNClassifier',
                 reuse_last_task_id=True,
                 continue_last_task=True)


os.environ['CUDA_VISIBLE_DEVICES'] = ''
config = Config()
feature_columns = [tf.feature_column.numeric_column("x", shape=[28, 28])]

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[256, 32],
    optimizer=tfk.optimizers.Adam(learning_rate=config.learning_rate),
    n_classes=10,
    dropout=0.1,
    config=tf.estimator.RunConfig(
        save_summary_steps=x_train.shape[0] // config.batch_size,
        save_checkpoints_secs=10,
        session_config=tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True)),
        log_step_count_steps=1000,
    ),
    model_dir=config.model_directory
)

train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_train},
    y=y_train,
    num_epochs=None,
    batch_size=config.batch_size,
    shuffle=True,
)

test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_test},
    y=y_test,
    num_epochs=1,
    shuffle=False
)

tf.estimator.train_and_evaluate(
    classifier,
    tf.estimator.TrainSpec(
        input_fn=train_input_fn
    ),
    tf.estimator.EvalSpec(
        input_fn=test_input_fn,
        steps=None,
        throttle_secs=5
    )
)

Sep 05 '22 17:09 antonlukyanov

Hi @antonlukyanov,

Thanks for the code. We are now able to reproduce the issue and will let you know once it is resolved.

Sep 06 '22 08:09 erezalg
