Restarting training and reusing last task creates gaps in iteration axis in scalars section

Open • antonlukyanov opened this issue 2 years ago • 4 comments

Describe the bug

I'm training a model with the tf.estimator API. I then abort training and restart it while reusing the last task. The only code I added is:

task = Task.init(project_name='OCR/CRNN',
                 task_type='training',
                 task_name='CRNN from scratch',
                 reuse_last_task_id=True,
                 continue_last_task=True)

After restarting training, huge gaps appear on the iteration axis (see the screenshot).

image

To reproduce

  1. Use the following sample script:
import os
import dataclasses as dc
import numpy as np
import tensorflow as tf
import tensorflow.keras as tfk
from clearml import Task


(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)


@dc.dataclass
class Config:
    batch_size = 32
    learning_rate = 1e-4
    model_directory = '/path/to/mnist_estimator'


#%%
task = Task.init(project_name='tf.estimator/DNNClassifier-MNIST',
                 task_type='training',
                 task_name='DNNClassifier',
                 reuse_last_task_id=True,
                 continue_last_task=True)


os.environ['CUDA_VISIBLE_DEVICES'] = ''
config = Config()
feature_columns = [tf.feature_column.numeric_column("x", shape=[28, 28])]

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[256, 32],
    optimizer=tfk.optimizers.Adam(learning_rate=config.learning_rate),
    n_classes=10,
    dropout=0.1,
    config=tf.estimator.RunConfig(
        save_summary_steps=x_train.shape[0] / config.batch_size,
        save_checkpoints_secs=10,
        session_config=tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True)),
        log_step_count_steps=1000,
    ),
    model_dir=config.model_directory
)

train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_train},
    y=y_train,
    num_epochs=None,
    batch_size=config.batch_size,
    shuffle=True,
)

test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_test},
    y=y_test,
    num_epochs=1,
    shuffle=False
)

tf.estimator.train_and_evaluate(
    classifier,
    tf.estimator.TrainSpec(
        input_fn=train_input_fn
    ),
    tf.estimator.EvalSpec(
        input_fn=test_input_fn,
        steps=None,
        throttle_secs=5
    )
)
  2. Train the model with the tf.estimator API
  3. Abort training
  4. Continue training
  5. Huge gaps appear on the iteration axis.

Expected behaviour

Graphs should not contain huge gaps, and the iteration number (the global step in this case) should be obtained correctly.
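
To make the symptom concrete, here is a minimal pure-Python sketch of one way such a gap could arise. It is an assumption for illustration only, not a cause confirmed in this thread: it supposes that the logger adds the last reported iteration as an offset on resume, while the estimator also restores its global step from the checkpoint, so the step is counted twice. The function name and the step values are hypothetical, and no ClearML or TensorFlow call is involved.

```python
def reported_iterations(global_steps, initial_offset):
    """Return the x-axis values a logger would plot if it adds a fixed
    offset to every step it receives (hypothetical model, not ClearML code)."""
    return [initial_offset + step for step in global_steps]


# First run: training starts from scratch, so no offset is applied.
first_run = reported_iterations([1000, 2000, 3000], initial_offset=0)

# Resumed run: the checkpoint restores the global step to 3000, so the
# framework keeps counting from there on its own.
resumed_steps = [4000, 5000]

# Assumed faulty case: the last seen iteration is added again as an offset.
double_counted = reported_iterations(resumed_steps, initial_offset=first_run[-1])

# Expected case: the restored global step is used as-is.
expected = reported_iterations(resumed_steps, initial_offset=0)

# Distance between the last point before the restart and the first one after.
gap = double_counted[0] - first_run[-1]

print(first_run + double_counted)  # axis with a jump after the restart
print(first_run + expected)        # contiguous axis
print(gap)
```

If this assumption holds, the size of each gap would track the global step at which training was aborted, which is something the screenshot can be checked against.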

Environment

  • Server type: self hosted
  • ClearML SDK Version: 1.6.4
  • ClearML Server Version: WebApp: 1.6.0-213 • Server: 1.6.0-213 • API: 2.20
  • Python Version: 3.9.12
  • OS: Linux

antonlukyanov, Aug 31 '22 08:08

Hi @antonlukyanov,

I tried to reproduce your scenario with a simple script but couldn't. I used this:

from clearml import Task, Logger
from time import sleep
import random

t = Task.init(project_name='tests', task_name='continue test', reuse_last_task_id=True, continue_last_task=True)
l = t.get_logger()

# Iteration counters right after init (before anything is reported)
print('initial iteration {} last iteration {}'.format(t.get_initial_iteration(), t.get_last_iteration()))

for i in range(1, 1000000):
    print(i)
    l.report_scalar(title='my_title', series='my_series', value=i + random.randrange(0, 5), iteration=i)
    sleep(0.001)

# Iteration counters after reporting
print('initial iteration {} last iteration {}'.format(t.get_initial_iteration(), t.get_last_iteration()))

Can you also try to add this print after Task.init and see if iterations make sense when resuming?

Lastly, I looked for example code for tf estimators and found only a linear regression one. Is there a simple example I can use to reproduce this?

erezalg, Sep 01 '22 09:09

Hi @erezalg, thanks for the reply. I noticed this behaviour with estimators, whereas your code doesn't use them. Also, it happens when training is aborted and then resumed by running the same script again, not when the script is just put to sleep. Let me come up with sample code a bit later.

antonlukyanov, Sep 02 '22 14:09

@erezalg Here's the script to train DNNClassifier on MNIST data which reproduces the bug. TensorFlow version is 2.9.

import os
import dataclasses as dc
import numpy as np
import tensorflow as tf
import tensorflow.keras as tfk
from clearml import Task


(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)


@dc.dataclass
class Config:
    batch_size = 32
    learning_rate = 1e-4
    model_directory = '/path/to/mnist_estimator'


#%%
task = Task.init(project_name='tf.estimator/DNNClassifier-MNIST',
                 task_type='training',
                 task_name='DNNClassifier',
                 reuse_last_task_id=True,
                 continue_last_task=True)


os.environ['CUDA_VISIBLE_DEVICES'] = ''
config = Config()
feature_columns = [tf.feature_column.numeric_column("x", shape=[28, 28])]

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[256, 32],
    optimizer=tfk.optimizers.Adam(learning_rate=config.learning_rate),
    n_classes=10,
    dropout=0.1,
    config=tf.estimator.RunConfig(
        save_summary_steps=x_train.shape[0] / config.batch_size,
        save_checkpoints_secs=10,
        session_config=tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True)),
        log_step_count_steps=1000,
    ),
    model_dir=config.model_directory
)

train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_train},
    y=y_train,
    num_epochs=None,
    batch_size=config.batch_size,
    shuffle=True,
)

test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_test},
    y=y_test,
    num_epochs=1,
    shuffle=False
)

tf.estimator.train_and_evaluate(
    classifier,
    tf.estimator.TrainSpec(
        input_fn=train_input_fn
    ),
    tf.estimator.EvalSpec(
        input_fn=test_input_fn,
        steps=None,
        throttle_secs=5
    )
)

image

antonlukyanov, Sep 05 '22 17:09

Hi @antonlukyanov,

Thanks for the code, we are now able to reproduce the issue. We will let you know once it is resolved.

erezalg, Sep 06 '22 08:09