bilm-tf
Resume ELMo training after crash
Hello,
I'm currently training ELMo on my own data, but sadly the process crashed (a cluster problem, nothing to do with the code). Since I have the checkpoints, I don't want to lose days of training. However, when I tried restart.py
the perplexity jumped way up, and it seems to me that it just started reading the data from the beginning again; if I understood correctly, restart.py
is intended for fine-tuning, not for resuming training after a crash. Then I saw that in bilm/training.py
at line 675, where the train function is defined, one can pass a checkpoint:
```python
def train(options, data, n_gpus, tf_save_dir, tf_log_dir,
          restart_ckpt_file=None):
```
and at line 770 of the same file, the checkpoint appears to be loaded (provided it is passed to the function):
```python
if restart_ckpt_file is not None:
    loader = tf.train.Saver()
    loader.restore(sess, restart_ckpt_file)
```
However, in bin/train_elmo.py
where the train function is called on line 63, the checkpoint file is not specified:
```python
train(options, data, n_gpus, tf_save_dir, tf_log_dir)
```
Can I resume my training by simply passing the checkpoint there at the end? Do I have to do anything else to resume training? Is it even possible to resume training without affecting perplexity?
Thank you in advance.
@pjox Have you found the solution?
It seems we need to fix the code in bin/train_elmo.py to pass an explicit restart_ckpt_file
argument.
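A minimal sketch of what that fix could look like. This assumes a new `--restart_ckpt_file` command-line flag (the flag name is my own choice, not something in the original script); the argparse options mirror the ones bin/train_elmo.py already uses, and the only real change is forwarding the new value to `train()`:

```python
# Hypothetical patch to bin/train_elmo.py: add an optional
# --restart_ckpt_file flag and forward it to train().
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--save_dir', help='Location of checkpoint files')
    parser.add_argument('--vocab_file', help='Vocabulary file')
    parser.add_argument('--train_prefix', help='Prefix for train files')
    # New flag (name is an assumption); defaults to None so the
    # existing behavior is unchanged when it is not given.
    parser.add_argument('--restart_ckpt_file', default=None,
                        help='Checkpoint to resume training from')
    return parser

# In main(), the existing call
#     train(options, data, n_gpus, tf_save_dir, tf_log_dir)
# would then become:
#     train(options, data, n_gpus, tf_save_dir, tf_log_dir,
#           restart_ckpt_file=args.restart_ckpt_file)
```

With `restart_ckpt_file` set, the `tf.train.Saver` branch quoted above restores the weights into the session before training continues; note that the data loader still starts from the beginning of the corpus, so perplexity may wobble briefly even with a correct restore.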