Bug: input_fn() creates a new dataset on every call, leading to model overfitting
Our team has found what appears to be a major bug that causes models to be trained with a small subset of the full dataset specified by the user. This often leads to models that overfit.
The issue is caused by the use of TensorFlow's continuous_train_and_eval() function combined with how input_fn() is implemented in tf-seq2seq. The smaller the value of FLAGS.eval_every_n_steps, the faster overfitting occurs.
This is how it happens:
https://github.com/google/seq2seq/blob/master/bin/train.py#L269 function main() calls learn_runner.run(), which goes through a chain of TensorFlow code and eventually reaches…
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/experiment.py#L506 function continuous_train_and_eval(), which interleaves training and evaluation. The training portion eventually goes into…
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/experiment.py#L647 function _train_model(), which calls function input_fn(), which goes into…
https://github.com/google/seq2seq/blob/master/seq2seq/training/utils.py#L255 function input_fn() calls pipeline.make_data_provider(). In the case of this model, that goes into…
https://github.com/google/seq2seq/blob/master/seq2seq/data/input_pipeline.py#L145 which creates a brand new tf.contrib.slim.dataset.Dataset().
This caps the effective size of the dataset at roughly FLAGS.eval_every_n_steps batches, as the new Dataset() always starts reading from the beginning.
The problem is exacerbated when the input pipeline is set up with shuffle=False.
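To make the failure mode concrete, here is a small plain-Python analogy (ours, not the actual TensorFlow or tf-seq2seq code) of what the train/eval alternation effectively does when the data provider is rebuilt on every cycle:

```python
# Plain-Python analogy of the bug (not TensorFlow code): the provider is
# re-created at the start of every train phase, so with shuffling disabled
# only the first eval_every_n_steps batches are ever read.

def make_data_provider(examples):
    # Stands in for input_pipeline.make_data_provider(): a fresh reader
    # that always starts at the beginning of the data.
    return iter(examples)

def train_phase(provider, steps):
    return [next(provider) for _ in range(steps)]

examples = list(range(100))          # pretend the full dataset has 100 examples
eval_every_n_steps = 10

seen = set()
for _ in range(5):                   # five train/eval cycles
    provider = make_data_provider(examples)   # rebuilt each cycle -- the bug
    seen.update(train_phase(provider, eval_every_n_steps))

print(sorted(seen))                  # only examples 0..9 are ever used
```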
We reproduced this problem by training a small_nmt model with 8 layers, but it should affect more or less any model trained with tf-seq2seq. You can reproduce the issue by setting FLAGS.eval_every_n_steps to 1K in one instance and 10K in another. If you look at the test loss, it becomes clear that the smaller FLAGS.eval_every_n_steps is, the faster the model overfits.
It seems like the problem could be avoided if input_fn() were memoized; a rough sketch of what we mean is below.
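This is only a hypothetical sketch, not tf-seq2seq code; create_input_fn stands in here for whatever factory builds input_fn in seq2seq/training/utils.py. The idea is to hand the Experiment the same input_fn object (and, ideally, the same underlying data provider) on every train phase instead of rebuilding the pipeline from scratch:

```python
# Hypothetical memoization sketch -- not part of tf-seq2seq.
_input_fn_cache = {}

def get_or_create_input_fn(pipeline, batch_size):
    key = (id(pipeline), batch_size)
    if key not in _input_fn_cache:
        # create_input_fn is a stand-in name for the factory that builds
        # input_fn in seq2seq/training/utils.py.
        _input_fn_cache[key] = create_input_fn(pipeline, batch_size)
    return _input_fn_cache[key]
```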
Thanks for looking into this.
We currently think that memoizing input_fn() won't be enough, because every time continuous_train_and_eval() switches and calls _call_train(), it eventually goes into _train_model(), which creates a brand-new graph.
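A minimal sketch (ours, assuming TensorFlow 1.x graph mode, not TensorFlow or tf-seq2seq code) of why a Python-level cache is not sufficient: each cycle builds a fresh tf.Graph, so tensors memoized in one cycle belong to a graph that is no longer in use:

```python
import tensorflow as tf  # TensorFlow 1.x

_cache = {}

def memoized_input_fn():
    # Returns the same tensor on every call -- but that tensor lives in
    # whichever graph was the default when it was first created.
    if "features" not in _cache:
        _cache["features"] = tf.constant([0.0, 1.0, 2.0])
    return _cache["features"]

with tf.Graph().as_default() as first_graph:     # first train cycle
    features = memoized_input_fn()
    print(features.graph is first_graph)          # True

with tf.Graph().as_default() as second_graph:    # next cycle: a brand-new graph
    features = memoized_input_fn()
    print(features.graph is second_graph)         # False -- stale tensor from the old graph
```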
We have filed a bug against TensorFlow itself about this, since we are not 100% certain whether the problem could have been avoided with cleverer client code here.
@dagarcia-nvidia I think continuous_train_and_eval is not a good way to train the model faster. What I do now is train the model only on the GPU, and run a separate process in parallel that evaluates the model on the CPU.
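A rough sketch of that setup (ours, not tf-seq2seq code; make_estimator and eval_input_fn are placeholders for your own model and input pipeline): run something like the script below in a second process while training runs on the GPU.

```python
import os
import time
import tensorflow as tf

# Hide GPUs from this process so evaluation stays on the CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

model_dir = "/path/to/model_dir"        # same model_dir the training job writes to
estimator = make_estimator(model_dir)   # placeholder: build the same Estimator as training
last_ckpt = None

while True:
    ckpt = tf.train.latest_checkpoint(model_dir)
    if ckpt is not None and ckpt != last_ckpt:
        # eval_input_fn is a placeholder for your evaluation input pipeline.
        estimator.evaluate(input_fn=eval_input_fn, checkpoint_path=ckpt)
        last_ckpt = ckpt
    time.sleep(60)                      # poll for new checkpoints once a minute
```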
The evaluate function and the continuous_eval function in Experiment don't seem to work; could you please tell me how you evaluate the model?