
Bug: input_fn() creates a new dataset on every call, leading to model overfitting

Open dagarcia-nvidia opened this issue 7 years ago • 3 comments

Our team has found what appears to be a major bug that causes models to be trained on only a small subset of the full dataset specified by the user. This often leads to models that overfit.

The issue is caused by the use of TensorFlow's continuous_train_and_eval() function combined with how input_fn() is implemented in tf-seq2seq. The smaller the value of FLAGS.eval_every_n_steps, the faster overfitting occurs.

This is how it happens:

https://github.com/google/seq2seq/blob/master/bin/train.py#L269 function main() calls learn_runner.run(), which goes through a chain of TensorFlow code and eventually reaches…

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/experiment.py#L506 function continuous_train_and_eval(), which interleaves training and evaluation. The training portion eventually goes into…

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/experiment.py#L647 function _train_model(), which calls function input_fn(), which goes into…

https://github.com/google/seq2seq/blob/master/seq2seq/training/utils.py#L255 function input_fn() calls pipeline.make_data_provider(). In the case of this model, that goes into…

https://github.com/google/seq2seq/blob/master/seq2seq/data/input_pipeline.py#L145 which creates a brand new tf.contrib.slim.dataset.Dataset().

This causes the effective size of the dataset to be as small as FLAGS.eval_every_n_steps, as the new Dataset() always starts reading from the beginning.

The problem is exacerbated when the input pipeline is set up with shuffle=False.
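To make the control flow concrete, here is a rough sketch of what the interleaving amounts to (illustrative pseudocode written against the newer tf.estimator API, not the actual contrib.learn source; all names are placeholders):

```python
# Illustrative sketch of what continuous_train_and_eval() effectively does.
# Each training phase rebuilds the graph and calls input_fn() again, which
# constructs a fresh Dataset() that starts reading from the top of the files.
def continuous_train_and_eval_sketch(estimator, train_input_fn, eval_input_fn,
                                     total_steps, eval_every_n_steps):
    for _ in range(total_steps // eval_every_n_steps):
        # _call_train() -> _train_model() -> input_fn() -> new Dataset()
        estimator.train(input_fn=train_input_fn, steps=eval_every_n_steps)
        estimator.evaluate(input_fn=eval_input_fn)
    # With shuffle=False, every training phase therefore consumes only the
    # first ~eval_every_n_steps batches of the corpus, over and over.
```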

We reproduced this problem by training a small_nmt model with 8 layers, but it should affect more or less any model trained with tf-seq2seq. You can reproduce the issue by setting FLAGS.eval_every_n_steps to 1K in one instance and 10K in another. Comparing the test losses makes it clear that the smaller FLAGS.eval_every_n_steps is, the faster the model overfits.

It seems like the problem could be avoided if input_fn() were memoized.
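For illustration, the kind of memoization we mean might look like this sketch (memoize_input_fn is a hypothetical helper, not part of tf-seq2seq):

```python
# Hypothetical sketch: repeated calls return the same (features, labels)
# tensors instead of rebuilding the input pipeline from scratch. (As the
# follow-up comment notes, this alone turns out not to be enough, because
# each training phase also builds a brand new graph.)
def memoize_input_fn(input_fn):
    cache = {}
    def cached_input_fn():
        if "result" not in cache:
            cache["result"] = input_fn()
        return cache["result"]
    return cached_input_fn
```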

Thanks for looking into this.

dagarcia-nvidia avatar Jun 21 '17 18:06 dagarcia-nvidia

We currently think that memoizing input_fn() won't be enough, because every time continuous_train_and_eval() switches and calls _call_train(), it eventually goes into _train_model(), which creates a brand new graph.
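The graph issue can be seen in isolation with a toy TF 1.x snippet (illustrative, not code from seq2seq): tensors cached from one graph cannot be combined with tensors from a fresh one.

```python
import tensorflow as tf  # assumes TF 1.x graph mode

g1 = tf.Graph()
with g1.as_default():
    cached = tf.constant(1.0)   # tensor built during the first train phase

g2 = tf.Graph()                 # _train_model() builds a fresh graph next phase
with g2.as_default():
    fresh = tf.constant(2.0)
    try:
        tf.add(cached, fresh)   # mixing tensors from two different graphs
    except ValueError as err:
        print(err)              # "... must be from the same graph as ..."
```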

We have filed a bug against TensorFlow itself for this issue, as we are not 100% certain whether the problem could have been avoided through cleverer client code here.

dagarcia-nvidia avatar Jun 23 '17 14:06 dagarcia-nvidia

@dagarcia-nvidia I don't think continuous_train_and_eval is a good way to train a model faster. What I do now is train the model with the GPU only, and run a separate process in parallel that evaluates the model on the CPU.
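Roughly, the eval side looks like this sketch (the evaluate_forever helper and the polling interval are illustrative assumptions, written against the plain tf.estimator API rather than tf-seq2seq's scripts):

```python
import os
import time

import tensorflow as tf

# Run this in a second process with the GPU hidden, so evaluation happens
# on the CPU while the training process keeps the GPU busy:
os.environ["CUDA_VISIBLE_DEVICES"] = ""

def evaluate_forever(estimator, eval_input_fn, model_dir, poll_secs=300):
    """Evaluate every new checkpoint that the training process writes."""
    last_ckpt = None
    while True:
        ckpt = tf.train.latest_checkpoint(model_dir)
        if ckpt is not None and ckpt != last_ckpt:
            metrics = estimator.evaluate(input_fn=eval_input_fn,
                                         checkpoint_path=ckpt)
            print(ckpt, metrics)
            last_ckpt = ckpt
        time.sleep(poll_secs)
```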

liyi193328 avatar Jul 03 '17 13:07 liyi193328

> @dagarcia-nvidia I don't think continuous_train_and_eval is a good way to train a model faster. What I do now is train the model with the GPU only, and run a separate process in parallel that evaluates the model on the CPU.

The evaluate function and the continuous_eval function in Experiment don't seem to work. Could you please tell me how you evaluate the model?

ghtwht avatar Apr 23 '19 03:04 ghtwht