Evaluation issue using TPUEstimator
Running into an issue when using the Adanet TPUEstimator. Say, for example, the estimator is configured with max_iteration_steps=500 and we want to evaluate the model's performance during training after every 100 training steps (i.e. steps_per_evaluation=100) for 2 complete Adanet iterations.
To achieve this, estimator.train(max_steps, train_input) followed by estimator.evaluate(eval_input) are run in a loop, incrementing max_steps by steps_per_evaluation at the end of each pass, until max_steps=1000 is reached (i.e. corresponding to 2 complete Adanet iterations).
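For concreteness, the loop described above looks roughly like the sketch below. The estimator and input functions are stand-ins (a stub records the calls so the scheduling logic can run anywhere); with the real setup, `estimator` would be the configured `adanet.TPUEstimator` and `train_input`/`eval_input` the actual input functions.

```python
# Sketch of the train/evaluate loop described above. StubEstimator
# only records calls; it stands in for the configured
# adanet.TPUEstimator so the scheduling itself is runnable here.

class StubEstimator:
    """Records train/evaluate calls in place of adanet.TPUEstimator."""

    def __init__(self):
        self.train_max_steps = []  # max_steps passed to each train() call
        self.eval_calls = 0

    def train(self, input_fn, max_steps):
        self.train_max_steps.append(max_steps)

    def evaluate(self, input_fn):
        self.eval_calls += 1


def train_with_periodic_eval(estimator, train_input, eval_input,
                             total_steps=1000, steps_per_evaluation=100):
    """Train up to total_steps, evaluating every steps_per_evaluation steps."""
    max_steps = steps_per_evaluation
    while max_steps <= total_steps:
        estimator.train(input_fn=train_input, max_steps=max_steps)
        estimator.evaluate(input_fn=eval_input)
        max_steps += steps_per_evaluation


estimator = StubEstimator()
train_with_periodic_eval(estimator, train_input=None, eval_input=None)
print(estimator.train_max_steps)  # [100, 200, ..., 1000]
print(estimator.eval_calls)       # 10
```

Because `train()` takes an absolute `max_steps` (not an increment), each call resumes from the last checkpoint and trains only the additional 100 steps.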
When running in local mode (i.e. use_tpu=False), training proceeds as expected. That is, training proceeds for 2 complete Adanet iterations (i.e. steps 0 to 500 for the first iteration and steps 500 to 1000 for the second iteration, with evaluation every 100 steps). However, when running on CloudTPU (i.e. use_tpu=True), training reaches max_steps=1000 without ever progressing to a second iteration.
On the other hand, a single call of estimator.train(max_steps=1000, train_input) on CloudTPU, without the estimator.evaluate call, results in 2 complete Adanet iterations as expected. This makes me think the issue lies with the evaluation call. What could the issue be? If this is a TPUEstimator-related issue, am I then constrained to the standard Estimator if I want this kind of train-evaluate loop configuration?
@nicholasbreckwoldt: We just released adanet==0.9.0, which includes better TPU and TF 2 support. Please try installing it, and let us know if it resolves your issue.
@cweill Thanks for the update! I am running into a new issue with the upgrade to TF 2.2 and adanet==0.9.0 which has so far prevented me from establishing whether the above evaluation issue has been resolved. I've added a description of this new issue (#157).