
Tuner intermittently failing

Open · nroberts1 opened this issue 2 years ago · 5 comments

If the bug is related to a specific library below, please raise an issue in the respective repo directly:

  • TensorFlow Data Validation Repo
  • TensorFlow Model Analysis Repo
  • TensorFlow Transform Repo
  • TensorFlow Serving Repo

System information

  • Have I specified the code to reproduce the issue (Yes, No): Yes
  • Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Kubeflow through Vertex
  • TensorFlow version: 2.7
  • TFX Version: 1.6.1
  • Python version: 3.7
  • Python dependencies (from pip freeze output):

Describe the current behavior: The Tuner intermittently fails.

Describe the expected behavior: The Tuner shouldn't fail.

Other info / logs

Error logs from the failing runs:

    Error  Best HyperParameters: {'space': [{'class_name': 'Choice', 'config': {'name': 'learning_rate', 'default': 0.0001, 'conditions': [], 'values': [0.0001, 0.001, 0.01, 0.1, 0.2], 'ordered': True}}], 'values': {'learning_rate': 0.01}}
    Error  Best Hyperparameters are written to gs://..../Tuner_Logistic_Regression_2754698840043945984/best_hyperparameters/best_hyperparameters.txt.
    Error  Terminating chief oracle at PID: 16
    Error  Terminating chief oracle at PID: 16

I was finding that approximately 1 in every 5 runs was failing with the above logs in Vertex. Looking into the issue further, I noticed I had a strange setup in my code:

My Vertex Tuner had num_parallel_trials set to 3, as below:

    return tfx.extensions.google_cloud_ai_platform.Tuner(
        module_file=model_trainer,
        examples=transform.outputs['transformed_examples'],
        transform_graph=transform.outputs['transform_graph'],
        schema=schema,
        train_args=tfx.proto.TrainArgs(num_steps=train_num_steps),
        eval_args=tfx.proto.EvalArgs(num_steps=eval_num_steps),
        tune_args=tfx.proto.TuneArgs(
            # num_parallel_trials=3 means that 3 search loops are
            # running in parallel.
            num_parallel_trials=3),
        custom_config=custom_config).with_id(tuner_id)

But since I was just trying to keep processing time to a minimum while trying out TFX and Vertex, I set my Tuner's max_trials to 2, i.e. less than num_parallel_trials:

    tuner = kt.RandomSearch(
        hypermodel=hypermodel,
        max_trials=2,
        hyperparameters=hyperparams,
        seed=123,
        allow_new_entries=False,
        objective=kt.Objective('val_binary_accuracy', 'max'),
        directory=fn_args.working_dir,
        project_name=project_name)

I've been able to stop the issue by increasing max_trials to 3, but ideally this wouldn't be necessary, or there would at least be a warning or error describing the problem with the setup.
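For anyone hitting the same thing, something like the sanity check below would have caught my setup early. This is just a sketch, not anything TFX provides: the two values are hard-coded here for illustration, whereas in my pipeline max_trials lives in the module file's tuner_fn and num_parallel_trials in TuneArgs.

    # Illustrative sanity check only (not a TFX API): the KerasTuner search
    # budget must be able to keep every parallel worker busy.
    num_parallel_trials = 3  # as passed to tfx.proto.TuneArgs above
    max_trials = 2           # as passed to kt.RandomSearch in the module file

    if max_trials < num_parallel_trials:
        raise ValueError(
            f'max_trials ({max_trials}) < num_parallel_trials '
            f'({num_parallel_trials}): some tuning workers will have no '
            'trials to run and the Tuner component may fail intermittently.')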

nroberts1 · Apr 04 '22 09:04

@nroberts1

Can you share complete standalone code so we can reproduce the issue on our end? Thanks!

pindinagesh · Apr 05 '22 05:04

Unfortunately not. This is part of an application built to automate the building of multiple pipelines; it breaks the end-to-end process down to allow user interaction at different stages, so it isn't something I can put into a notebook to run. I would try reproducing this by taking the TFX demo for Vertex, setting the above parameters, and seeing whether that reproduces the issue. I'll try to get to this around other work commitments and get back to you, thanks.

nroberts1 · Apr 05 '22 08:04

We have, I think, a very similar bug. For us it happens when the search space is smaller than the number of workers you start: one of the workers may not run any trial, so when the oracle signals that the search is finished and that worker is asked for its best hyperparameters, it calls get_best_hyperparameters, but since no trials ran on that worker it breaks here: https://github.com/tensorflow/tfx/blob/25d8477eff374a8ba3e90650bfbf3d3ea03f772e/tfx/components/tuner/executor.py#L57 with a KeyError: 0.
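To illustrate the failure mode, a guard like the one below would make the empty-worker case explicit. This is just a sketch with an assumed helper name, not the actual TFX executor code; only the two KerasTuner calls are real API.

    # Sketch only: a worker that ran zero trials has nothing to report, so
    # skip the best-hyperparameters lookup instead of letting it blow up.
    def maybe_get_best_hyperparameters(tuner):
        if not tuner.oracle.get_best_trials(num_trials=1):
            # Happens when the search space is smaller than the number of
            # parallel workers: this worker never received a trial.
            return None
        return tuner.get_best_hyperparameters(num_trials=1)[0]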

tanguycdls · Apr 08 '22 08:04

To complete the issue, here is an example of code that reproduces the problem. The failure is random and occurs about one time in every 6 pipeline runs with the code presented in this notebook.

Thank you for your help.

SylvainGavoille · May 05 '22 18:05

Hello @pindinagesh, could you take a look? @SylvainGavoille put together a reproducible example that fails about one in every 6 pipeline runs. Can you tell us whether this is an issue on the KerasTuner side or here? Either way, the current Python code fails on the TFX side.

Thanks,

tanguycdls · May 16 '22 08:05