
Tuner intermittently failing

Open · nroberts1 opened this issue 2 years ago · 5 comments

If the bug is related to a specific library below, please raise an issue in the respective repo directly:

  • TensorFlow Data Validation Repo
  • TensorFlow Model Analysis Repo
  • TensorFlow Transform Repo
  • TensorFlow Serving Repo

System information

  • Have I specified the code to reproduce the issue (Yes, No): Yes
  • Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Kubeflow through Vertex
  • TensorFlow version: 2.7
  • TFX Version: 1.6.1
  • Python version: 3.7
  • Python dependencies (from pip freeze output):

Describe the current behavior: The Tuner intermittently fails.

Describe the expected behavior: The Tuner shouldn't fail.

Other info / logs

Error logs from the failing runs:

    Error  Best HyperParameters: {'space': [{'class_name': 'Choice', 'config': {'name': 'learning_rate', 'default': 0.0001, 'conditions': [], 'values': [0.0001, 0.001, 0.01, 0.1, 0.2], 'ordered': True}}], 'values': {'learning_rate': 0.01}}
    Error  Best Hyperparameters are written to gs://..../Tuner_Logistic_Regression_2754698840043945984/best_hyperparameters/best_hyperparameters.txt.
    Error  Terminating chief oracle at PID: 16
    Error  Terminating chief oracle at PID: 16

I was finding that approximately 1 in every 5 runs was failing with the above logs in Vertex. Looking into the issue further, I noticed I had a strange setup in my code:

My Vertex Tuner had num_parallel_trials set to 3, as below:

    return tfx.extensions.google_cloud_ai_platform.Tuner(
        module_file=model_trainer,
        examples=transform.outputs['transformed_examples'],
        transform_graph=transform.outputs['transform_graph'],
        schema=schema,
        train_args=tfx.proto.TrainArgs(num_steps=train_num_steps),
        eval_args=tfx.proto.EvalArgs(num_steps=eval_num_steps),
        tune_args=tfx.proto.TuneArgs(
            # num_parallel_trials=3 means that 3 search loops are
            # running in parallel.
            num_parallel_trials=3),
        custom_config=custom_config).with_id(tuner_id)

But since I was just trying to keep processing time to a minimum while trying out TFX and Vertex, I set my Tuner's max_trials to 2, i.e. less than num_parallel_trials:

    tuner = kt.RandomSearch(
        hypermodel=hypermodel,
        max_trials=2,
        hyperparameters=hyperparams,
        seed=123,
        allow_new_entries=False,
        objective=kt.Objective('val_binary_accuracy', 'max'),
        directory=fn_args.working_dir,
        project_name=project_name)

I've been able to stop the issue by increasing max_trials to 3, but ideally this wouldn't be necessary, or there would at least be a warning or error describing the problem with the setup.
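For anyone hitting the same thing, something like the sanity check below would have caught my setup early. This is just a sketch, not anything TFX provides: the two values are hard-coded here for illustration, whereas in my pipeline max_trials lives in the module file's tuner_fn and num_parallel_trials in TuneArgs.

    # Illustrative sanity check only (not a TFX API): the KerasTuner search
    # budget must be able to keep every parallel worker busy.
    num_parallel_trials = 3  # as passed to tfx.proto.TuneArgs above
    max_trials = 2           # as passed to kt.RandomSearch in the module file

    if max_trials < num_parallel_trials:
        raise ValueError(
            f'max_trials ({max_trials}) < num_parallel_trials '
            f'({num_parallel_trials}): some tuning workers will have no '
            'trials to run and the Tuner component may fail intermittently.')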

nroberts1 · Apr 04 '22 09:04

@nroberts1

Can you share complete standalone code so we can reproduce the issue on our end? Thanks!

pindinagesh · Apr 05 '22 05:04

Unfortunately not. This is part of an application built to automate the building of multiple pipelines; it breaks the end-to-end process down to allow user interaction at different stages, so it isn't something I can put into a notebook to run. I would try reproducing this by taking the TFX demo for Vertex, setting the above parameters, and seeing whether that reproduces the issue. I'll try to get to this around other work commitments and get back to you, thanks.

nroberts1 · Apr 05 '22 08:04

We have, I think, a very similar bug. For us it happens when the search space is smaller than the number of workers you start: one of the workers may not run any trial, so when the oracle signals that the search is finished and that worker is asked for its best hyperparameters, it calls get_best_hyperparameters, but since no trials ran on that worker it breaks here: https://github.com/tensorflow/tfx/blob/25d8477eff374a8ba3e90650bfbf3d3ea03f772e/tfx/components/tuner/executor.py#L57 with a KeyError: 0.
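To illustrate the failure mode, a guard like the one below would make the empty-worker case explicit. This is just a sketch with an assumed helper name, not the actual TFX executor code; only the two KerasTuner calls are real API.

    # Sketch only: a worker that ran zero trials has nothing to report, so
    # skip the best-hyperparameters lookup instead of letting it blow up.
    def maybe_get_best_hyperparameters(tuner):
        if not tuner.oracle.get_best_trials(num_trials=1):
            # Happens when the search space is smaller than the number of
            # parallel workers: this worker never received a trial.
            return None
        return tuner.get_best_hyperparameters(num_trials=1)[0]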

tanguycdls · Apr 08 '22 08:04

To complete the issue, here is an example of code that reproduces the problem. The failure is random and occurs about one time in every 6 pipeline runs with the code presented in this notebook.

Thank you for your help.

SylvainGavoille · May 05 '22 18:05

Hello @pindinagesh, could you take a look? @SylvainGavoille put together a reproducible example that fails about one in every 6 pipeline runs. Can you tell us whether this is an issue on the KerasTuner side or here? Either way, the current Python code fails on the TFX side.

Thanks,

tanguycdls · May 16 '22 08:05