Tuner intermittently failing
If the bug is related to a specific library below, please raise an issue in the respective repo directly:
TensorFlow Data Validation Repo
TensorFlow Model Analysis Repo
System information
- Have I specified the code to reproduce the issue (Yes, No): Yes
- Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Kubeflow through Vertex
- TensorFlow version: 2.7
- TFX Version: 1.6.1
- Python version: 3.7
- Python dependencies (from pip freeze output):
Describe the current behavior: Tuner intermittently failing.
Describe the expected behavior: The tuner shouldn't fail.
Other info / logs:
```
Error Best HyperParameters: {'space': [{'class_name': 'Choice', 'config': {'name': 'learning_rate', 'default': 0.0001, 'conditions': [], 'values': [0.0001, 0.001, 0.01, 0.1, 0.2], 'ordered': True}}], 'values': {'learning_rate': 0.01}}
Error Best Hyperparameters are written to gs://..../Tuner_Logistic_Regression_2754698840043945984/best_hyperparameters/best_hyperparameters.txt.
Error Terminating chief oracle at PID: 16
Error Terminating chief oracle at PID: 16
```
I was finding that approximately 1 in every 5 runs was failing with the above logs in Vertex. Looking into the issue further, I noticed I had a strange setup in my code:
My Vertex Tuner had num_parallel_trials set to 3, as below:
```python
return tfx.extensions.google_cloud_ai_platform.Tuner(
    module_file=model_trainer,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema,
    train_args=tfx.proto.TrainArgs(num_steps=train_num_steps),
    eval_args=tfx.proto.EvalArgs(num_steps=eval_num_steps),
    tune_args=tfx.proto.TuneArgs(
        # num_parallel_trials=3 means that 3 search loops are
        # running in parallel.
        num_parallel_trials=3),
    custom_config=custom_config).with_id(tuner_id)
```
But because I was just trying to keep processing time to a minimum while trying out TFX and Vertex, I had set my tuner's max_trials to 2, i.e. less than num_parallel_trials:
```python
tuner = kt.RandomSearch(
    hypermodel=hypermodel,
    max_trials=2,
    hyperparameters=hyperparams,
    seed=123,
    allow_new_entries=False,
    objective=kt.Objective('val_binary_accuracy', 'max'),
    directory=fn_args.working_dir,
    project_name=project_name)
```
I've been able to stop the issue by increasing max_trials to 3, but ideally this wouldn't be necessary, or there would be some kind of warning/error describing the problem with the setup.
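For illustration, here is a minimal sketch of the kind of check that could surface this setup early. It is not existing TFX or KerasTuner behaviour; the helper name is invented, and the hard-coded values are simply taken from the configuration above.

```python
import warnings

def check_tune_budget(max_trials: int, num_parallel_trials: int) -> None:
    """Warn when the search budget is smaller than the number of parallel workers."""
    if max_trials < num_parallel_trials:
        warnings.warn(
            f"max_trials={max_trials} is less than "
            f"num_parallel_trials={num_parallel_trials}; at least one worker "
            "will complete zero trials, which can make the Tuner component "
            "fail intermittently.")

# Values from the setup above: kt.RandomSearch(max_trials=2, ...) and
# tfx.proto.TuneArgs(num_parallel_trials=3).
check_tune_budget(max_trials=2, num_parallel_trials=3)
```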
@nroberts1
Can you share complete standalone code so we can reproduce the issue from our end? Thanks!
Unfortunately not. This is part of an application built to automate the building of multiple pipelines; it breaks the end-to-end process down to allow user interaction at different stages, rather than being something I can put into a notebook to run. I would try reproducing it by taking the TFX demo for Vertex, setting the above parameters, and seeing if that reproduces the issue. I'll try to get to this around other work commitments and get back to you, thanks.
We have, I think, a very similar bug: for us it happens when the search space is smaller than the number of workers you started. One of the workers may not do anything, so when the oracle says the search is finished and that worker is asked for its best hyperparameters, there have been no trials on that worker and it breaks here: https://github.com/tensorflow/tfx/blob/25d8477eff374a8ba3e90650bfbf3d3ea03f772e/tfx/components/tuner/executor.py#L57 with a KeyError: 0.
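To make the failure mode concrete, here is a hedged sketch of a defensive check around that call. It assumes the public KerasTuner API tuner.get_best_hyperparameters(num_trials=1), which returns an empty list when the worker's oracle has no completed trials; the helper name is invented for this example and is not the actual TFX fix.

```python
def maybe_get_best_hyperparameters(tuner):
    """Return this worker's best HyperParameters, or None if it ran no trials."""
    # On a worker that never received a trial (more workers than trials),
    # this list is empty, and taking element 0 unguarded is what fails in
    # the executor line linked above.
    best = tuner.get_best_hyperparameters(num_trials=1)
    if not best:
        return None
    return best[0]
```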
To complete the issue, here is an example of code that reproduces the problem. The failure is random and occurs roughly one time in 6 pipeline runs with the code presented in this notebook.
Thank you for your help.
Hello @pindinagesh, could you take a look? @SylvainGavoille put together a reproducible example that fails about one in 6 pipeline runs. Can you tell us whether this is an issue on the KerasTuner side or here? The current Python code fails on the TFX side.
Thanks,