Tuner executor does not follow the new chief-vs-master guideline.
We are trying to run a Tuner component in our Vertex pipeline. With the default executor (https://github.com/tensorflow/tfx/blob/v1.5.0/tfx/extensions/google_cloud_ai_platform/tuner/executor.py) we get the following error:
File "/home/pipelines/custom_components/custom_aip_tuner/executor.py", line 264, in __init__
    cluster_spec['cluster']['master'][0].split(':'))
KeyError: 'master'
The component has been launched with the following kwargs:
{'ai_platform_enable_vertex': True,
'ai_platform_training_args': {'project': PROJECT,
'service_account': ACCOUNT,
'worker_pool_specs': [{'container_spec': {'image_uri': IMAGE_URI},
'machine_spec': {'machine_type': 'e2-standard-16'},
'replica_count': 1}]},
'ai_platform_training_job_id': job_id,
'ai_platform_tuning_args': {'project': PROJECT,
'service_account': ACCOUNT,
'worker_pool_specs': [{'container_spec': {'image_uri': IMAGE_URI},
'machine_spec': {'machine_type': 'e2-standard-16'},
'replica_count': 1}]},
'ai_platform_vertex_region': 'europe-west1'
}
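For completeness, here is a rough sketch of how we pass these kwargs to the component via custom_config (untested as shown here; module_file, the upstream transform channels, and the step/trial counts are placeholders for our real values):

from tfx.extensions.google_cloud_ai_platform.tuner.component import Tuner
from tfx.proto import trainer_pb2, tuner_pb2

vertex_job_spec = {  # same dict as the training/tuning args above
    'project': PROJECT,
    'service_account': ACCOUNT,
    'worker_pool_specs': [{'container_spec': {'image_uri': IMAGE_URI},
                           'machine_spec': {'machine_type': 'e2-standard-16'},
                           'replica_count': 1}],
}

tuner = Tuner(
    module_file=module_file,  # placeholder: module defining tuner_fn
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=100),
    tune_args=tuner_pb2.TuneArgs(num_parallel_trials=1),
    custom_config={
        'ai_platform_enable_vertex': True,
        'ai_platform_vertex_region': 'europe-west1',
        'ai_platform_training_args': vertex_job_spec,
        'ai_platform_tuning_args': vertex_job_spec,
        'ai_platform_training_job_id': job_id,
    },
)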
Note that training and tuning args are aligned.
We think the issue comes from the following guideline: https://cloud.google.com/ai-platform/training/docs/distributed-training-details#chief-versus-master
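To illustrate (the addresses below are made up), the cluster spec of the "chief" form described in that guideline looks roughly like this, whereas the v1.5.0 executor only looks for the legacy 'master' key:

tf_config = {
    'cluster': {
        'chief': ['10.0.0.2:2222'],
        'worker': ['10.0.0.3:2222'],
    },
    'task': {'type': 'chief', 'index': 0},
}
# cluster_spec['cluster']['master'] then raises KeyError: 'master'.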
We have therefore written a custom tuner/executor to solve the issue. Here is our diff against the master branch (note that this is the diff from the current master, not the latest release):
- self._master_addr, self._master_port = (
- # We rely on Cloud AI Platform Training service's specification whereby
- # there will be no more than one master replica.
- # https://cloud.google.com/ai-platform/training/docs/distributed-training-containers#cluster-spec-format
- cluster_spec['cluster']['master'][0].split(':'))
+ # ```
+ # AI Platform Training uses chief in the cluster and task fields of the TF_CONFIG environment
+ # variable if any of the following are true
+ # ```
+ # https://cloud.google.com/ai-platform/training/docs/distributed-training-details#chief-versus-master
+ self._master_addr, self._master_port = (cluster_spec['cluster']['chief'][0].split(':'))
- self._is_chief = cluster_spec['task']['type'] == 'master'
+ self._is_chief = cluster_spec['task']['type'] == 'chief'
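An alternative to the patch above (a rough, untested sketch, not what we shipped) would be to fall back to the legacy key so both cluster-spec formats keep working:

cluster = cluster_spec['cluster']
chief_key = 'chief' if 'chief' in cluster else 'master'
self._master_addr, self._master_port = cluster[chief_key][0].split(':')
self._is_chief = cluster_spec['task']['type'] == chief_key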
A modification was needed on the keras-tuner side as well to handle the chiefness correctly:
+def is_chief_oracle():
+    """Return true if the thread is the chief oracle."""
+    if dist_utils.has_chief_oracle():
+        return os.environ["KERASTUNER_TUNER_ID"] == "chief"
+    return False
+dist_utils.is_chief_oracle = is_chief_oracle
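For context, the patch relies on the environment variables KerasTuner uses for distributed tuning; roughly (the address and port below are illustrative only):

import os

os.environ.setdefault('KERASTUNER_TUNER_ID', 'chief')      # 'tuner0', 'tuner1', ... on workers
os.environ.setdefault('KERASTUNER_ORACLE_IP', '10.0.0.2')  # address of the chief oracle
os.environ.setdefault('KERASTUNER_ORACLE_PORT', '8000')
# With these set, has_chief_oracle() is True and the patched is_chief_oracle()
# returns True only for the process whose tuner id is exactly 'chief'.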
It may very well be that we did not understand the intent of the original contributors regarding how chiefness should be handled in the Tuner. The issue could probably be fixed by other means than the one we took.
Alternatively, could you give us a config example to make the Tuner work in Vertex? We are also interested in a configuration that allows the Tuner to parallelize evaluations.
@1025KB Could you comment?
This should fix it; somehow cluster_spec is set differently when there are multiple workers vs a single worker.
@1025KB Is this issue still open for contribution?
The fix is in and will be included in the next release.
Hello,
Thanks for the fix. We gave it a try and it works nicely in the multi-worker case. However, in the single-worker case we are still hitting the same bug. We are using the following spec for both the single- and multi-worker cases:
'ai_platform_training_args': {
'project': PROJECT,
'service_account': ACCOUNT,
'job_spec': {
'worker_pool_specs': [{
'machine_spec': {
'machine_type': 'e2-standard-16', # 16 vCPU + 64 GB RAM
},
'replica_count': 1,
'container_spec': {
'image_uri': docker_image_full_uri,
},
}]
    }
}
Should we do something different for the single-worker case?
@1025KB Are we aware of any problem with the single worker case for the Tuner?
For the single-worker case, you can use Tuner (which runs in the same worker as the component) instead of CloudTuner (a remote job).
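Roughly something like this (an untested sketch, assuming the standard tfx.components.Tuner is what is meant; module_file and the input channels are placeholders):

from tfx.components import Tuner
from tfx.proto import trainer_pb2

tuner = Tuner(
    module_file=module_file,  # placeholder: module defining tuner_fn
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=100),
)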
@montanier,
Kindly let us know if using Tuner instead of CloudTuner helps in resolving the bug in the single-worker scenario, as mentioned in the above comment.
Thank you!