
Tuner executor does not follow the new chief-vs-master guideline.

Open montanier opened this issue 3 years ago • 7 comments

We are trying to run a tuner component in our vertex pipeline. With the default executor (https://github.com/tensorflow/tfx/blob/v1.5.0/tfx/extensions/google_cloud_ai_platform/tuner/executor.py) we get the following error:

File "/home/pipelines/custom_components/custom_aip_tuner/executor.py", line 264, in __init__
    cluster_spec['cluster']['master'][0].split(':'))
KeyError: 'master'
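
For context, on Vertex AI the TF_CONFIG cluster spec names the primary replica "chief" rather than "master", so the lookup in the stock executor fails. A rough illustration in Python (placeholder addresses, not taken from a real job):

# Illustrative only: roughly what TF_CONFIG contains on a Vertex AI job,
# parsed into a Python dict.
example_cluster_spec = {
    'cluster': {
        'chief': ['10.0.0.1:2222'],   # Vertex provides 'chief'; there is no 'master' key
        'worker': ['10.0.0.2:2222'],  # present only when there are extra workers
    },
    'task': {'type': 'chief', 'index': 0},
}
# The stock executor reads example_cluster_spec['cluster']['master'],
# which raises the KeyError shown above.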

The component has been launched with the following kwargs:

{'ai_platform_enable_vertex': True,
 'ai_platform_training_args': {'project': PROJECT,
                               'service_account': ACCOUNT,
                               'worker_pool_specs': [{'container_spec': {'image_uri': IMAGE_URI},
                                                      'machine_spec': {'machine_type': 'e2-standard-16'},
                                                      'replica_count': 1}]},
 'ai_platform_training_job_id': job_id,
 'ai_platform_tuning_args': {'project': PROJECT,
                               'service_account': ACCOUNT,
                             'worker_pool_specs': [{'container_spec': {'image_uri': IMAGE_URI},
                                                    'machine_spec': {'machine_type': 'e2-standard-16'},
                                                    'replica_count': 1}]},
 'ai_platform_vertex_region': 'europe-west1'
 }

Note that training and tuning args are aligned.

We think the issue comes from the following guideline: https://cloud.google.com/ai-platform/training/docs/distributed-training-details#chief-versus-master. We have therefore written a custom tuner/executor to work around it. Here is our diff against the master branch (note that this is a diff against the current master, not the latest release):

-        self._master_addr, self._master_port = (
-            # We rely on Cloud AI Platform Training service's specification whereby
-            # there will be no more than one master replica.
-            # https://cloud.google.com/ai-platform/training/docs/distributed-training-containers#cluster-spec-format
-            cluster_spec['cluster']['master'][0].split(':'))
+        # ```
+        # AI Platform Training uses chief in the cluster and task fields of the TF_CONFIG environment
+        # variable if any of the following are true
+        # ```
+        # https://cloud.google.com/ai-platform/training/docs/distributed-training-details#chief-versus-master
+        self._master_addr, self._master_port = (cluster_spec['cluster']['chief'][0].split(':'))
-        self._is_chief = cluster_spec['task']['type'] == 'master'
+        self._is_chief = cluster_spec['task']['type'] == 'chief'
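
For anyone hitting the same error, here is a minimal sketch (an illustration, not the upstream fix) of cluster-spec parsing that accepts either naming convention:

def parse_chief(cluster_spec):
    """Return (address, port, is_chief), accepting 'chief' or legacy 'master'.

    Vertex AI and newer AI Platform configurations use 'chief'; legacy
    AI Platform Training configurations use 'master'.
    """
    cluster = cluster_spec['cluster']
    chief_key = 'chief' if 'chief' in cluster else 'master'
    addr, port = cluster[chief_key][0].split(':')
    is_chief = cluster_spec['task']['type'] == chief_key
    return addr, port, is_chief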

A modification was needed on the keras-tuner side as well to handle the chiefness correctly:

+def is_chief_oracle():
+    """Return true if the thread is the chief oracle."""
+    if dist_utils.has_chief_oracle():
+        return os.environ["KERASTUNER_TUNER_ID"] == "chief"
+    return False
+dist_utils.is_chief_oracle = is_chief_oracle
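
For completeness, the same override as a self-contained snippet; the dist_utils import path is an assumption and differs between keras-tuner releases (the older package is named kerastuner):

import os

# Assumed module path; adjust to the keras-tuner version actually installed.
from keras_tuner.distribute import utils as dist_utils

def is_chief_oracle():
    """Return True if this process is the chief oracle."""
    if dist_utils.has_chief_oracle():
        return os.environ["KERASTUNER_TUNER_ID"] == "chief"
    return False

# Apply the override before the tuning loop starts.
dist_utils.is_chief_oracle = is_chief_oracle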

It may very well be that we misunderstood the intent of the original contributors regarding how chiefness should be handled in the Tuner, and the issue could probably be fixed by other means than the one we took.

Alternatively, could you give us a config example that makes the Tuner work on Vertex? We are also interested in a configuration that lets the Tuner evaluate trials in parallel.
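
For reference, the kind of wiring we have in mind, sketched with placeholder names (module_file, the upstream transform component, and the *_args dicts shown above) and assuming the GCP extension's Tuner component is the right entry point:

from tfx import v1 as tfx

# Sketch only: placeholder inputs, custom_config keys as in the kwargs above.
tuner = tfx.extensions.google_cloud_ai_platform.Tuner(
    module_file=module_file,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100),
    # num_parallel_trials is how we would hope to evaluate trials in parallel.
    tune_args=tfx.proto.TuneArgs(num_parallel_trials=3),
    custom_config={
        'ai_platform_enable_vertex': True,
        'ai_platform_vertex_region': 'europe-west1',
        'ai_platform_training_args': vertex_training_args,  # dict as shown above
        'ai_platform_tuning_args': vertex_tuning_args,      # dict as shown above
    },
)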

montanier avatar Jan 17 '22 10:01 montanier

@1025KB Could you comment?

rcrowe-google avatar Jan 27 '22 20:01 rcrowe-google

This should fix it; cluster_spec is set differently when there are multiple workers versus a single worker.

1025KB avatar Jan 27 '22 20:01 1025KB

@1025KB Is this issue still open for contribution?

Aditya-Jha2002 avatar Feb 03 '22 19:02 Aditya-Jha2002

The fix is in and will be included in the next release.

1025KB avatar Feb 03 '22 19:02 1025KB

Hello,

Thanks for the fix. We gave it a try and it works nicely for the multiple worker case. However, in the single worker case, we are still hitting the same bug. We are using the following spec for both single and multiple worker cases:

'ai_platform_training_args': {
        'project': PROJECT,
        'service_account': ACCOUNT,
        'job_spec': {
            'worker_pool_specs': [{
                'machine_spec': {
                    'machine_type': 'e2-standard-16',  # 16 vCPU + 64 GB RAM
                },
                'replica_count': 1,
                'container_spec': {
                    'image_uri': docker_image_full_uri,
                },
            }]
        }
}
Should we do something different for the single-worker case?
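
For comparison, a multi-worker job_spec on Vertex would add a second worker pool, roughly like this (sketch with placeholder values; pool 0 is the single primary/chief replica):

multi_worker_job_spec = {
    'worker_pool_specs': [
        {   # pool 0: primary (chief) replica
            'machine_spec': {'machine_type': 'e2-standard-16'},
            'replica_count': 1,
            'container_spec': {'image_uri': docker_image_full_uri},
        },
        {   # pool 1: additional workers
            'machine_spec': {'machine_type': 'e2-standard-16'},
            'replica_count': 2,
            'container_spec': {'image_uri': docker_image_full_uri},
        },
    ]
}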

montanier avatar Mar 03 '22 15:03 montanier

@1025KB Are we aware of any problem with the single worker case for the Tuner?

rcrowe-google avatar Mar 26 '22 23:03 rcrowe-google

For the single-worker case, you can use Tuner (which runs in the same worker as the component) instead of CloudTuner (which launches a remote job).
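
A sketch of that suggestion (placeholder names), using the standard in-process Tuner component instead of the GCP extension:

from tfx import v1 as tfx

# Sketch only: the standard Tuner runs the search in-process on the single
# worker, so no cluster spec or chief resolution is involved.
tuner = tfx.components.Tuner(
    module_file=module_file,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100),
)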

1025KB avatar Mar 27 '22 04:03 1025KB

@montanier,

Kindly let us know if using Tuner instead of CloudTuner helps resolve the bug in the single-worker scenario, as mentioned in the above comment.

Thank you!

singhniraj08 avatar Nov 08 '22 10:11 singhniraj08