
Provide TF_CONFIG environment variable for distributed TensorFlow

Open damienpontifex opened this issue 6 years ago • 23 comments

The TensorFlow ClusterConfig can parse worker and parameter server settings from a TF_CONFIG environment variable (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L64-L156)

I was trying to pass it via an environment variable in the job configuration file like so:

"{ 'cluster': { 'ps': $AZ_BATCHAI_PS_HOSTS, 'worker': $AZ_BATCHAI_WORKER_HOSTS }, 'task': { 'index': $AZ_BATCHAI_TASK_INDEX, 'type': '' } }"

Which is kind of fine, but falls down for a few cases:

  1. When there are no parameter servers (i.e. single node), the ps hosts should be an empty array, but in this case it's just an empty string.
  2. The host and worker variables are comma separated, while the TF code parses TF_CONFIG as JSON, so they would ideally be JSON arrays inside this string.
  3. The 'task.type' property can be 'master', 'worker' or 'ps', but there doesn't seem to be a corresponding environment variable, so I had to pass the value via command line args.
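To illustrate the three gaps, a launcher script could assemble TF_CONFIG itself from the Batch AI variables. A minimal sketch (`build_tf_config` and the explicit `task_type` argument are illustrative, not part of Batch AI):

```python
import json
import os

def build_tf_config(task_type):
    """Assemble a TF_CONFIG JSON string from the Batch AI variables.

    task_type must still be supplied by the caller (e.g. via a command
    line flag), since no environment variable carries it.
    """
    # Comma-separated host lists; an absent or empty variable becomes [].
    ps_hosts = [h for h in os.environ.get('AZ_BATCHAI_PS_HOSTS', '').split(',') if h]
    worker_hosts = [h for h in os.environ.get('AZ_BATCHAI_WORKER_HOSTS', '').split(',') if h]
    task_index = int(os.environ.get('AZ_BATCHAI_TASK_INDEX', '0'))
    return json.dumps({
        'cluster': {'ps': ps_hosts, 'worker': worker_hosts},
        'task': {'type': task_type, 'index': task_index},
    })
```

This handles case 1 (empty string becomes an empty array), case 2 (comma-separated hosts become JSON arrays) and makes case 3 explicit as a required argument.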

More generally though, providing this configuration via a TF_CONFIG environment variable would significantly lower the bar to getting distributed training working with TensorFlow and Azure Batch. It would also simplify command line arguments: only the appropriate data directories would need to be passed, and the same arguments could be used across master, worker and ps, potentially simplifying the tensorflowSettings property further.

damienpontifex avatar Dec 21 '17 07:12 damienpontifex

Hi Damien, thank you for the feedback! We will figure out how to make use of TF_CONFIG for the tensorflow framework. At first glance, we could just introduce dedicated environment variables for use with TF_CONFIG.

AlexanderYukhanov avatar Dec 21 '17 17:12 AlexanderYukhanov

Great, thanks for the response. It is a JSON serialised dictionary in an environment variable, but would mean distributed training would ‘just work ™️’.

damienpontifex avatar Dec 21 '17 22:12 damienpontifex

Looking at Azure Batch AI environment variables it seems this is now available.

damienpontifex avatar Mar 14 '18 01:03 damienpontifex

Sorry, the functionality is not released yet.

AlexanderYukhanov avatar Mar 14 '18 04:03 AlexanderYukhanov

May I ask if anyone, or @damienpontifex, knows what the env variable for the master host is? I'm encountering 'ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node.' You can find the chief node in the cluster spec referenced here: https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig

wtam avatar Mar 18 '18 15:03 wtam

@wtam my understanding is you must have the task type and index set appropriately for the chief. In the page you linked this is {'cluster': cluster, 'task': {'type': 'chief', 'index': 0}}) where the cluster variable has three keys: chief, ps and worker.

Without seeing your actual code, it seems the minimum requirement is for the cluster to have {'chief': ['host0:2222']}. You can have a look at the logic in RunConfig to see if there's a case in your setup that you have configured wrong.

damienpontifex avatar Mar 19 '18 10:03 damienpontifex

@damienpontifex Thanks so much for the response. Since Batch AI only provides the env vars $AZ_BATCHAI_PS_HOSTS, $AZ_BATCHAI_WORKER_HOSTS and $AZ_BATCHAI_TASK_INDEX, I worked around the chief node issue above by manually reserving the first worker host as the chief node and putting it into the cluster spec. Now I've moved a bit further forward but hit another issue from RunConfig:

ValueError: worker is not a valid task_type in the cluster_spec: <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fdd35049750>

Not sure where it goes wrong? My cluster is 3 nodes: 1 node reserved for the PS and chief, and another 2 nodes for workers. I'd appreciate any comment or suggestion to help me out.

This is the cluster spec for the failed worker:

{"cluster": {"chief": ["10.0.0.4:2223"], "worker_hosts": ["10.0.0.5:2222", "10.0.0.6:2222"], "ps_hosts": ["10.0.0.4:2222"]}, "task": {"index": "1", "type": "worker"}}

wtam avatar Mar 19 '18 12:03 wtam

Stupid mistake I made in the cluster spec naming: RunConfig tries to look up worker, but I had named the key worker_hosts, which is why I got the ValueError. For people playing around with distributed Estimator GPU training on Batch AI, it may be better to wait for official support, since the way I reserved the chief node also requires me to manually decrement $AZ_BATCHAI_TASK_INDEX in the cluster spec for the workers.
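For anyone hitting the same ValueError: RunConfig looks up task.type as a key in the cluster dict, so the keys must be named exactly 'chief', 'worker' and 'ps'. A sketch of the corrected spec from the comment above (hosts copied from it; the task index is written as an int, which TensorFlow appears to expect, rather than the string "1"):

```python
import json

# The failing spec used 'worker_hosts'/'ps_hosts' as cluster keys;
# RunConfig resolves task.type by key, so the keys must match the
# task types exactly.
corrected = {
    'cluster': {
        'chief': ['10.0.0.4:2223'],
        'worker': ['10.0.0.5:2222', '10.0.0.6:2222'],  # was 'worker_hosts'
        'ps': ['10.0.0.4:2222'],                       # was 'ps_hosts'
    },
    'task': {'index': 1, 'type': 'worker'},  # index as int, not "1"
}
tf_config_string = json.dumps(corrected)
```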

wtam avatar Mar 20 '18 11:03 wtam

Hi @damienpontifex, maybe you already know this, but Batch AI now automatically generates the TF_CONFIG env var when running a tensorflow job. Would you please try it out and let us know if it works for you? Thanks!

llidev avatar Apr 17 '18 18:04 llidev

Hi @lliimsft, with nodeCount 1 I'm seeing the automatically generated TF_CONFIG env var as: {'task': {'type': 'master', 'index': 0}, 'cluster': {'ps': [''], 'worker': ['10.0.0.4:2222']}, 'environment': 'cloud'}, which doesn't seem to work in this 1-node scenario?
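One possible workaround for the single-node case, sketched below (`sanitize_tf_config` is an illustrative name, not part of Batch AI): strip the empty host entries before TensorFlow reads TF_CONFIG. Whether the resulting config then behaves correctly still depends on RunConfig's validation.

```python
import json
import os

def sanitize_tf_config():
    """Drop empty host entries from the auto-generated TF_CONFIG.

    On a single-node job the generated config contains 'ps': [''];
    this removes empty host strings, and removes a role entirely if
    no hosts remain for it.
    """
    cfg = json.loads(os.environ['TF_CONFIG'])
    cluster = cfg.get('cluster', {})
    for role in list(cluster):
        cluster[role] = [h for h in cluster[role] if h]
        if not cluster[role]:
            del cluster[role]
    os.environ['TF_CONFIG'] = json.dumps(cfg)
```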

yangsiyu007 avatar Apr 26 '18 22:04 yangsiyu007

Getting this error when running with nodeCount=3 in the stderr-ps-0.txt log

"ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node."

For this task the TF_CONFIG variable was:

{'cluster': {'worker': ['10.0.0.4:2223', '10.0.0.5:2222', '10.0.0.6:2222'], 'ps': ['10.0.0.4:2222']}, 'task': {'type': 'ps', 'index': 0}, 'environment': 'cloud'}

The worker logs just had "Warning: Permanently added '[10.0.0.5]:23' (ECDSA) to the list of known hosts."

damienpontifex avatar Apr 27 '18 03:04 damienpontifex

I put the code I'm running here https://github.com/damienpontifex/BatchAIMnist

From the repo, I do:

sh prepare-cluster.sh
sh data-prep.sh
# Wait until data prep done
sh train.sh

damienpontifex avatar Apr 27 '18 03:04 damienpontifex

Looking at the documentation, wondering whether the TF_CONFIG value should be:

On the parameter server: {'cluster': {'chief': ['10.0.0.4:2224'], 'worker': ['10.0.0.4:2223', '10.0.0.5:2222', '10.0.0.6:2222'], 'ps': ['10.0.0.4:2222']}, 'task': {'type': 'ps', 'index': 0}, 'environment': 'cloud'}

On the chief {'cluster': {'chief': ['10.0.0.4:2224'], 'worker': ['10.0.0.4:2223', '10.0.0.5:2222', '10.0.0.6:2222'], 'ps': ['10.0.0.4:2222']}, 'task': {'type': 'chief', 'index': 0}, 'environment': 'cloud'}

I can't seem to find guidance on having all of chief, ps and worker on the same machine, as the docstring https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/estimator/run_config.py#L351-L376 seems to have them all on separate machines.

How can we assist to test and get this working?

damienpontifex avatar Apr 30 '18 02:04 damienpontifex

@lliimsft, @AlexanderYukhanov, can we please get some update on this? :)

yangsiyu007 avatar Apr 30 '18 07:04 yangsiyu007

@damienpontifex @yangsiyu007 The TF_CONFIG environment variable offered by Batch AI is based on TensorFlow Trainer Development Considerations, where the cluster only contains ps/worker, and the task type will be master, worker, or ps. However, according to run_config.py, TensorFlow now accepts more options such as "chief", which is confusing to us (it's not clear how it differs from "master"). We are looking at this.

llidev avatar Apr 30 '18 08:04 llidev

Thank you @lliimsft @yangsiyu007. I also wasn't aware of the change and thank you for the continued effort to support this.

damienpontifex avatar Apr 30 '18 12:04 damienpontifex

Hello guys, just wondering if Batch AI is generating the new format of TF_CONFIG now?

awan-10 avatar Jul 24 '18 23:07 awan-10

I don't think so - not when I tried it the week before last... @lliimsft updates?

yangsiyu007 avatar Jul 25 '18 02:07 yangsiyu007

@yangsiyu007 @awan-10 This work is still in progress. We will keep you updated in this post.

llidev avatar Jul 26 '18 21:07 llidev

I was looking at what is currently being set and what changes are needed for RunConfig to parse it correctly. My investigation is outlined below; I will also look into updating the TF_CONFIG variable on each machine in code to confirm the change works. @lliimsft could the below help in making the appropriate changes?

To verify which JSON structure worked, I set up:

import os
import tensorflow as tf

# TF_CONFIG_JSON_STRING is one of the JSON strings discussed below
os.environ['TF_CONFIG'] = TF_CONFIG_JSON_STRING

config = tf.estimator.RunConfig()
print('master => {}'.format(config.master))
print('task_id => {}'.format(config.task_id))
print('num_ps_replicas => {}'.format(config.num_ps_replicas))
print('num_worker_replicas => {}'.format(config.num_worker_replicas))
print('cluster_spec => {}'.format(config.cluster_spec))
print('task_type => {}'.format(config.task_type))
print('is_chief => {}'.format(config.is_chief))

Run with a 3-node job configured with 1 parameter server and 3 workers.

Current

Currently in Batch AI we get the following TF_CONFIG environment variable:

In ps-0

{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"ps","index":0},"environment":"cloud"}

wk-0

{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}

wk-1

{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":1},"environment":"cloud"}

wk-2

{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":2},"environment":"cloud"}

With these, the python code above gave the error:

ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node.

Working

To get this working, we apparently need the master worker defined under chief in the cluster. As such, the 'cluster' part of the JSON object would become:

"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]}

Then the task component would be changed for whichever node is launched via masterCommandLineArgs, which would have a task of:

"task":{"type":"chief","index":0}

The other worker nodes would have the same as before, with the index now being 0 or 1, e.g.

"task":{"type":"worker","index":1}

Testing

This sample code parses into the RunConfig correctly, but I haven't tested it on a cluster with an estimator yet to see if it hooks everything up fine:

import os
import json
import tensorflow as tf

def log_config_for(runconfig_string):
  os.environ['TF_CONFIG'] = runconfig_string

  config = tf.estimator.RunConfig()
  print('master => {}'.format(config.master))
  print('task_id => {}'.format(config.task_id))
  print('num_ps_replicas => {}'.format(config.num_ps_replicas))
  print('num_worker_replicas => {}'.format(config.num_worker_replicas))
  print('cluster_spec => {}'.format(config.cluster_spec))
  print('task_type => {}'.format(config.task_type))
  print('is_chief => {}'.format(config.is_chief))
  print()

def main():

  machine_definitions = [ 
    # Machine expected from settings with parameterServerCommandLineArgs
    '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"ps","index":0},"environment":"cloud"}',
    # Machine expected from settings with masterCommandLineArgs
    '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}',
    # Machine expected from settings with workerCommandLineArgs
    '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}',
    # Machine expected from settings with workerCommandLineArgs
    '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":1},"environment":"cloud"}'
  ]

  for definition in machine_definitions:
    log_config_for(definition)

if __name__ == '__main__':
  main()

damienpontifex avatar Sep 21 '18 12:09 damienpontifex

I found a workaround: I was able to manipulate the TF_CONFIG environment variable and get it working. The code is here: https://github.com/damienpontifex/batchai-tfconfig-workaround

The environment variable manipulation was:

import json
import os

def remap_tfconfig(is_master):
  # Promote the first worker to chief and shift the remaining workers down
  tf_config = json.loads(os.environ['TF_CONFIG'])
  master_worker = tf_config['cluster']['worker'][0]
  tf_config['cluster']['worker'] = tf_config['cluster']['worker'][1:]
  tf_config['cluster']['chief'] = [master_worker]
  if is_master:
    tf_config['task']['type'] = 'chief'
    tf_config['task']['index'] = 0
  elif tf_config['task']['type'] == 'worker':
    # Worker indices shift down by one since worker 0 became the chief
    tf_config['task']['index'] -= 1

  os.environ['TF_CONFIG'] = json.dumps(tf_config)

And I pass --master through masterCommandLineArgs, which is received by ArgumentParser via parser.add_argument('--master', action='store_true'). Then just call remap_tfconfig(args.master) after parse_args.
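The flag wiring described above might look like this (a sketch; `parse_is_master` is an illustrative name, not from the repo):

```python
import argparse

def parse_is_master(argv=None):
    """Detect the chief node via a --master flag.

    Only the job's masterCommandLineArgs include --master, so its
    presence identifies the node that should become chief.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--master', action='store_true')
    # parse_known_args tolerates the training script's other arguments
    args, _ = parser.parse_known_args(argv)
    return args.master
```

The result would then be fed to remap_tfconfig from the snippet above before constructing the estimator.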

Hopefully this can help in getting the fix into Batch AI 😄

damienpontifex avatar Sep 22 '18 02:09 damienpontifex

Tried this again today in an Azure ML workspace with 'Machine Learning Compute', following the Parameter Server setup, and got an error:

Run failed: argument of type 'ClusterSpec' is not iterable

Getting TF_CONFIG exactly right still seems to be an issue.

damienpontifex avatar Jun 17 '19 06:06 damienpontifex


I found this description of chief vs. master: https://cloud.google.com/ai-platform/training/docs/distributed-training-details#chief-versus-master

Based on it, 'master' is unsupported in TF2 and should be replaced with 'chief'.
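Following that guidance, a TF2 migration shim might rename the role in a generated TF_CONFIG before TensorFlow reads it (a sketch; `master_to_chief` is an illustrative name):

```python
import json

def master_to_chief(tf_config_str):
    """Rename the legacy 'master' role to 'chief' in a TF_CONFIG string."""
    cfg = json.loads(tf_config_str)
    # Rename the cluster key if present
    if 'master' in cfg.get('cluster', {}):
        cfg['cluster']['chief'] = cfg['cluster'].pop('master')
    # Rename this node's own task type if it was the master
    if cfg.get('task', {}).get('type') == 'master':
        cfg['task']['type'] = 'chief'
    return json.dumps(cfg)
```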

robertlugg avatar Apr 27 '20 16:04 robertlugg