Support for tf.distribute.Strategy
AdaNet doesn't currently support tf.distribute.Strategy. The current way to define distributed training is to use a tf.estimator.RunConfig with the TF_CONFIG environment variable set to identify the different workers.
Refs #76
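For concreteness, a minimal sketch of the current setup (the cluster addresses and model_dir below are illustrative placeholders, not values from this issue):

```python
import json
import os

import tensorflow as tf

# Illustrative cluster layout; addresses are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["localhost:2222"],
        "worker": ["localhost:2223", "localhost:2224"],
        "ps": ["localhost:2225"],
    },
    "task": {"type": "worker", "index": 0},
})

# The RunConfig reads TF_CONFIG from the environment; no
# tf.distribute.Strategy is involved.
config = tf.estimator.RunConfig(model_dir="/tmp/model")
# ...then pass `config=config` when constructing the adanet.Estimator.
```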
I am new to this source code. I want to contribute to a few open-source AI and ML projects to gain experience. Can I work on this issue? Can you suggest what needs to be done? I went through the code and there are certain TODOs in the comments in placement.py; if given permission and some guidance, can I work on those?
@cweill Can you give me a rough idea of how to make AdaNet support tf.distribute.Strategy? I have good experience with TensorFlow, but the source code is quite large to search through, so some pointers would help me make a quick start.
@chandramoulirajagopalan: The best way to get started will be to first extend estimator_distributed_test_runner.py to test your implementation. You can then pass the tf.distribute.Strategy you want to test to the tf.estimator.RunConfig when constructing the AdaNet Estimator. If it works, then great! If it doesn't, feel free to post your update here, and we'll work through it together.
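For example, a hedged sketch of what that might look like (`head` and `subnetwork_generator` stand in for whatever the test runner builds; they are not defined here):

```python
import adanet
import tensorflow as tf

# Strategy under test; swap in whichever one you want to exercise.
strategy = tf.distribute.MirroredStrategy()

# Attach the strategy to the RunConfig used for training.
run_config = tf.estimator.RunConfig(train_distribute=strategy)

estimator = adanet.Estimator(
    head=head,                                  # placeholder
    subnetwork_generator=subnetwork_generator,  # placeholder
    max_iteration_steps=100,
    config=run_config)
```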
@cweill Yes, I will work on that file first to test my implementation on the estimator, similar to issue #54, where tf.distribute.MirroredStrategy was used.
@chandramoulirajagopalan: Just a heads up: I believe tf.distribute.MirroredStrategy is designed for multi-GPU training, so it may be difficult to test. But if you get it to run inside estimator_distributed_test_runner.py, then good work. Let us know if you have any questions.
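If your machine has no GPUs, one possible workaround (a sketch, untested here) is to pin MirroredStrategy to an explicit CPU device so it still builds its replica graph:

```python
import tensorflow as tf

# Pin the strategy to a CPU device for a smoke test on a GPU-less box;
# with a single device it degenerates to one replica.
strategy = tf.distribute.MirroredStrategy(devices=["/cpu:0"])
```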
FAIL: test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps (adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest)
test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps(estimator='estimator_with_distributed_mirrored_strategy', placement_strategy='replication', num_workers=5, num_ps=3)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/chandramouli/.local/lib/python3.6/site-packages/absl/testing/parameterized.py", line 262, in bound_param_test
    test_method(self, **testcase_params)
  File "adanet/core/estimator_distributed_test.py", line 325, in test_distributed_training
    timeout_secs=500)
  File "adanet/core/estimator_distributed_test.py", line 169, in _wait_for_processes
    self.assertEqual(0, ret_code)
AssertionError: 0 != 1
-------------------- >> begin captured logging << --------------------
absl: INFO: Spawning chief_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning worker_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning worker_1 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning worker_2 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning worker_3 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning ps_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning ps_1 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning ps_2 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning evaluator_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: worker_0 finished
absl: INFO: stderr for worker_0 (last 15000 chars): WARNING: Logging before flag parsing goes to stderr.
W0608 00:13:42.569955 140367193364288 report_accessor.py:36] Failed to import report_pb2. ReportMaterializer will not work.
I0608 00:13:42.935579 140367193364288 run_config.py:503] TF_CONFIG environment variable: {'cluster': {'chief': ['localhost:38127'], 'worker': ['localhost:44993', 'localhost:55967', 'localhost:53003', 'localhost:59883'], 'ps': ['localhost:37729', 'localhost:55971', 'localhost:38587']}, 'task': {'type': 'worker', 'index': 0}}
2019-06-08 00:13:42.936921: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-06-08 00:13:42.983584: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1999890000 Hz
2019-06-08 00:13:42.983983: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x20dff50 executing computations on platform Host. Devices:
2019-06-08 00:13:42.984038: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
I0608 00:13:43.000412 140367193364288 cross_device_ops.py:975] Device is available but not used by distribute strategy: /device:XLA_CPU:0
W0608 00:13:43.001587 140367193364288 cross_device_ops.py:983] Not all devices in `tf.distribute.Strategy` are visible to TensorFlow.
I0608 00:13:43.001820 140367193364288 run_config.py:503] TF_CONFIG environment variable: {'cluster': {'chief': ['localhost:38127'], 'worker': ['localhost:44993', 'localhost:55967', 'localhost:53003', 'localhost:59883'], 'ps': ['localhost:37729', 'localhost:55971', 'localhost:38587']}, 'task': {'type': 'worker', 'index': 0}}
I0608 00:13:43.002067 140367193364288 run_config.py:532] Initializing RunConfig with distribution strategies.
I0608 00:13:43.002384 140367193364288 estimator_training.py:176] RunConfig initialized for Distribute Coordinator with INDEPENDENT_WORKER mode
W0608 00:13:43.003975 140367193364288 estimator.py:1760] Using temporary folder as model directory: /tmp/tmpv70w4krt
I0608 00:13:43.004830 140367193364288 estimator.py:201] Using config: {'_model_dir': '/tmp/tmpv70w4krt', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7fa98a429748>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa98a4299b0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_distribute_coordinator_mode': 'independent_worker'}
Traceback (most recent call last):
File "adanet/core/estimator_distributed_test_runner.py", line 350, in <module>
app.run(main)
File "/home/chandramouli/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/home/chandramouli/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "adanet/core/estimator_distributed_test_runner.py", line 346, in main
train_and_evaluate_estimator()
File "adanet/core/estimator_distributed_test_runner.py", line 318, in train_and_evaluate_estimator
classifier.train(input_fn=_input_fn)
File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1180, in _train_model_distributed
hooks)
File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow/python/distribute/estimator_training.py", line 302, in estimator_train
if 'evaluator' in cluster_spec:
TypeError: argument of type 'ClusterSpec' is not iterable
--------------------- >> end captured logging << ---------------------
Good work getting that inside the runner. I'm surprised that the error is coming from so deep inside TensorFlow Estimator. If you create a PR, I can take a look there.
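For reference, the TypeError at the bottom of the trace appears to come from estimator_training.py doing a membership test directly on a tf.train.ClusterSpec, which does not support `in`; converting to a dict first sidesteps it. A small repro sketch (cluster addresses are placeholders):

```python
import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    "chief": ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# Raises TypeError: argument of type 'ClusterSpec' is not iterable,
# matching the failure above:
#   if "evaluator" in cluster_spec: ...

# Works: ClusterSpec.as_dict() returns a plain {job: [addresses]} dict.
if "evaluator" in cluster_spec.as_dict():
    print("cluster has an evaluator job")
```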