OpenROAD-flow-scripts
AutoTuner failing in distributed mode run
Describe the bug
I have tried the AutoTuner feature locally in single-machine mode and it works fine.
With the recent update, I tried to run AutoTuner in distributed mode using the following command:
python3.7 distributed.py --design fuserisc_v1 --platform sky130hd --config ../designs/sky130hd/fuserisc_v1/autotuner.json --jobs 2000 --server localhost tune --samples 200
But the flow failed to complete:
Log:
(run pid=825) ... 180 more trials not shown (180 TERMINATED)
(run pid=825)
(run pid=825)
Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`
2022-05-03 13:13:37,973 WARNING dataclient.py:221 -- Encountered connection issues in the data channel. Attempting to reconnect.
2022-05-03 13:14:08,189 WARNING dataclient.py:226 -- Failed to reconnect the data channel
Traceback (most recent call last):
File "distributed.py", line 947, in <module>
analysis = tune.run(TrainClass, **tune_args)
File "/home/vijayan/.local/lib/python3.7/site-packages/ray/tune/tune.py", line 363, in run
while ray.wait([remote_future], timeout=0.2)[1]:
File "/home/vijayan/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
File "/home/vijayan/.local/lib/python3.7/site-packages/ray/util/client/api.py", line 61, in wait
return self.worker.wait(*args, **kwargs)
File "/home/vijayan/.local/lib/python3.7/site-packages/ray/util/client/worker.py", line 435, in wait
resp = self._call_stub("WaitObject", req, metadata=self.metadata)
File "/home/vijayan/.local/lib/python3.7/site-packages/ray/util/client/worker.py", line 291, in _call_stub
raise ConnectionError("Client is shutting down.")
ConnectionError: Client is shutting down.
Expected behavior
The flow should complete successfully in distributed mode.
@dralabeing FYI
@vijayank88 Is this still an issue? If so, is it possible to share the necessary files for reproduction?
Edit: After trying it out, it appears that the issue is the --server localhost argument. Ray only needs us to supply the --server and --port switches when we are connecting to a Ray Cluster [1].
Correct usage:
python3 distributed.py --design fuserisc_v1 --platform sky130hd --config ../designs/sky130hd/fuserisc_v1/autotuner.json --jobs 2000 tune --samples 200
[1] https://docs.ray.io/en/latest/cluster/key-concepts.html#ray-cluster
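For reference, a minimal sketch of how the two invocation modes map onto Ray's initialization (the head-node address and client-server port below are placeholders, not values from this issue): without --server, Ray is started in-process on the local machine; with --server and --port, the script connects as a Ray client to an already running cluster head node.

import ray

# Local (single-machine) mode: no --server argument is passed,
# so Ray starts an in-process instance on this machine.
ray.init()

# Distributed (cluster) mode: only valid when a head node is already running,
# e.g. started beforehand with `ray start --head`.
# "head-node-host" and 10001 (Ray's default client server port) are placeholders.
# ray.init("ray://head-node-host:10001")

Passing --server localhost without a head node listening on that address leaves the client with nothing to reconnect to, which matches the "Client is shutting down." error in the traceback above.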