OpenROAD-flow-scripts

AutoTuner failing in distributed mode run

vijayank88 opened this issue · 1 comment

Describe the bug: I have tried the AutoTuner feature locally in single-machine mode and it works fine.

With the recent update, I tried to run AutoTuner in distributed mode using the following command:

python3.7 distributed.py --design fuserisc_v1 --platform sky130hd --config ../designs/sky130hd/fuserisc_v1/autotuner.json --jobs 2000 --server localhost tune --samples 200

But the flow failed to complete:

Log:

(run pid=825) ... 180 more trials not shown (180 TERMINATED)
(run pid=825) 
(run pid=825) 
Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`
2022-05-03 13:13:37,973	WARNING dataclient.py:221 -- Encountered connection issues in the data channel. Attempting to reconnect.
2022-05-03 13:14:08,189	WARNING dataclient.py:226 -- Failed to reconnect the data channel
Traceback (most recent call last):
  File "distributed.py", line 947, in <module>
    analysis = tune.run(TrainClass, **tune_args)
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/tune/tune.py", line 363, in run
    while ray.wait([remote_future], timeout=0.2)[1]:
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/util/client/api.py", line 61, in wait
    return self.worker.wait(*args, **kwargs)
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/util/client/worker.py", line 435, in wait
    resp = self._call_stub("WaitObject", req, metadata=self.metadata)
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/util/client/worker.py", line 291, in _call_stub
    raise ConnectionError("Client is shutting down.")
ConnectionError: Client is shutting down.

Expected behavior: The flow should complete successfully in distributed mode.

@dralabeing FYI

vijayank88 commented on May 09 '22, 16:05

@vijayank88 Is this still an issue? If so, is it possible to share the necessary files for reproduction?

Edit: After trying it out, it appears the issue is the --server localhost argument. Ray only needs the --server and --port switches when we are running against a Ray Cluster [1].

Correct usage:

python3 distributed.py --design fuserisc_v1 --platform sky130hd --config ../designs/sky130hd/fuserisc_v1/autotuner.json --jobs 2000 tune --samples 200

[1] https://docs.ray.io/en/latest/cluster/key-concepts.html#ray-cluster
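For completeness, a rough sketch of what a truly distributed run could look like once an actual Ray cluster is up (the head-node address and the port values below are placeholders, not settings confirmed for this issue):

# On the head machine: start a Ray head node.
ray start --head

# On each worker machine: join the cluster, using the address that the
# head node prints when it starts (placeholder shown here).
ray start --address=<head-node-ip>:6379

# Then point AutoTuner at the running cluster via --server/--port
# (10001 is Ray's usual client-server port; adjust to your setup).
python3 distributed.py --design fuserisc_v1 --platform sky130hd \
  --config ../designs/sky130hd/fuserisc_v1/autotuner.json \
  --jobs 2000 --server <head-node-ip> --port 10001 \
  tune --samples 200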

luarss commented on Mar 29 '24, 08:03