Alpa doesn't work with remote Ray cluster

Open zhanyuanucb opened this issue 1 year ago • 5 comments

Please describe the bug

Alpa couldn't connect to a remote Ray cluster.

Please describe the expected behavior

System information and environment

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Linux Ubuntu
  • Python version: 3.8.13
  • CUDA version: 11.7
  • NCCL version: 2.8.4-1+cuda11.1 amd64
  • cupy version: 11.5.0
  • GPU model and memory: 11G
  • Alpa version: v0.2.2
  • TensorFlow version:
  • JAX version: 0.3.22

To Reproduce

Steps to reproduce the behavior:

  1. Spin up a Ray cluster on a remote server
  2. Run this code
import alpa
import ray

# Start the Ray client. This connects to a remote Ray cluster.
ray.init("ray://<head_node_host>:10001")
alpa.init(cluster="ray")
  3. See error
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 2137, in __init__
    self.head_info = ray_global_node.address_info
AttributeError: 'NoneType' object has no attribute 'address_info'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/api.py", line 52, in init
    init_global_cluster(cluster, num_nodes, num_devices_per_node, namespace)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 2320, in init_global_cluster
    global_cluster = DeviceCluster(num_nodes, num_devices_per_node)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 2139, in __init__
    raise RuntimeError(
RuntimeError: Cannot access ray global node. Did you call ray.init?
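
For readers unfamiliar with chained tracebacks, here is a minimal sketch of the pattern the traceback suggests: an attribute lookup on `None` raises `AttributeError`, which is then re-raised as a `RuntimeError`. This is not Alpa's actual code. The function name `read_head_info`, the `SimpleNamespace` stand-ins, and the `node_ip_address` key are hypothetical and only illustrate the control flow without needing Ray installed.

```python
from types import SimpleNamespace


def read_head_info(ray_global_node):
    """Hypothetical stand-in for the failing lookup seen in the traceback."""
    try:
        return ray_global_node.address_info
    except AttributeError as err:
        # Chaining with "from err" produces the "direct cause" message.
        raise RuntimeError(
            "Cannot access ray global node. Did you call ray.init?") from err


# Stand-in for a case where a global node object exists (placeholder data).
local_node = SimpleNamespace(address_info={"node_ip_address": "127.0.0.1"})
print(read_head_info(local_node))

# Stand-in for the case reported here: the global node is None.
try:
    read_head_info(None)
except RuntimeError as err:
    print(type(err.__cause__).__name__, "->", err)
```

The point of the sketch is that the error message asks about `ray.init` even when the only thing missing is the global node object.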

Screenshots If applicable, add screenshots to help explain your problem.

Code snippet to reproduce the problem

Additional information

The runtime error occurs because ray_global_node is None here: https://github.com/alpa-projects/alpa/blob/97d4524cb83595063a03b2f35f722191a5cef34a/alpa/device_mesh.py#L2135-L2141

Alpa obtains this ray_worker from ray.worker._real_worker, as shown here: https://github.com/alpa-projects/alpa/blob/8a19d84f8af24a56fc58c517e7551b0df6e7db12/alpa/util.py#L1340-L1366

I double-checked with some Ray code that submits a job to a remote Ray cluster, and obtained ray_global_node with:

ray_worker = ray.worker._real_worker
ray_global_node = ray_worker._global_node

I found that the code worked even though ray_global_node is None in this case.
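
To make the distinction concrete, here is a rough sketch of a check that separates "Ray was never initialized" from "Ray is initialized but the worker has no global node". The helper `classify_ray_state` and its return labels are hypothetical, not part of Ray or Alpa. The `SimpleNamespace` objects are stand-ins so the snippet runs without a cluster.

```python
from types import SimpleNamespace


def classify_ray_state(is_initialized, worker):
    """Hypothetical helper: report which of three situations we are in."""
    if not is_initialized:
        return "not_initialized"
    # Mirrors the lookup above: ray_worker._global_node
    if getattr(worker, "_global_node", None) is None:
        return "initialized_without_global_node"
    return "initialized_with_global_node"


# Stand-in workers; in a real session the worker would be
# ray.worker._real_worker, as in the snippet above.
worker_without_node = SimpleNamespace(_global_node=None)
worker_with_node = SimpleNamespace(_global_node=object())

print(classify_ray_state(False, worker_without_node))
print(classify_ray_state(True, worker_without_node))
print(classify_ray_state(True, worker_with_node))
```

In a live session the first argument could come from `ray.is_initialized()`. The middle case is the one I observed with the remote cluster, and it is the case that the current error message conflates with a missing `ray.init` call.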

zhanyuanucb, Feb 24 '23 17:02