
Alpa doesn't work with remote Ray cluster

Open zhanyuanucb opened this issue 1 year ago • 5 comments

Please describe the bug
Alpa cannot connect to a remote Ray cluster.

Please describe the expected behavior

System information and environment

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Linux Ubuntu
  • Python version: 3.8.13
  • CUDA version: 11.7
  • NCCL version: 2.8.4-1+cuda11.1 amd64
  • cupy version: 11.5.0
  • GPU model and memory: 11G
  • Alpa version: v0.2.2
  • TensorFlow version:
  • JAX version: 0.3.22

To Reproduce
Steps to reproduce the behavior:

  1. Spin up a Ray cluster on a remote server
  2. Run this code:

import ray
import alpa

# Start the Ray client. This connects to a remote Ray cluster.
ray.init("ray://<head_node_host>:10001")
alpa.init(cluster="ray")

  3. See error:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 2137, in __init__
    self.head_info = ray_global_node.address_info
AttributeError: 'NoneType' object has no attribute 'address_info'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/api.py", line 52, in init
    init_global_cluster(cluster, num_nodes, num_devices_per_node, namespace)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 2320, in init_global_cluster
    global_cluster = DeviceCluster(num_nodes, num_devices_per_node)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 2139, in __init__
    raise RuntimeError(
RuntimeError: Cannot access ray global node. Did you call ray.init?

Screenshots

Code snippet to reproduce the problem

Additional information
The runtime error occurs because ray_global_node is None here: https://github.com/alpa-projects/alpa/blob/97d4524cb83595063a03b2f35f722191a5cef34a/alpa/device_mesh.py#L2135-L2141
Alpa obtains this ray_worker from ray.worker._real_worker here: https://github.com/alpa-projects/alpa/blob/8a19d84f8af24a56fc58c517e7551b0df6e7db12/alpa/util.py#L1340-L1366
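
For context, here is a rough reconstruction (not the actual Alpa source) of the failing check in DeviceCluster.__init__, inferred only from the traceback above:

# Reconstructed from the traceback; the real code around alpa/device_mesh.py
# lines 2135-2141 may differ.
try:
    self.head_info = ray_global_node.address_info
except AttributeError as e:
    raise RuntimeError(
        "Cannot access ray global node. Did you call ray.init?") from e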

I double-checked with some Ray code that submits a job to a remote Ray cluster, obtaining ray_global_node via:

ray_worker = ray.worker._real_worker
ray_global_node = ray_worker._global_node

I found that the code still worked despite ray_global_node being None in this case.

zhanyuanucb avatar Feb 24 '23 17:02 zhanyuanucb

Which version of Ray are you using in your remote cluster? Can you check by running:

import ray

print(ray.__version__)
print(ray.__commit__)

jiaodong avatar Feb 26 '23 03:02 jiaodong

@jiaodong

>>> print(ray.__version__)
2.1.0
>>> print(ray.__commit__)
23f34d948dae8de9b168667ab27e6cf940b3ae85

zhanyuanucb avatar Feb 27 '23 22:02 zhanyuanucb

@zhanyuanucb Connecting to a remote Ray cluster this way uses the Ray client. The Ray client is a gRPC proxy, so ray.worker.global_worker will be None, which is why you get this error. Ray client mode is very convenient; although we could work around it in Alpa, it would be better to address it inside the Ray client itself.
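
For reference, a minimal sketch (not part of Alpa) of how a script could detect Ray client mode before calling alpa.init, assuming Ray 2.x where ray.util.client.ray.is_connected() reports whether the client proxy is in use:

import ray
import ray.util.client
import alpa

ray.init("ray://<head_node_host>:10001")

# Ray client mode proxies all calls over gRPC, so driver-side internals such as
# ray.worker.global_worker / _global_node are not populated in this process.
if ray.util.client.ray.is_connected():
    raise RuntimeError(
        "Alpa needs to run on the cluster itself; use ray.init(address='auto') "
        "on the head node or submit the script as a Ray job instead.")

alpa.init(cluster="ray")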

chaokunyang avatar Mar 21 '23 16:03 chaokunyang

@chaokunyang Thanks! Just curious, what workaround would you suggest on the Alpa side while people wait for the changes in the Ray client?

zhanyuanucb avatar Mar 21 '23 18:03 zhanyuanucb

@zhanyuanucb Ah, sorry for the late reply, as I was on paternity leave. My recommendation, from the perspective of a Ray maintainer: just don't use the Ray client. We don't recommend anyone use it anymore; it's not maintained.

To connect to a remote Ray cluster, please use Ray Job Submission (https://docs.ray.io/en/master/cluster/running-applications/job-submission/index.html); that's closer to the proper way of connecting to and interacting with a remote Ray cluster.
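
For illustration, a hedged sketch of submitting an Alpa script with the Ray Jobs Python SDK instead of the Ray client; train_alpa.py is a placeholder script that would itself call ray.init(address="auto") and alpa.init(cluster="ray"), since the job entrypoint runs on the cluster:

from ray.job_submission import JobSubmissionClient

# The Ray dashboard / job server on the head node listens on port 8265 by default.
client = JobSubmissionClient("http://<head_node_host>:8265")

job_id = client.submit_job(
    entrypoint="python train_alpa.py",
    runtime_env={"working_dir": "."},  # upload the local directory containing the script
)
print(job_id)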

jiaodong avatar Mar 22 '23 05:03 jiaodong