Alpa doesn't work with remote Ray cluster
Please describe the bug
Alpa couldn't connect to a remote Ray cluster.
Please describe the expected behavior
alpa.init(cluster="ray") should succeed after connecting to the remote Ray cluster with ray.init("ray://<head_node_host>:10001").
System information and environment
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Linux Ubuntu
- Python version: 3.8.13
- CUDA version: 11.7
- NCCL version: 2.8.4-1+cuda11.1 amd64
- cupy version: 11.5.0
- GPU model and memory: 11G
- Alpa version: v0.2.2
- TensorFlow version:
- JAX version: 0.3.22
To Reproduce
Steps to reproduce the behavior:
- Spin up a Ray cluster on a remote server
- Run this code
import ray
import alpa

# Starting the Ray client. This connects to a remote Ray cluster.
ray.init("ray://<head_node_host>:10001")
alpa.init(cluster="ray")
- See error
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 2137, in __init__
self.head_info = ray_global_node.address_info
AttributeError: 'NoneType' object has no attribute 'address_info'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/api.py", line 52, in init
init_global_cluster(cluster, num_nodes, num_devices_per_node, namespace)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 2320, in init_global_cluster
global_cluster = DeviceCluster(num_nodes, num_devices_per_node)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 2139, in __init__
raise RuntimeError(
RuntimeError: Cannot access ray global node. Did you call ray.init?
Additional information
The runtime error occurs because ray_global_node is None here:
https://github.com/alpa-projects/alpa/blob/97d4524cb83595063a03b2f35f722191a5cef34a/alpa/device_mesh.py#L2135-L2141
Alpa obtains this ray_worker from ray.worker._real_worker, as shown here:
https://github.com/alpa-projects/alpa/blob/8a19d84f8af24a56fc58c517e7551b0df6e7db12/alpa/util.py#L1340-L1366
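For context, the check behind the first permalink is roughly of the following shape; this is a paraphrased sketch reconstructed from the traceback above, not the verbatim Alpa source:
# Paraphrased sketch of the check inside DeviceCluster.__init__ (see the permalink
# for the exact source).
try:
    # ray_global_node is ray_worker._global_node, fetched via Alpa's util helper.
    self.head_info = ray_global_node.address_info
except AttributeError as ae:
    # When the driver connects through the Ray client, ray_global_node is None.
    raise RuntimeError(
        "Cannot access ray global node. Did you call ray.init?") from ae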
I double-checked with some Ray code that submits a job to a remote Ray cluster, and obtained the ray_global_node the same way:
ray_worker = ray.worker._real_worker
ray_global_node = ray_worker._global_node
I found that the code worked despite ray_global_node being None in this case.
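For anyone who wants to reproduce the observation end to end, a minimal sketch looks something like this; <head_node_host> is a placeholder, and _real_worker / _global_node are private Ray attributes that may differ across Ray versions:
import ray

# Connect through the Ray client (gRPC proxy) to the remote cluster.
ray.init("ray://<head_node_host>:10001")

# Mirror how Alpa reaches the global node.
ray_worker = ray.worker._real_worker
ray_global_node = ray_worker._global_node
print(ray_global_node)  # None in Ray client mode, which triggers Alpa's RuntimeError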
Which version of Ray are you using in your remote cluster? Can you check by
import ray
print(ray.__version__)
print(ray.__commit__)
@jiaodong
>>> print(ray.__version__)
2.1.0
>>> print(ray.__commit__)
23f34d948dae8de9b168667ab27e6cf940b3ae85
@zhanyuanucb Connecting to a remote Ray cluster uses the Ray client. The Ray client is a gRPC proxy, so ray.worker.global_worker will be None, which is why you get this error. Ray client mode is very convenient; we could work around it in Alpa, but it would be better to address it inside the Ray client itself.
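If it helps, a quick way to detect Ray client mode from the driver (and warn or bail out early) might look like this; I'm assuming ray.util.client.ray.is_connected() is available in your Ray version, and the address is a placeholder:
import ray
from ray.util.client import ray as ray_client

ray.init("ray://<head_node_host>:10001")

# True when the driver talks to the cluster through the Ray client proxy,
# i.e. this process has no local Ray global node of its own.
if ray_client.is_connected():
    print("Connected via the Ray client; Alpa's DeviceCluster will not find "
          "a local Ray global node in this process.")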
@chaokunyang thx! Just curious, what workaround would you suggest on the Alpa side while people are waiting for the changes in the Ray client?
@zhanyuanucb ah sorry for the late reply as I was on paternity leave. My recommendation from the perspective of a Ray maintainer -- just don't use the Ray client. We don't recommend anyone use it anymore; it's not maintained.
To connect to a remote Ray cluster, please just use Ray Job Submission https://docs.ray.io/en/master/cluster/running-applications/job-submission/index.html -- that's closer to the proper way to connect to and interact with a remote Ray cluster.
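For example, running the same Alpa script through Job Submission could look roughly like this; the address, script name, and runtime_env below are placeholders for illustration:
from ray.job_submission import JobSubmissionClient

# The Jobs API is served by the Ray dashboard (port 8265 by default), not 10001.
client = JobSubmissionClient("http://<head_node_host>:8265")

job_id = client.submit_job(
    # train_alpa.py runs on the cluster itself, so it can simply call
    # ray.init(address="auto") followed by alpa.init(cluster="ray").
    entrypoint="python train_alpa.py",
    runtime_env={"working_dir": "./", "pip": ["alpa"]},
)
print(job_id)
The equivalent CLI is: ray job submit --address http://<head_node_host>:8265 --working-dir . -- python train_alpa.py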