[RayClient] Large object transfer failure
What happened + What you expected to happen
We are training large models and use Ray Client to connect to our Ray cluster. We often need to send a large model over the wire from Ray Client to the Ray cluster. The transfer behavior is unpredictable and fails easily for large objects.
Versions / Dependencies
Ray 2.3.0, Python 3.6
Reproduction script
# Start a Ray cluster on remote machines and note its IP and port.
import ray
ray.init(address='ray://ip:port')
# With a smaller b that yields a 2GB array, there is no issue: passing the
# array to the Ray cluster and completing the calculation succeeds in
# about 2 minutes.
# With a larger array, such as 8GB, it gets stuck and never succeeds.
b = int(1e9)
# 1,000,000,000 = 1,000,000K = 1,000M = 1G
# The loop below creates an 8GB array.
big_array = []
for i in range(b):
    big_array.append(i)
@ray.remote
def sum_array(x):
    return sum(x)
# This call hangs for longer than 30 minutes and then fails.
# I expect it to take about 10 minutes to finish.
sum_big_array = sum_array.remote(big_array)
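For anyone trying to reproduce this, it may help to check how much data the script actually holds and ships. The 8GB figure appears to count one 8-byte list slot per element, but a Python list of ints also stores each int as a separate object, so the real sizes can differ. The sketch below extrapolates from a one-million-element sample. The helper name `estimate_sizes` is hypothetical, and it assumes the standard `pickle` module is roughly representative of the serializer Ray Client uses, which is not confirmed here.

```python
import pickle
import sys


def estimate_sizes(n, sample=1_000_000):
    """Extrapolate in-memory and pickled size of list(range(n)) from a sample.

    Returns (estimated_in_memory_bytes, estimated_pickled_bytes).
    """
    sample = min(sample, n)
    # Sample from the top of the range so the ints are as large as the
    # largest values in the real list.
    chunk = list(range(n - sample, n))
    # List object (pointer slots) plus every int object it points to.
    in_memory = sys.getsizeof(chunk) + sum(sys.getsizeof(v) for v in chunk)
    # Standard pickle is an assumption; Ray's serializer may differ.
    pickled = len(pickle.dumps(chunk, protocol=pickle.HIGHEST_PROTOCOL))
    scale = n / sample
    return int(in_memory * scale), int(pickled * scale)


mem_bytes, wire_bytes = estimate_sizes(int(1e9))
print(f"in memory: ~{mem_bytes / 1e9:.0f} GB, pickled: ~{wire_bytes / 1e9:.0f} GB")
```

The numbers are estimates only, but they give a baseline for judging whether the observed transfer time is plausible for the network link in use.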
Issue Severity
High: It blocks me from completing my task.
@yuduber I confirmed with @ckw017. There were no Ray Client changes between Ray 1.13 and 2.0 that would affect object transfer.
In the meantime, you can also try Ray Jobs, which is meant for submitting jobs and their dependencies to the cluster. It should be more reliable than Ray Client in your case.
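A minimal sketch of the Ray Jobs route, under a few assumptions: the script name `sum_array_job.py` and the helpers `build_job_request` and `submit` are hypothetical, the script is assumed to build the array on the cluster instead of on the client, and the dashboard is assumed to be reachable on its default port 8265. `JobSubmissionClient`, `submit_job`, and `get_job_status` are from the `ray.job_submission` Python SDK.

```python
def build_job_request(script="sum_array_job.py", working_dir="./"):
    """Keyword arguments for JobSubmissionClient.submit_job.

    The script name is hypothetical; it should create the large array
    on the cluster so that nothing large crosses the client connection.
    """
    return {
        "entrypoint": f"python {script}",
        # working_dir is uploaded to the cluster along with the job.
        "runtime_env": {"working_dir": working_dir},
    }


def submit(dashboard_url, request):
    """Submit the job and return (job_id, status at submission time)."""
    # Imported lazily so build_job_request works without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url)
    job_id = client.submit_job(**request)
    return job_id, client.get_job_status(job_id)
```

Calling `submit("http://<head-ip>:8265", build_job_request())` from the client machine would then start the job, where `<head-ip>` stands for the head node address of your cluster.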
This P2 issue has seen no activity in the past 2 years. It will be closed in 2 weeks as part of ongoing cleanup efforts.
Please comment and remove the pending-cleanup label if you believe this issue should remain open.
Thanks for contributing to Ray!