
[RayClient] Large object transfer failure

Open yuduber opened this issue 2 years ago • 2 comments

What happened + What you expected to happen

We are training large models and use Ray Client to connect to our Ray cluster. We very often need to send a large model over the wire from the Ray Client to the Ray cluster. The transfer behavior is unpredictable, and transfers of large objects fail easily.

Versions / Dependencies

Ray 2.3.0, Python 3.6

Reproduction script

# Start a Ray cluster on remote machines and note its IP address and port.
import ray
ray.init(address='ray://ip:port')

# With a smaller b that gives a 2GB array, there is no issue: passing the
# array to the Ray cluster and completing the calculation succeeds in about
# 2 minutes. With a larger array, such as 8GB, it gets stuck and never succeeds.
b = int(1e9)
# 1,000,000,000 = 1,000,000K = 1,000M = 1G
# The loop below creates an 8GB array.
big_array = []
for i in range(b):
    big_array.append(i)

@ray.remote
def sum_array(x):
    return sum(x)

# This hangs for more than 30 minutes and then fails.
# I expect it to finish in about 10 minutes.
sum_big_array = sum_array.remote(big_array)
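
As a rough check on the sizes quoted in the script, the sketch below estimates the in-memory footprint of a Python list of `n` integers by measuring a small sample and extrapolating. The helper name `estimate_list_bytes` and the sample size are illustrative assumptions, not part of the original report. The 8GB figure presumably counts 8 bytes per element; in CPython each element also carries a separate int object, so the list is likely to be considerably larger in memory. The estimate covers only the in-memory footprint, not the size of the serialized payload that Ray Client sends.

```python
import sys


def estimate_list_bytes(n, sample=100_000):
    """Roughly estimate the CPython memory footprint of list(range(n)).

    Hypothetical helper for illustration: it measures a small sample
    list and extrapolates linearly, so the result is an approximation.
    """
    empty_list_bytes = sys.getsizeof([])
    sample = min(n, sample)
    if sample == 0:
        return empty_list_bytes

    # Sample the tail of the range so the ints are representative of the
    # large values rather than the small integers CPython caches.
    values = list(range(n - sample, n))

    # Cost of the list's pointer slots, per element.
    slot_bytes = (sys.getsizeof(values) - empty_list_bytes) / sample
    # Cost of the int objects the slots point to, per element.
    int_bytes = sum(sys.getsizeof(v) for v in values) / sample

    return int(empty_list_bytes + n * (slot_bytes + int_bytes))


if __name__ == "__main__":
    n = int(1e9)
    gib = estimate_list_bytes(n) / 2**30
    print(f"Estimated footprint of list(range({n})): {gib:.1f} GiB")
```

Running the estimate before building the real list can help decide whether the object is in the size range that transferred successfully.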

Issue Severity

High: It blocks me from completing my task.

yuduber • May 17 '23 18:05

@yuduber I confirmed with @ckw017. There were no Ray Client changes between Ray 1.13 and 2.0 that may affect object transfer.

In the meantime, you can also try Ray Jobs, which is meant for submitting jobs and their dependencies to the cluster. It should be more reliable than Ray Client in your case.
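
A minimal sketch of that suggestion, assuming the cluster's dashboard is reachable over HTTP and that the summing logic has been moved into a script called `sum_array.py` in the current directory. The script name, its `--n` flag, the helper names, and the dashboard URL are hypothetical. The idea is that the job builds the array on the cluster, so nothing large crosses the wire from the client.

```python
def build_job_spec(n=int(1e9)):
    """Return keyword arguments for JobSubmissionClient.submit_job.

    Hypothetical helper: the entrypoint runs an assumed sum_array.py
    script that creates and sums the array on the cluster side.
    """
    return {
        # Shell command executed on the cluster.
        "entrypoint": f"python sum_array.py --n {n}",
        # Upload the local directory (containing sum_array.py) with the job.
        "runtime_env": {"working_dir": "./"},
    }


def submit_sum_job(dashboard_url, n=int(1e9)):
    """Submit the job and return its submission ID."""
    # Imported lazily so this module loads even where Ray is not installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url)
    return client.submit_job(**build_job_spec(n))


if __name__ == "__main__":
    # Assumed address: the Ray dashboard listens on port 8265 by default.
    print(submit_sum_job("http://127.0.0.1:8265"))
```

Only the small script and the job description are uploaded in this setup; whether it fits depends on whether the large object can be created or loaded on the cluster rather than on the client machine.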

raulchen • May 19 '23 00:05

This P2 issue has seen no activity in the past 2 years. It will be closed in 2 weeks as part of ongoing cleanup efforts.

Please comment and remove the pending-cleanup label if you believe this issue should remain open.

Thanks for contributing to Ray!

cszhu • Jun 17 '25 00:06