
KubeRay Deployment Failure with Large ServeZip File in Working_Dir

Open USER-HFC opened this issue 1 year ago • 3 comments

What happened + What you expected to happen

I am using KubeRay with the image ray_ml:2.9.0. I created a Serve zip package of about 92MB and set it as the working_dir in the YAML. After startup, the head node's pod did not fully pull the zip file. In the container's tmp folder I found my zip package, but it was not completely downloaded, so unzipping it produced an empty folder and the deployment failed. When I point working_dir at a smaller Serve zip, the problem does not occur.
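To confirm that the file in the tmp folder really is truncated, a check along these lines can be run inside the head pod. This is only a sketch using the Python standard library; `check_zip` is a hypothetical helper name, and the path you pass it is whatever location the partial zip was found at.

```python
import os
import zipfile


def check_zip(path):
    """Return (ok, detail) describing whether the archive at `path` looks intact."""
    if not os.path.exists(path):
        return False, "file not found"
    size = os.path.getsize(path)
    # A truncated download usually loses the end-of-central-directory record,
    # so is_zipfile() returns False for it.
    if not zipfile.is_zipfile(path):
        return False, f"not a valid zip ({size} bytes on disk); possibly truncated"
    with zipfile.ZipFile(path) as zf:
        # testzip() reads every member and returns the first corrupt name, or None.
        bad = zf.testzip()
        if bad is not None:
            return False, f"corrupt member: {bad}"
        return True, f"{len(zf.namelist())} entries, {size} bytes on disk"
```

Comparing the reported size on disk with the size of the original 92MB package shows how much of the download actually completed.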

Versions / Dependencies

ray_ml:2.9.0 image, ubuntu18.0.4, KubeRay

Reproduction script

pass

USER-HFC, Apr 10 '24 03:04

@kevin85421

USER-HFC, Apr 10 '24 03:04

cc @GeneDer @fishbone

kevin85421, Apr 11 '24 16:04

@USER-HFC Can you share a minimal reproducible example? Logs from the runtime env agent might also show why it is failing.

Also, just a suspicion: if your download is known to take longer than 600s, you can set config.setup_timeout_seconds in the runtime env to give it a longer setup time. See https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnvConfig.html#ray-runtime-env-runtimeenvconfig
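For illustration, a runtime env with a longer setup timeout might look roughly like the sketch below. The URL and the 1200-second value are placeholders, and `build_runtime_env` is a hypothetical helper, not part of Ray; the `config` / `setup_timeout_seconds` keys follow the RuntimeEnvConfig docs linked above.

```python
def build_runtime_env(working_dir_uri, setup_timeout_seconds=1200):
    """Build a runtime_env dict that allows more time for environment setup."""
    return {
        # Remote zip that Ray downloads and unpacks as the working directory.
        "working_dir": working_dir_uri,
        # Raise the setup timeout (in seconds) for slow downloads.
        "config": {"setup_timeout_seconds": setup_timeout_seconds},
    }


# Placeholder URL; replace with the real location of the Serve zip.
runtime_env = build_runtime_env("https://example.com/serve_app.zip")

# With Ray installed, this dict would be passed along the lines of:
#   ray.init(runtime_env=runtime_env)
```

In a KubeRay setup the same keys would presumably go under the runtime_env section of the Serve config in the YAML, though I have not verified that against this particular deployment.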

GeneDer, Apr 11 '24 16:04