KubeRay Deployment Failure with Large ServeZip File in Working_Dir
What happened + What you expected to happen
I am using KubeRay with the image ray_ml:2.9.0. I created a Serve zip package of about 92MB and set it as the working_dir in the YAML. After startup, the head node's pod did not fully pull the zip file. In the container's tmp folder I found my zip package, but it was not completely downloaded, so unzipping it produced an empty folder and the deployment failed. When I set the working_dir to a smaller Serve zip, the problem does not occur.
Versions / Dependencies
ray_ml:2.9.0 image ubuntu18.0.4 kuberay
Reproduction script
pass
@kevin85421
cc @GeneDer @fishbone
@USER-HFC Can you share a minimal reproducible example? Also, the logs from the runtime env agent might show why it is failing.
Also, just a suspicion: if your download is known to take longer than 600s, you can set setup_timeout_seconds under config in the runtime env to give it a longer setup time. See https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnvConfig.html#ray-runtime-env-runtimeenvconfig
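For reference, here is a minimal sketch of what such a runtime env could look like as a plain Python dict. The helper name `build_runtime_env`, the example URL, and the 1800-second value are illustrative assumptions, not taken from the reporter's setup; only `working_dir` and `config.setup_timeout_seconds` come from the Ray docs linked above.

```python
from typing import Any, Dict


def build_runtime_env(working_dir: str, setup_timeout_seconds: int = 600) -> Dict[str, Any]:
    """Build a runtime_env dict with a custom setup timeout.

    Hypothetical helper for illustration: it only assembles the dict and
    does not call Ray itself.
    """
    if setup_timeout_seconds <= 0:
        raise ValueError("setup_timeout_seconds must be a positive number of seconds")
    return {
        # Remote zip (or local directory) holding the Serve application code.
        "working_dir": working_dir,
        # Fields under "config" correspond to ray.runtime_env.RuntimeEnvConfig.
        "config": {"setup_timeout_seconds": setup_timeout_seconds},
    }


# Assumed values: a placeholder URL and a 30-minute timeout for a large zip.
runtime_env = build_runtime_env("https://example.com/serve_app.zip", 1800)
print(runtime_env)
```

The same two keys should map directly onto the `runtime_env` section of the Serve config in the KubeRay YAML. In plain Ray the dict would be passed as `ray.init(runtime_env=runtime_env)`.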