[Feature] Revisit shared memory setting for Ray Cluster
Search before asking
- [X] I have searched the issues and found no similar feature request.
Description
To avoid harming performance, we mount shared memory (/dev/shm) to Ray nodes out of the box:
https://github.com/ray-project/kuberay/blob/ee72afc2125bbebd76ee2834c4ef7599d3b53fcd/ray-operator/controllers/common/pod.go#L395-L397
https://github.com/ray-project/kuberay/blob/ee72afc2125bbebd76ee2834c4ef7599d3b53fcd/ray-operator/controllers/common/pod.go#L450-L462
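For context, the linked code backs /dev/shm with a memory-medium emptyDir sized from the Ray container's memory request. The sketch below is a simplified illustration of that mechanism using the standard Kubernetes corev1 API; the function and volume names are illustrative, not the exact KubeRay code:

```go
package common

import (
	corev1 "k8s.io/api/core/v1"
)

// addSharedMemoryVolume sketches how /dev/shm is backed by a memory-medium
// emptyDir sized from resources.requests.memory of the Ray container.
// (Illustrative only; see the linked pod.go for the real implementation.)
func addSharedMemoryVolume(pod *corev1.Pod, rayContainerIndex int) {
	// Today the size comes from requests.memory; this issue proposes
	// revisiting that choice (e.g. preferring limits.memory).
	shmSize := pod.Spec.Containers[rayContainerIndex].Resources.Requests[corev1.ResourceMemory]

	pod.Spec.Volumes = append(pod.Spec.Volumes, corev1.Volume{
		Name: "shared-mem", // illustrative name
		VolumeSource: corev1.VolumeSource{
			EmptyDir: &corev1.EmptyDirVolumeSource{
				Medium:    corev1.StorageMediumMemory,
				SizeLimit: &shmSize,
			},
		},
	})

	pod.Spec.Containers[rayContainerIndex].VolumeMounts = append(
		pod.Spec.Containers[rayContainerIndex].VolumeMounts,
		corev1.VolumeMount{
			Name:      "shared-mem",
			MountPath: "/dev/shm",
		},
	)
}
```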
However, we currently use resources.requests.memory as the size of /dev/shm. This works well when requests == limits, but users may run into issues when requests != limits. See the case below:
```yaml
rayStartParams:
  ...
  object-store-memory: '10240000000'
resources:
  limits:
    cpu: "1"
    memory: "12G"
  requests:
    cpu: "500m"
    memory: "8G"
```
```
# Start up failures
ValueError: The configured object store size (10.24 GB) exceeds /dev/shm size (8.0 GB).
This will harm performance. Consider deleting files in /dev/shm or increasing its size with --shm-size in Docker.
To ignore this warning, set RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE=1.
```
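In this example, object-store-memory is 10240000000 bytes (~10.24 GB), but /dev/shm is sized from requests.memory (8G), so Ray refuses to start even though limits.memory (12G) would be large enough to hold the object store.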
To help users avoid this issue, let's consider using limits.memory first instead of requests.memory. There is a tradeoff, because we don't want to effectively double the memory allocated to every pod, so let's run some experiments to find the best value. A rough sketch of the idea follows.
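A minimal sketch of the proposed sizing logic, assuming we go with "prefer limits, fall back to requests"; the function name is hypothetical and the right default should come out of the experiments mentioned above:

```go
package common

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// findMemorySizeForSharedMem is a hypothetical helper illustrating the
// proposal: size /dev/shm from limits.memory when it is set, and fall
// back to requests.memory otherwise. Whether the full limit is the right
// default (vs., say, a capped fraction of it) needs experimentation.
func findMemorySizeForSharedMem(container corev1.Container) resource.Quantity {
	if limit, ok := container.Resources.Limits[corev1.ResourceMemory]; ok && !limit.IsZero() {
		return limit
	}
	return container.Resources.Requests[corev1.ResourceMemory]
}
```

With the example above, /dev/shm would then be sized from limits.memory (12G), so the 10.24 GB object store fits; the open question is whether defaulting to the full limit allocates too much memory per pod.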
Use case
Better mitigate this kind of configuration issue.
Related issues
https://github.com/ray-project/ray/pull/14629
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!