
[Feature] Revisit shared memory setting for Ray Cluster

Jeffwan opened this issue 3 years ago · 0 comments

Search before asking

  • [X] I searched the issues and found no similar feature requirement.

Description

To avoid harming performance, we mount shared memory (/dev/shm) into Ray nodes out of the box.

https://github.com/ray-project/kuberay/blob/ee72afc2125bbebd76ee2834c4ef7599d3b53fcd/ray-operator/controllers/common/pod.go#L395-L397

https://github.com/ray-project/kuberay/blob/ee72afc2125bbebd76ee2834c4ef7599d3b53fcd/ray-operator/controllers/common/pod.go#L450-L462
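For context, here is a minimal sketch of what such a mount looks like on the pod spec. The function name `addSharedMemoryVolume` and the volume name are illustrative, not necessarily KubeRay's actual identifiers (see the permalinks above for the real code):

```go
package shm

import (
	corev1 "k8s.io/api/core/v1"
)

// addSharedMemoryVolume sketches the current behavior this issue questions:
// mount a memory-backed emptyDir at /dev/shm, sized from the container's
// memory *request*.
func addSharedMemoryVolume(pod *corev1.Pod) {
	shmSize := pod.Spec.Containers[0].Resources.Requests[corev1.ResourceMemory]
	pod.Spec.Volumes = append(pod.Spec.Volumes, corev1.Volume{
		Name: "shared-mem", // illustrative name
		VolumeSource: corev1.VolumeSource{
			EmptyDir: &corev1.EmptyDirVolumeSource{
				Medium:    corev1.StorageMediumMemory, // tmpfs-backed
				SizeLimit: &shmSize,                   // == requests.memory
			},
		},
	})
	pod.Spec.Containers[0].VolumeMounts = append(
		pod.Spec.Containers[0].VolumeMounts,
		corev1.VolumeMount{Name: "shared-mem", MountPath: "/dev/shm"},
	)
}
```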

However, we currently use resources.requests.memory as the size of /dev/shm. This works perfectly when requests == limits, but users may encounter issues when they set requests != limits. See the case below:

```yaml
rayStartParams:
  ...
  object-store-memory: '10240000000'
resources:
  limits:
    cpu: "1"
    memory: "12G"
  requests:
    cpu: "500m"
    memory: "8G"
```

```
# Startup failure:
ValueError: The configured object store size (10.24 GB) exceeds /dev/shm size (8.0 GB).
This will harm performance. Consider deleting files in /dev/shm or increasing its size with --shm-size in Docker.
To ignore this warning, set RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE=1.
```
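Until the default changes, one workaround (an illustrative config, not an official recommendation) is to make the object store fit inside /dev/shm, e.g. by setting requests == limits:

```yaml
rayStartParams:
  ...
  object-store-memory: '10240000000'
resources:
  limits:
    cpu: "1"
    memory: "12G"
  requests:
    cpu: "1"
    memory: "12G"   # requests == limits, so /dev/shm (12G) covers the 10.24 GB object store
```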

To help users overcome this issue more easily, let's consider using limits.memory first instead of requests.memory. There is a tradeoff: we don't want to double the memory allocated to every pod, so let's run some experiments to find the best value.
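For concreteness, a sketch of the proposed sizing rule (`shmSizeFor` is a hypothetical helper, not existing KubeRay code): prefer limits.memory when set, otherwise fall back to requests.memory:

```go
package shm

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// shmSizeFor shows the proposed rule: size /dev/shm from the memory limit
// when one is set, otherwise fall back to the memory request (today's behavior).
func shmSizeFor(c corev1.Container) resource.Quantity {
	if limit, ok := c.Resources.Limits[corev1.ResourceMemory]; ok {
		return limit
	}
	return c.Resources.Requests[corev1.ResourceMemory]
}
```

Note that pages actually written to a Medium: Memory emptyDir count against the container's cgroup memory limit, so sizing /dev/shm at the limit does not by itself reserve extra memory; the double-allocation concern above is about how much of the limit the object store can consume.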

Use case

Better mitigate this configuration issue.

Related issues

https://github.com/ray-project/ray/pull/14629

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Jeffwan · Mar 22 '22 01:03