[Feature][Helm] Support multiple Ray worker types in the KubeRay Helm chart
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
KubeRay's RayCluster CRD supports multiple worker groups. However, the Helm chart supports only one worker group. It would be good to expose multiple groups.
Implementation ideas:
- One way to achieve this is to add a `workerGroups` field to the chart values, which would be a mapping from user-specified string names to the configuration that currently lives in the `worker` field.
- Another way is to make `workerGroups` an array.

The Helm docs suggest a map might be preferred over an array if the user uses `--set` to override values in the chart. However, in my experience, a map can be confusing, because the user may interpret the keys as part of the schema, rather than as arbitrary names.

`workerGroups` would either override the current `worker` field, OR the `worker` config would be internally appended to `workerGroups`.
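As a rough sketch of the two shapes (the field names here are illustrative, not an agreed schema):

```yaml
# Option 1: a map keyed by arbitrary, user-chosen group names.
# --set can address entries by name, e.g.:
#   --set workerGroups.gpuGroup.maxReplicas=8
workerGroups:
  gpuGroup:          # the key is a user-defined name, not part of the schema
    maxReplicas: 4
  cpuGroup:
    maxReplicas: 10
---
# Option 2: an array of groups (a real values.yaml would contain only one
# of these two shapes).
# --set must address entries by position, e.g.:
#   --set workerGroups[0].maxReplicas=8
workerGroups:
  - groupName: gpuGroup
    maxReplicas: 4
  - groupName: cpuGroup
    maxReplicas: 10
```

The array form carries an explicit `groupName` field, which avoids the "keys look like schema" confusion at the cost of positional `--set` paths.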
Use case
Easy configurability for mixed workloads (e.g. workloads that require both GPUs and CPUs).
Related issues
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Just to back up what @DmitriGekhtman has already said about this feature: I have found it advantageous to be able to specify pod types, as you can with the Kubernetes Helm chart deployment, and then match these up with underlying node pools. The cluster can then determine the most efficient worker node instance types to use given the resource demands.
For example, given agent pools of 8-, 16-, 32- and 64-CPU instances as building blocks, the cluster can work out which combination of these instance types most closely matches the required CPU resources. It avoids spinning up, say, a 32-CPU node when only 8 CPUs are needed.
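To illustrate the instance-mix reasoning, here is a toy brute-force sketch (illustrative only; this is not KubeRay or Ray autoscaler code, and the real autoscaler uses its own bin-packing logic):

```python
from typing import List, Optional

def best_fit(pool_sizes: List[int], demand: int, max_nodes: int = 8) -> Optional[List[int]]:
    """Pick the combination of node sizes whose total CPU capacity covers
    `demand` with the least waste, breaking ties by fewest nodes."""
    best = None  # (waste, node_count, combo) -- tuple comparison picks the winner

    def search(combo: List[int], total: int) -> None:
        nonlocal best
        if total >= demand:
            candidate = (total - demand, len(combo), tuple(combo))
            if best is None or candidate < best:
                best = candidate
            return
        if len(combo) >= max_nodes:
            return
        for size in pool_sizes:
            search(combo + [size], total + size)

    search([], 0)
    return list(best[2]) if best else None

# 24 CPUs of demand: an 8-CPU plus a 16-CPU node cover it exactly,
# so no 32-CPU node is spun up.
print(best_fit([8, 16, 32, 64], 24))  # → [8, 16]
```

Note that greedily picking the largest node first would spin up a 32-CPU node for a 24-CPU demand (8 CPUs wasted); the exhaustive search finds the exact 8+16 combination instead.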
A Kubernetes-based Helm chart (i.e. non-kuberay) would look like the following:
```yaml
podTypes:
  # The key for each podType is a user-defined string.
  # Since we set headPodType: rayHeadType, the Ray head pod will use the configuration
  # defined in this entry of podTypes:
  rayHeadType:
    # CPU is the number of CPUs used by this pod type.
    # (Used for both requests and limits. Must be an integer, as Ray does not support fractional CPUs.)
    CPU: 3
    # memory is the memory used by this Pod type.
    # (Used for both requests and limits.)
    memory: 3064Mi
    # GPU is the number of NVIDIA GPUs used by this pod type.
    # (Optional, requires GPU nodes with appropriate setup. See https://docs.ray.io/en/master/cluster/kubernetes-gpu.html)
    GPU: 0
    # rayResources is an optional string-int mapping signalling additional resources to Ray.
    # "CPU", "GPU", and "memory" are filled automatically based on the above settings, but can be overridden;
    # For example, rayResources: {"CPU": 0} can be used in the head podType to prevent Ray from scheduling tasks on the head.
    # See https://docs.ray.io/en/master/advanced.html#dynamic-remote-parameters for an example of usage of custom resources in a Ray task.
    rayResources: {"CPU": 0}
    # Optionally, set a node selector for this podType: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
    nodeSelector: {"agentpool": "agentpool"}
    # tolerations for Ray pods of this podType (the head's podType in this case)
    # ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
    # Note that it is often not necessary to manually specify tolerations for GPU
    # usage on managed platforms such as AKS, EKS, and GKE.
    # ref: https://docs.ray.io/en/master/cluster/kubernetes-gpu.html
    tolerations: []
    # - key: "nvidia.com/gpu"
    #   operator: Exists
    #   effect: NoSchedule
  rayWorkerType_default:
    # minWorkers is the minimum number of Ray workers of this pod type to keep running.
    minWorkers: 0
    # maxWorkers is the maximum number of Ray workers of this pod type to which Ray will scale.
    maxWorkers: 4
    # memory is the memory used by this Pod type.
    # (Used for both requests and limits.)
    memory: 56000Mi
    # CPU is the number of CPUs used by this pod type.
    # (Used for both requests and limits. Must be an integer, as Ray does not support fractional CPUs.)
    CPU: 15
    # GPU is the number of NVIDIA GPUs used by this pod type.
    # (Optional, requires GPU nodes with appropriate setup. See https://docs.ray.io/en/master/cluster/kubernetes-gpu.html)
    GPU: 0
    # rayResources is an optional string-int mapping signalling additional resources to Ray.
    # "CPU", "GPU", and "memory" are filled automatically based on the above settings, but can be overridden;
    # For example, rayResources: {"CPU": 0} can be used in the head podType to prevent Ray from scheduling tasks on the head.
    # See https://docs.ray.io/en/master/advanced.html#dynamic-remote-parameters for an example of usage of custom resources in a Ray task.
    rayResources: {} # {"PODTYPE_VLARGE": 15}
    # Optionally, set a node selector for this Pod type. See https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
    nodeSelector: {"agentpool": "raypool"}
    # tolerations for Ray pods of this podType
    # ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
    # Note that it is often not necessary to manually specify tolerations for GPU
    # usage on managed platforms such as AKS, EKS, and GKE.
    # ref: https://docs.ray.io/en/master/cluster/kubernetes-gpu.html
    tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
    # - key: nvidia.com/gpu
    #   operator: Exists
    #   effect: NoSchedule
  rayWorkerType_large:
    # minWorkers is the minimum number of Ray workers of this pod type to keep running.
    minWorkers: 0
    # maxWorkers is the maximum number of Ray workers of this pod type to which Ray will scale.
    maxWorkers: 4
    # memory is the memory used by this Pod type.
    # (Used for both requests and limits.)
    memory: 110000Mi
    # CPU is the number of CPUs used by this pod type.
    # (Used for both requests and limits. Must be an integer, as Ray does not support fractional CPUs.)
    CPU: 31
    # GPU is the number of NVIDIA GPUs used by this pod type.
    # (Optional, requires GPU nodes with appropriate setup. See https://docs.ray.io/en/master/cluster/kubernetes-gpu.html)
    GPU: 0
    # rayResources is an optional string-int mapping signalling additional resources to Ray.
    # "CPU", "GPU", and "memory" are filled automatically based on the above settings, but can be overridden;
    # For example, rayResources: {"CPU": 0} can be used in the head podType to prevent Ray from scheduling tasks on the head.
    # See https://docs.ray.io/en/master/advanced.html#dynamic-remote-parameters for an example of usage of custom resources in a Ray task.
    rayResources: {} # {"PODTYPE_VLARGE": 15}
    # Optionally, set a node selector for this Pod type. See https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
    nodeSelector: {"agentpool": "raypoollarge"}
    # tolerations for Ray pods of this podType
    # ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
    # Note that it is often not necessary to manually specify tolerations for GPU
    # usage on managed platforms such as AKS, EKS, and GKE.
    # ref: https://docs.ray.io/en/master/cluster/kubernetes-gpu.html
    tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
    # - key: nvidia.com/gpu
    #   operator: Exists
    #   effect: NoSchedule
  rayWorkerType_small:
    # minWorkers is the minimum number of Ray workers of this pod type to keep running.
    minWorkers: 0
    # maxWorkers is the maximum number of Ray workers of this pod type to which Ray will scale.
    maxWorkers: 8
    # memory is the memory used by this Pod type.
    # (Used for both requests and limits.)
    memory: 28000Mi
    # CPU is the number of CPUs used by this pod type.
    # (Used for both requests and limits. Must be an integer, as Ray does not support fractional CPUs.)
    CPU: 7
    # GPU is the number of NVIDIA GPUs used by this pod type.
    # (Optional, requires GPU nodes with appropriate setup. See https://docs.ray.io/en/master/cluster/kubernetes-gpu.html)
    GPU: 0
    # rayResources is an optional string-int mapping signalling additional resources to Ray.
    # "CPU", "GPU", and "memory" are filled automatically based on the above settings, but can be overridden;
    # For example, rayResources: {"CPU": 0} can be used in the head podType to prevent Ray from scheduling tasks on the head.
    # See https://docs.ray.io/en/master/advanced.html#dynamic-remote-parameters for an example of usage of custom resources in a Ray task.
    rayResources: {} # {"PODTYPE_VLARGE": 15}
    # Optionally, set a node selector for this Pod type. See https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
    nodeSelector: {"agentpool": "raypoolsmall"}
    # tolerations for Ray pods of this podType
    # ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
    # Note that it is often not necessary to manually specify tolerations for GPU
    # usage on managed platforms such as AKS, EKS, and GKE.
    # ref: https://docs.ray.io/en/master/cluster/kubernetes-gpu.html
    tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
    # - key: nvidia.com/gpu
    #   operator: Exists
    #   effect: NoSchedule
```
> A Kubernetes based helm chart (i.e. non kuberay)
KubeRay is also Kubernetes-based :) Just a different implementation targeting similar problems.
> agent pools of 8, 16, 32 and 64 cpus instances

That makes sense from a cost perspective! On the other hand, you may get better performance running a few big Ray nodes vs. many small ones -- with a few big nodes, there's less network overhead and fewer local Ray control structures running. The goal is to find the right balance :)
> KubeRay is also Kubernetes-based :) Just a different implementation targeting similar problems.
@DmitriGekhtman you are indeed most correct sir; it was a poor description on my part. I was attempting to differentiate from the kuberay implementation. :-)
You make some good points on the size of nodes!
@DmitriGekhtman was wondering if something like this would work:
Modify the `workerGroupSpecs` section (as you suggested) of the `raycluster-cluster.yaml` template of the kuberay `ray-cluster` Helm chart to accept a nested structure. The section in `values.yaml` will be named `workers`, and each entry under it defines one worker group.
The template is as follows:
```yaml
workerGroupSpecs:
{{- range $key, $val := .Values.workers }}
- groupName: {{ $val.groupName }}
  rayStartParams:
  {{- range $key, $val := $val.initArgs }}
    {{ $key }}: {{ $val | quote }}
  {{- end }}
  replicas: {{ $val.replicas }}
  minReplicas: {{ $val.miniReplicas | default 1 }}
  maxReplicas: {{ $val.maxiReplicas | default 2147483647 }}
  template:
    spec:
      imagePullSecrets: {{- toYaml $.Values.imagePullSecrets | nindent 10 }}
      initContainers:
      - name: init-myservice
        image: busybox:1.28
        command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
      containers:
      - volumeMounts: {{- toYaml $val.volumeMounts | nindent 12 }}
        name: ray-worker
        image: {{ $.Values.image.repository }}:{{ $.Values.image.tag }}
        imagePullPolicy: {{ $.Values.image.pullPolicy }}
        resources: {{- toYaml $val.resources | nindent 14 }}
        env:
        - name: TYPE
          value: worker
        {{- toYaml $val.containerEnv | nindent 14 }}
        {{- with $val.envFrom }}
        envFrom: {{- toYaml . | nindent 14 }}
        {{- end }}
        ports: {{- toYaml $val.ports | nindent 14 }}
      volumes: {{- toYaml $val.volumes | nindent 10 }}
      affinity: {{- toYaml $val.affinity | nindent 10 }}
      tolerations: {{- toYaml $val.tolerations | nindent 10 }}
      nodeSelector: {{- toYaml $val.nodeSelector | nindent 10 }}
    metadata:
      annotations: {{- toYaml $val.annotations | nindent 10 }}
      labels:
        rayNodeType: {{ $val.type }}
        groupName: {{ $val.groupName }}
        rayCluster: {{ include "ray-cluster.fullname" $ }}
{{- end }}
```

(Note the `$.Values` and `$` references: inside `range`, the dot is rebound to the current element, so the chart root has to be reached through `$`.)
An example of the `values.yaml` of the Helm chart would be as follows below.
Here I have modified the resources section to match the number of CPUs on the instance, where I interpret `resources.limits.cpu` to be the total number of CPU units available to that worker type, and `resources.requests.cpu` to be the minimum CPU increment allowed to be requested. Did I get that right?
To match a worker type specification to the correct compute instance type, the `nodeSelector` field is used to select which node pool this type of worker may be deployed to.
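To spell out that pairing (assumption: on AKS, each node is automatically labelled with its node pool's name under the `agentpool` key; other platforms use different labels):

```yaml
# Worker pods with this selector only schedule onto nodes whose labels
# include agentpool=raypoolsmall, i.e. the AKS node pool named "raypoolsmall".
# Verify the labels on your cluster with: kubectl get nodes --show-labels
nodeSelector:
  agentpool: raypoolsmall
```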
```yaml
workers:
  # First worker group type specification
  worker_type_default:
    groupName: workergroupdefault
    replicas: 1
    type: worker
    labels:
      key: value
    initArgs:
      node-ip-address: $MY_POD_IP
      redis-password: LetMeInRay
      block: 'true'
    containerEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
    - name: RAY_DISABLE_DOCKER_CPU_WARNING
      value: "1"
    - name: CPU_REQUEST
      valueFrom:
        resourceFieldRef:
          containerName: ray-worker
          resource: requests.cpu
    - name: MY_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    envFrom: []
    # - secretRef:
    #     name: my-env-secret
    ports:
    - containerPort: 80
      protocol: TCP
    resources:
      limits:
        cpu: 15
      requests:
        cpu: 1
    annotations:
      key: value
    nodeSelector: {"agentpool": "raypool"}
    tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
    affinity: {}
    volumes:
    - name: log-volume
      emptyDir: {}
    volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  # Second worker group type specification
  worker_type_small:
    groupName: workergroupsmall
    replicas: 1
    type: worker
    labels:
      key: value
    initArgs:
      node-ip-address: $MY_POD_IP
      redis-password: LetMeInRay
      block: 'true'
    containerEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
    - name: RAY_DISABLE_DOCKER_CPU_WARNING
      value: "1"
    - name: CPU_REQUEST
      valueFrom:
        resourceFieldRef:
          containerName: ray-worker
          resource: requests.cpu
    - name: MY_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    envFrom: []
    # - secretRef:
    #     name: my-env-secret
    ports:
    - containerPort: 80
      protocol: TCP
    resources:
      limits:
        cpu: 7
      requests:
        cpu: 1
    annotations:
      key: value
    nodeSelector: {"agentpool": "raypoolsmall"}
    tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
    affinity: {}
    volumes:
    - name: log-volume
      emptyDir: {}
    volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  # Third worker group type specification
  worker_type_large:
    groupName: workergrouplarge
    replicas: 1
    type: worker
    labels:
      key: value
    initArgs:
      node-ip-address: $MY_POD_IP
      redis-password: LetMeInRay
      block: 'true'
    containerEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
    - name: RAY_DISABLE_DOCKER_CPU_WARNING
      value: "1"
    - name: CPU_REQUEST
      valueFrom:
        resourceFieldRef:
          containerName: ray-worker
          resource: requests.cpu
    - name: MY_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    envFrom: []
    # - secretRef:
    #     name: my-env-secret
    ports:
    - containerPort: 80
      protocol: TCP
    resources:
      limits:
        cpu: 31
      requests:
        cpu: 1
    annotations:
      key: value
    nodeSelector: {"agentpool": "raypoollarge"}
    tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
    affinity: {}
    volumes:
    - name: log-volume
      emptyDir: {}
    volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
```
I also forgot to mention: in the above example I added `tolerations`, as I tend to use spot instances on Azure to save cost. They aren't needed for the demonstration.
@DmitriGekhtman I put an initial PR (#264) up above, but I couldn't add you as a reviewer as I think I don't have permissions.
This has been solved.