
[Feature][Helm] Support multiple Ray worker types in the KubeRay Helm chart

DmitriGekhtman opened this issue 2 years ago • 7 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

KubeRay's RayCluster CRD supports multiple worker groups. However, the Helm chart supports only one worker group. It would be good to expose multiple groups.

Implementation ideas:

  • One way to achieve this is to add a workerGroups field to the chart's values: a map from user-specified string names to the configuration that currently lives under the worker field.

  • Another option is to make workerGroups an array of group specs.

The Helm docs suggest a map might be preferred over an array when values in the chart are overridden with --set. However, in my experience, a map can be confusing, because the user may interpret the keys as part of the schema rather than as arbitrary names.

workerGroups would either replace the current worker field, or the worker config would be internally appended to workerGroups (see the sketch below).
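
To make the two options concrete, here is a rough sketch of the values.yaml shapes they imply; the field names (workerGroups, the group keys, and the per-group fields) are illustrative only, not the chart's current schema:

# Option 1: a map keyed by user-chosen group names
workerGroups:
  cpuGroup:
    replicas: 2
    resources:
      limits:
        cpu: 8
  gpuGroup:
    replicas: 1
    resources:
      limits:
        nvidia.com/gpu: 1

# Option 2: an array of group specs, each carrying its own name
workerGroups:
  - groupName: cpuGroup
    replicas: 2
  - groupName: gpuGroup
    replicas: 1

With the map form, a single group can be tweaked at install time with e.g. --set workerGroups.gpuGroup.replicas=3; with the array form, --set has to address entries by index (workerGroups[1].replicas=3), which breaks if the list is reordered.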

Use case

Easy configurability for mixed workloads (e.g. workloads that require both GPUs and CPUs).

Related issues

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

DmitriGekhtman · May 11 '22 17:05

Just to back up what @DmitriGekhtman has already said about this feature: I have found it advantageous to be able to specify pod types, as you can with the Kubernetes Helm chart deployment, and then match these up with underlying node pools. The cluster can then determine the most efficient worker node instance types to use for a given resource demand.

For example, given building blocks of agent pools with 8, 16, 32, and 64 CPU instances, the cluster can work out which combination of instance types most closely matches the required CPU resources. This avoids spinning up, say, a 32 CPU node when only 8 CPUs are needed.

ecm200 · May 11 '22 18:05

A Kubernetes-based Helm chart (i.e. non-KubeRay) would look like the following:

podTypes:
    # The key for each podType is a user-defined string.
    # Since we set headPodType: rayHeadType, the Ray head pod will use the configuration
    # defined in this entry of podTypes:
    rayHeadType:
        # CPU is the number of CPUs used by this pod type.
        # (Used for both requests and limits. Must be an integer, as Ray does not support fractional CPUs.)
        CPU: 3
        # memory is the memory used by this Pod type.
        # (Used for both requests and limits.)
        memory: 3064Mi
        # GPU is the number of NVIDIA GPUs used by this pod type.
        # (Optional, requires GPU nodes with appropriate setup. See https://docs.ray.io/en/master/cluster/kubernetes-gpu.html)
        GPU: 0
        # rayResources is an optional string-int mapping signalling additional resources to Ray.
        # "CPU", "GPU", and "memory" are filled automatically based on the above settings, but can be overridden;
        # For example, rayResources: {"CPU": 0} can be used in the head podType to prevent Ray from scheduling tasks on the head.
        # See https://docs.ray.io/en/master/advanced.html#dynamic-remote-parameters for an example of usage of custom resources in a Ray task.
        rayResources: {"CPU": 0}
        # Optionally, set a node selector for this podType: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
        nodeSelector: {"agentpool": "agentpool"}

        # tolerations for Ray pods of this podType (the head's podType in this case)
        #   ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
        #   Note that it is often not necessary to manually specify tolerations for GPU
        #   usage on managed platforms such as AKS, EKS, and GKE.
        #   ref: https://docs.ray.io/en/master/cluster/kubernetes-gpu.html
        tolerations: []
        # - key: "nvidia.com/gpu"
        #   operator: Exists
        #   effect: NoSchedule

    rayWorkerType_default:
        # minWorkers is the minimum number of Ray workers of this pod type to keep running.
        minWorkers: 0
        # maxWorkers is the maximum number of Ray workers of this pod type to which Ray will scale.
        maxWorkers: 4
        # memory is the memory used by this Pod type.
        # (Used for both requests and limits.)
        memory: 56000Mi
        # CPU is the number of CPUs used by this pod type.
        # (Used for both requests and limits. Must be an integer, as Ray does not support fractional CPUs.)
        CPU: 15
        # GPU is the number of NVIDIA GPUs used by this pod type.
        # (Optional, requires GPU nodes with appropriate setup. See https://docs.ray.io/en/master/cluster/kubernetes-gpu.html)
        GPU: 0
        # rayResources is an optional string-int mapping signalling additional resources to Ray.
        # "CPU", "GPU", and "memory" are filled automatically based on the above settings, but can be overridden;
        # For example, rayResources: {"CPU": 0} can be used in the head podType to prevent Ray from scheduling tasks on the head.
        # See https://docs.ray.io/en/master/advanced.html#dynamic-remote-parameters for an example of usage of custom resources in a Ray task.
        rayResources: {} #{"PODTYPE_VLARGE": 15}
        # Optionally, set a node selector for this Pod type. See https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
        nodeSelector: {"agentpool": "raypool"}

        # tolerations for Ray pods of this podType
        #   ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
        #   Note that it is often not necessary to manually specify tolerations for GPU
        #   usage on managed platforms such as AKS, EKS, and GKE.
        #   ref: https://docs.ray.io/en/master/cluster/kubernetes-gpu.html
        tolerations: 
        - key: kubernetes.azure.com/scalesetpriority
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
        # - key: nvidia.com/gpu
        #   operator: Exists
        #   effect: NoSchedule

    rayWorkerType_large:
        # minWorkers is the minimum number of Ray workers of this pod type to keep running.
        minWorkers: 0
        # maxWorkers is the maximum number of Ray workers of this pod type to which Ray will scale.
        maxWorkers: 4
        # memory is the memory used by this Pod type.
        # (Used for both requests and limits.)
        memory: 110000Mi
        # CPU is the number of CPUs used by this pod type.
        # (Used for both requests and limits. Must be an integer, as Ray does not support fractional CPUs.)
        CPU: 31
        # GPU is the number of NVIDIA GPUs used by this pod type.
        # (Optional, requires GPU nodes with appropriate setup. See https://docs.ray.io/en/master/cluster/kubernetes-gpu.html)
        GPU: 0
        # rayResources is an optional string-int mapping signalling additional resources to Ray.
        # "CPU", "GPU", and "memory" are filled automatically based on the above settings, but can be overridden;
        # For example, rayResources: {"CPU": 0} can be used in the head podType to prevent Ray from scheduling tasks on the head.
        # See https://docs.ray.io/en/master/advanced.html#dynamic-remote-parameters for an example of usage of custom resources in a Ray task.
        rayResources: {} # {"PODTYPE_VLARGE": 15}
        # Optionally, set a node selector for this Pod type. See https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
        nodeSelector: {"agentpool": "raypoollarge"}

        # tolerations for Ray pods of this podType
        #   ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
        #   Note that it is often not necessary to manually specify tolerations for GPU
        #   usage on managed platforms such as AKS, EKS, and GKE.
        #   ref: https://docs.ray.io/en/master/cluster/kubernetes-gpu.html
        tolerations: 
        - key: kubernetes.azure.com/scalesetpriority
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
        # - key: nvidia.com/gpu
        #   operator: Exists
        #   effect: NoSchedule

    rayWorkerType_small:
        # minWorkers is the minimum number of Ray workers of this pod type to keep running.
        minWorkers: 0
        # maxWorkers is the maximum number of Ray workers of this pod type to which Ray will scale.
        maxWorkers: 8
        # memory is the memory used by this Pod type.
        # (Used for both requests and limits.)
        memory: 28000Mi
        # CPU is the number of CPUs used by this pod type.
        # (Used for both requests and limits. Must be an integer, as Ray does not support fractional CPUs.)
        CPU: 7
        # GPU is the number of NVIDIA GPUs used by this pod type.
        # (Optional, requires GPU nodes with appropriate setup. See https://docs.ray.io/en/master/cluster/kubernetes-gpu.html)
        GPU: 0
        # rayResources is an optional string-int mapping signalling additional resources to Ray.
        # "CPU", "GPU", and "memory" are filled automatically based on the above settings, but can be overridden;
        # For example, rayResources: {"CPU": 0} can be used in the head podType to prevent Ray from scheduling tasks on the head.
        # See https://docs.ray.io/en/master/advanced.html#dynamic-remote-parameters for an example of usage of custom resources in a Ray task.
        rayResources: {} #{"PODTYPE_VLARGE": 15}
        # Optionally, set a node selector for this Pod type. See https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
        nodeSelector: {"agentpool": "raypoolsmall"}

        # tolerations for Ray pods of this podType
        #   ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
        #   Note that it is often not necessary to manually specify tolerations for GPU
        #   usage on managed platforms such as AKS, EKS, and GKE.
        #   ref: https://docs.ray.io/en/master/cluster/kubernetes-gpu.html
        tolerations: 
        - key: kubernetes.azure.com/scalesetpriority
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
        # - key: nvidia.com/gpu
        #   operator: Exists
        #   effect: NoSchedule

ecm200 · May 11 '22 18:05

A Kubernetes-based Helm chart (i.e. non-KubeRay)

KubeRay is also Kubernetes-based :) Just a different implementation targeting similar problems.

agent pools with 8, 16, 32, and 64 CPU instances

That makes sense from a cost perspective! On the other hand, you may get better performance running a few big Ray nodes vs. many small ones -- with a few big nodes, there's less network overhead and fewer local Ray control structures running. The goal is to find the right balance :)

DmitriGekhtman · May 11 '22 19:05

KubeRay is also Kubernetes-based :) Just a different implementation targeting similar problems.

@DmitriGekhtman you are indeed most correct, sir; that was a poor description on my part. I was attempting to differentiate it from the KubeRay implementation. :-)

You make some good points on the size of nodes!

ecm200 · May 11 '22 19:05

@DmitriGekhtman I was wondering if something like this would work:

Modify the workerGroupSpecs section (as you suggested) of the raycluster-cluster.yaml template in the KubeRay ray-cluster Helm chart to iterate over a nested structure. The section in values.yaml would be named workers, and each entry under it defines a named worker group.

The template is as follows:

  workerGroupSpecs:
  {{- range $key, $val := .Values.workers }}
  - groupName: {{ $val.groupName }}
    rayStartParams:
    {{- range $key, $val := $val.initArgs }}
      {{ $key }}: {{ $val | quote }}
    {{- end }}
    replicas: {{ $val.replicas }}
    minReplicas: {{ $val.miniReplicas | default 1 }}
    maxReplicas: {{ $val.maxiReplicas | default 2147483647 }}
    template:
      spec:
        imagePullSecrets: {{- toYaml $.Values.imagePullSecrets | nindent 10 }}
        initContainers:
          - name: init-myservice
            image: busybox:1.28
            command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
        containers:
          - volumeMounts: {{- toYaml $val.volumeMounts | nindent 12 }}
            name: ray-worker
            image: {{ $.Values.image.repository }}:{{ $.Values.image.tag }}
            imagePullPolicy: {{ $.Values.image.pullPolicy }}
            resources: {{- toYaml $val.resources | nindent 14 }}
            env:
              - name: TYPE
                value: worker
            {{- toYaml $val.containerEnv | nindent 14}}
            {{- with $val.envFrom }}
            envFrom: {{- toYaml . | nindent 14}}
            {{- end }}
            ports: {{- toYaml $val.ports | nindent 14}}
        volumes: {{- toYaml $val.volumes | nindent 10 }}
        affinity: {{- toYaml $val.affinity | nindent 10 }}
        tolerations: {{- toYaml $val.tolerations | nindent 10 }}
        nodeSelector: {{- toYaml $val.nodeSelector | nindent 10 }}
      metadata:
        annotations: {{- toYaml $val.annotations | nindent 10 }}
        labels:
          rayNodeType: {{ $val.type }}
          groupName: {{ $val.groupName }}
          rayCluster: {{ include "ray-cluster.fullname" $ }}
  {{- end }} 
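
One detail worth calling out in the template above: inside the range block, . is rebound to the current entry of .Values.workers, so chart-wide settings have to be reached through the root context $, while per-group settings come from $val. A minimal sketch of the pattern (field names as in the template above):

  {{- range $key, $val := .Values.workers }}
    image: {{ $.Values.image.repository }}:{{ $.Values.image.tag }}   # chart-level value, via $
    replicas: {{ $val.replicas }}                                     # per-group value, via $val
  {{- end }}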

An example values.yaml for the chart is shown below.

Here I have modified the resources section to match the number of CPUs on each instance type: I interpret resources.limits.cpu as the total number of CPU units available to that worker type, and resources.requests.cpu as the minimum CPU increment that can be requested. Did I get that right?

To match each worker type specification to the correct compute instance type, the nodeSelector field selects the node pool that this type of worker is allowed to be deployed to.
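
As a point of reference for the resources interpretation above (standard Kubernetes semantics rather than anything chart-specific, using the numbers from the default worker type below): requests.cpu is the amount the scheduler reserves on a node when placing the pod, while limits.cpu is the hard cap enforced at runtime. If I recall correctly, KubeRay derives the CPU count advertised to Ray from the container limits unless num-cpus is set explicitly in rayStartParams.

resources:
  requests:
    cpu: 1     # reserved by the Kubernetes scheduler for pod placement
  limits:
    cpu: 15    # hard cap at runtime; typically what the Ray worker advertises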

workers:
  # First worker group type specification
  worker_type_default:
    groupName: workergroupdefault
    replicas: 1
    type: worker
    labels:
      key: value
    initArgs:
      node-ip-address: $MY_POD_IP
      redis-password: LetMeInRay
      block: 'true'
    containerEnv:
      - name: MY_POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: RAY_DISABLE_DOCKER_CPU_WARNING
        value: "1"
      - name: CPU_REQUEST
        valueFrom:
          resourceFieldRef:
            containerName: ray-worker
            resource: requests.cpu
      - name: MY_POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
    envFrom: []
      # - secretRef:
      #     name: my-env-secret
    ports:
      - containerPort: 80
        protocol: TCP
    resources: 
      limits:
        cpu: 15
      requests:
        cpu: 1
    annotations:
      key: value
    nodeSelector: {"agentpool": "raypool"}
    tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
    affinity: {}
    volumes:
      - name: log-volume
        emptyDir: {}
    volumeMounts:
      - mountPath: /tmp/ray
        name: log-volume
  
  # Second worker group type specification
  worker_type_small:
    groupName: workergroupsmall
    replicas: 1
    type: worker
    labels:
      key: value
    initArgs:
      node-ip-address: $MY_POD_IP
      redis-password: LetMeInRay
      block: 'true'
    containerEnv:
      - name: MY_POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: RAY_DISABLE_DOCKER_CPU_WARNING
        value: "1"
      - name: CPU_REQUEST
        valueFrom:
          resourceFieldRef:
            containerName: ray-worker
            resource: requests.cpu
      - name: MY_POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
    envFrom: []
      # - secretRef:
      #     name: my-env-secret
    ports:
      - containerPort: 80
        protocol: TCP
    resources: 
      limits:
        cpu: 7
      requests:
        cpu: 1
    annotations:
      key: value
    nodeSelector: {"agentpool": "raypoolsmall"}
    tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
    affinity: {}
    volumes:
      - name: log-volume
        emptyDir: {}
    volumeMounts:
      - mountPath: /tmp/ray
        name: log-volume

  # Third worker group type specification
  worker_type_large:
    groupName: workergrouplarge
    replicas: 1
    type: worker
    labels:
      key: value
    initArgs:
      node-ip-address: $MY_POD_IP
      redis-password: LetMeInRay
      block: 'true'
    containerEnv:
      - name: MY_POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: RAY_DISABLE_DOCKER_CPU_WARNING
        value: "1"
      - name: CPU_REQUEST
        valueFrom:
          resourceFieldRef:
            containerName: ray-worker
            resource: requests.cpu
      - name: MY_POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
    envFrom: []
      # - secretRef:
      #     name: my-env-secret
    ports:
      - containerPort: 80
        protocol: TCP
    resources: 
      limits:
        cpu: 31
      requests:
        cpu: 1
    annotations:
      key: value
    nodeSelector: {"agentpool": "raypoollarge"}
    tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
    affinity: {}
    volumes:
      - name: log-volume
        emptyDir: {}
    volumeMounts:
      - mountPath: /tmp/ray
        name: log-volume
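
For anyone trying this out, a quick sanity check (assuming the modified chart lives at ./helm-chart/ray-cluster and the values above are saved as my-values.yaml; both paths are placeholders) is to render the chart and confirm that one workerGroupSpecs entry is generated per key under workers:

helm template raycluster ./helm-chart/ray-cluster -f my-values.yaml | grep "groupName:"

The output should include workergroupdefault, workergroupsmall, and workergrouplarge (each group name appears both in its workerGroupSpecs entry and in the pod template labels).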

ecm200 · May 11 '22 19:05

I also forgot to mention: in the above example I added tolerations because I tend to use spot instances on Azure to save cost. They aren't strictly needed for the demonstration.

ecm200 · May 11 '22 19:05

@DmitriGekhtman I put an initial PR (#264) up above, but I couldn't add you as a reviewer, as I don't think I have the permissions.

ecm200 · May 16 '22 13:05

This has been solved.

DmitriGekhtman · Nov 18 '22 19:11