Add Support for NVIDIA Network Operator to the Kubernetes Scheduler
Description
Provide a way to use the NVIDIA Network Operator through the CLI and API of the Kubernetes scheduler.
Motivation/Background
The NVIDIA Network Operator enables RDMA devices and other fast networking components to be used in containerized environments. Fast networking is critical for the performance of workloads that span multiple nodes.
The network operator can provide access to RDMA devices by using either a MacvlanNetwork or a HostDeviceNetwork. One example shows how a pod can be attached to a MacvlanNetwork, and another example shows how a pod can be attached to a HostDeviceNetwork. In either case, the critical parts are:
- the resource being requested (e.g., rdma/rdma_shared_device_a: 1 or nvidia.com/hostdev: '1')
- the IPC_LOCK security context capability
- the annotation that specifies the particular network to attach to (e.g., k8s.v1.cni.cncf.io/networks: rdma-net-ipam)
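For illustration only, here is a rough sketch (using the Kubernetes Python client, independent of TorchX) of where each of these pieces lands in a pod spec; the pod name, container name, and image are placeholders, and the resource/network names mirror the examples above:

    from kubernetes.client.models import (
        V1Capabilities,
        V1Container,
        V1ObjectMeta,
        V1Pod,
        V1PodSpec,
        V1ResourceRequirements,
        V1SecurityContext,
    )

    pod = V1Pod(
        metadata=V1ObjectMeta(
            name="rdma-test-pod",
            # the annotation selecting the secondary network to attach to
            annotations={"k8s.v1.cni.cncf.io/networks": "rdma-net-ipam"},
        ),
        spec=V1PodSpec(
            containers=[
                V1Container(
                    name="trainer",
                    image="my/train:latest",  # placeholder image
                    # the RDMA resource being requested (shared-device flavor here)
                    resources=V1ResourceRequirements(
                        limits={"rdma/rdma_shared_device_a": "1"},
                    ),
                    # the IPC_LOCK capability needed to pin memory for RDMA
                    security_context=V1SecurityContext(
                        capabilities=V1Capabilities(add=["IPC_LOCK"]),
                    ),
                )
            ]
        ),
    )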
Detailed Proposal
Before detailing a specific proposal, I'd like to hear from the team about how feasible this sounds so far and whether any existing facilities might already help with some of this.
Thanks for the proposal. I think this is something worth adding. Happy to take a look at the PR if you're up to contribute!
the resource being requested (e.g., rdma/rdma_shared_device_a: 1 or nvidia.com/hostdev: '1')
How coupled are these to host types?
One way to achieve this today is by defining your custom named resource (see docs) and adding the resources/devices to the torchx.specs.api.Resource#devices map. For instance:
    from torchx.specs.api import Resource

    # Custom named resource: fill in cpu/memMB/gpu for your node type and list the
    # RDMA device(s) to request.
    def my_custom_machine_type() -> Resource:
        return Resource(
            cpu=...,
            memMB=...,
            gpu=...,
            devices={
                "rdma/rdma_shared_device_a": "1",
                "nvidia.com/hostdev": "1",
            },
        )
Take a look at the aws named resources as an example.
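Registering the factory so it can be selected by name would go through the torchx.named_resources entry point group, roughly along these lines (the project, module, and resource names below are placeholders; see the named resources docs for the exact mechanism):

    # setup.py of your own project
    from setuptools import setup

    setup(
        name="my_project",
        packages=["my_project"],
        entry_points={
            "torchx.named_resources": [
                "my_custom_machine_type = my_project.resources:my_custom_machine_type",
            ],
        },
    )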
The kubernetes_scheduler will translate the Resource.devices field into the container's resource limit in the k8s pod spec (see code). Note that currently we only set resource.limit and not resource.request (as shown in the example you mentioned), so this wouldn't work if you were stacking containers on a single host. That said, since RDMA is involved I'm assuming you're interested in distributed training/inference, and for distributed workloads (especially on GPUs) I haven't seen folks want to stack containers (e.g. one per GPU); instead they'd have one container per host and use torchrun as the entrypoint to create one process per GPU (within the container).
the IPC_LOCK security context capability
The kubernetes scheduler in torchx runs with privileged = True by default (see code). In this case, is this still needed?
the annotation that specifies the particular network to attach to (e.g., k8s.v1.cni.cncf.io/networks: rdma-net-ipam)
How coupled are these to host types? (e.g. how often would a user need to configure this?)
If the answer is "per-job" then we'd want to add this as part of the scheduler arguments.
Otherwise, we can add the capability to take specific keys from torchx.specs.api.Resource.capabilities, for example:
    Resource(
        ...,
        capabilities={
            "k8s.pod.metadata.annotations": {
                "k8s.v1.cni.cncf.io/networks": "rdma-net-ipam"
            },
        },
    )
And in kubernetes_scheduler.py:role_to_pod we can read role.resource.capabilities["k8s.pod.metadata.annotations"] and stick it into the Pod spec's metadata.annotations.
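A rough sketch of what that could look like (hypothetical code, not what's in kubernetes_scheduler.py today; the capability key is just the convention proposed above):

    from kubernetes.client.models import V1ObjectMeta, V1Pod
    from torchx.specs.api import Resource

    ANNOTATIONS_KEY = "k8s.pod.metadata.annotations"

    def apply_capability_annotations(pod: V1Pod, resource: Resource) -> V1Pod:
        # Copy any annotations supplied via Resource.capabilities onto the generated pod.
        annotations = resource.capabilities.get(ANNOTATIONS_KEY, {})
        if annotations:
            pod.metadata = pod.metadata or V1ObjectMeta()
            pod.metadata.annotations = {**(pod.metadata.annotations or {}), **annotations}
        return pod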
Great, thanks for the feedback and pointers. I'd be open to putting together a PR if this seems the right thing to do.
the resource being requested (e.g., rdma/rdma_shared_device_a: 1 or nvidia.com/hostdev: '1')
How coupled are these to host types?
I'm not sure what you mean by host types in this context, but the use-cases I'm interested in would always involve multiple hosts (K8s nodes) that are networked together via RDMA. The two network types, HostDeviceNetwork and MacvlanNetwork, are both CRDs that the Network Operator provides; they would need to be created before launching the TorchX job, and either one should be usable on any RDMA-capable host. The main difference I can see between the two is that attaching to a HostDeviceNetwork (and requesting an nvidia.com/hostdev) allocates complete access to the RDMA device to the pod, whereas attaching to a MacvlanNetwork (and requesting an rdma/my_shared_rdma_device) does not exclude other pods from subsequently accessing and sharing that device.
One way to achieve this today is by defining your custom named resource
Great, I'll try adding a custom resource in this way.
Note that currently we only set resource.limit and not resource.request (as shown in the example you mentioned) so this wouldn't work if you were stacking containers on a single host.
We've been using one container per node as well, so I think that assumption is OK to continue with.
the IPC_LOCK security context capability
The kubernetes scheduler in torchx runs with privileged = True by default (see code). In this case, is this still needed?
OK, then I don't think the IPC_LOCK capability would be needed for K8s, but has OCP (OpenShift) ever been tested/supported? That's one platform supported by the GPU Operator/Network Operator that's restricted security-wise.
the annotation that specifies the particular network to attach to (e.g., k8s.v1.cni.cncf.io/networks: rdma-net-ipam)
How coupled are these to host types? (e.g. how often would a user need to configure this?)
If the answer is "per-job" then we'd want to add this as part of the scheduler arguments. Otherwise, we can add the capability to take specific keys from torchx.specs.api.Resource.capabilities
It wouldn't change from job to job. The value is the name of the network you'd like the pod to attach to, e.g., what you see from kubectl get hostdevicenetworks. And that entity would have been created previously and would remain after the job is done. So I'll give this method a try as well.
Great, looking forward to the PR!
Since you're going to be registering custom named resources (this would be in your project), the only thing that would need to be done in the PR is to add the metadata.annotations handling to kubernetes_scheduler.py.
Follow-ups below:
I'm not sure what you mean by host types in this context
I was referring to a physical machine type.
Looks like these are semi-static "resource" configurations (e.g. once you set up a couple of resource definitions, you can reuse them for the jobs you launch onto the cluster until the cluster's resources change - new host types added, deprecated ones removed, etc). So defining a few "named resources" that the user can select when launching the job would work nicely here.
Note that while torchx currently requires the named resource static factory methods to be no-argument, if you had a valid use-case for a dynamic resource parameter, you could set this in the custom component (e.g. write your own version of dist.ddp).
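For example, a custom component could wrap dist.ddp and expose the device count as an argument. A hypothetical sketch (ddp_rdma and rdma_devices are made-up names, and the cpu/gpu/memMB values are placeholders):

    import torchx.specs as specs
    from torchx.components import dist

    def ddp_rdma(
        *script_args: str,
        script: str,
        j: str = "2x8",          # nodes x procs-per-node
        rdma_devices: int = 1,
    ) -> specs.AppDef:
        # Start from the stock ddp component, then request the RDMA device on each role;
        # Resource.devices becomes the container's resource limits in the k8s pod spec.
        app = dist.ddp(*script_args, script=script, j=j, cpu=16, gpu=8, memMB=512 * 1024)
        for role in app.roles:
            role.resource.devices["rdma/rdma_shared_device_a"] = rdma_devices
        return app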
In the context of k8s, a torchx named resource would map to one of an enumeration of the most commonly used container.resource specs (one for each). Since most users run a single container per machine, the physical machine type in the k8s cluster dictates the container.resource spec, so there is effectively a 1:1 mapping between torchx named resources and machine types in the cluster.
But, as in your case, there are valid use-cases where the user might want to define more than one named resource for a specific machine type in the cluster. And this is what I meant by "host type".
but has OCP (OpenShift) ever been tested/supported?
AFAIK no. I'm not too familiar with OCP but skimming over their docs it seems like we'd have to implement a torchx.schedulers.open_shift_scheduler.py (similar to the kubernetes_mcad_scheduler.py) since the kubernetes_scheduler in torchx today uses Volcano to do gang scheduling.
but has OCP (OpenShift) ever been tested/supported?
AFAIK no. I'm not too familiar with OCP but skimming over their docs it seems like we'd have to implement a torchx.schedulers.open_shift_scheduler.py (similar to the kubernetes_mcad_scheduler.py) since the kubernetes_scheduler in torchx today uses Volcano to do gang scheduling.
OK. If we get there in the future, the Volcano FAQ mentions some modifications needed to run on OCP.
With the network operator, when we configure a secondary network like this MacvlanNetwork, we include IPAM info:
    apiVersion: mellanox.com/v1alpha1
    kind: MacvlanNetwork
    metadata:
      name: macvlannetwork
    spec:
      networkNamespace: "default"
      master: "enp141s0f0np0"
      mode: "bridge"
      mtu: 1500
      ipam: |
        {
          "type": "whereabouts",
          "datastore": "kubernetes",
          "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
          },
          "range": "192.168.2.225/28",
          "log_file": "/var/log/whereabouts.log",
          "log_level": "info",
          "gateway": "192.168.2.1"
        }
The attached pods are then configured with an additional interface called net1 whose IP address corresponds to the range defined by the secondary network:
    [root@pod2 /]# ip -4 addr show
    . . .
    4: eth0@if800: < . . . >
        inet 192.168.32.54/32 scope global eth0
    5: net1@if26: < . . . >
        inet 192.168.2.226/28 brd 192.168.2.239 scope global net1
However, it seems that Volcano defines its service endpoints with addresses of the primary interfaces (ones that are not on the RDMA network). And TorchX then gives the MASTER_ADDR as an FQDN like bert-v56fmjp7qkbzpc-bert-0-0.bert-v56fmjp7qkbzpc.default.svc.cluster.local, which matches up with the defined endpoint.
Does that problem description make sense? What I'd like is for the service endpoints to use the IP addresses of the secondary RDMA-based network, not the primary one.
I will also ask the Network Operator team about this use-case.
It looks like TorchX facilitates communication between nodes by spinning up a K8s service and passing the name of the master pod's endpoint in MASTER_ADDR. However, according to the Network Operator team:
Currently, service name resolution for secondary networks is not supported by the network operator.
I believe this means that we'll have to pass in the IP address of the master pod a different way. How can we manually set MASTER_ADDR?
I believe this means that we'll have to pass in the IP address of the master pod a different way. How can we manually set MASTER_ADDR?
It would be more in line with Kubernetes design principles to make the change in the network operator to support name resolution. However, in case you still want to pass it in manually until that change is made, rdzv_endpoint is defined at the component level in dist.py, so when creating your own component you could specify a parameter for a static IP and pass it there.
You can see torchx/components/dist.py for an example of how the rdzv endpoint is set up there.
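A hypothetical sketch of such a component, building the torchrun arguments with a static rendezvous endpoint instead of the service-name-based MASTER_ADDR (the component name, master_ip parameter, image, port, and resource values are all placeholders, not existing torchx code):

    import torchx.specs as specs

    def ddp_static_rdzv(
        *script_args: str,
        script: str,
        master_ip: str,                    # e.g. the rank-0 pod's IP on the RDMA network
        nnodes: int = 2,
        nproc_per_node: int = 8,
        image: str = "my/train:latest",    # placeholder image
    ) -> specs.AppDef:
        return specs.AppDef(
            name="ddp-static-rdzv",
            roles=[
                specs.Role(
                    name="trainer",
                    image=image,
                    entrypoint="torchrun",
                    args=[
                        f"--nnodes={nnodes}",
                        f"--nproc_per_node={nproc_per_node}",
                        "--rdzv_backend=c10d",
                        # static endpoint instead of the generated service FQDN
                        f"--rdzv_endpoint={master_ip}:29500",
                        script,
                        *script_args,
                    ],
                    num_replicas=nnodes,
                    # request RDMA devices via a custom named resource or an inline Resource
                    resource=specs.Resource(
                        cpu=16, gpu=8, memMB=512 * 1024,
                        devices={"nvidia.com/hostdev": 1},
                    ),
                )
            ],
        )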