Add Support for NVIDIA Network Operator to the Kubernetes Scheduler
Description
Provide a way to use the NVIDIA Network Operator through the CLI and API of the Kubernetes scheduler.
Motivation/Background
The NVIDIA Network Operator enables RDMA devices and other fast networking components to be used in containerized environments. Fast networking is critical for the performance of workloads that span multiple nodes.
The network operator can provide access to RDMA devices by using either a MacvlanNetwork or a HostDeviceNetwork. One example shows how a pod can be attached to a MacvlanNetwork, and another example shows how a pod can be attached to a HostDeviceNetwork. In either case, the critical parts are:
- the resource being requested (e.g., rdma/rdma_shared_device_a: 1 or nvidia.com/hostdev: '1')
- the IPC_LOCK security context capability
- the annotation that specifies the particular network to attach to (e.g., k8s.v1.cni.cncf.io/networks: rdma-net-ipam)
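For illustration only, here is a rough sketch (using the Kubernetes Python client, independent of TorchX) of where each of these pieces lands in a pod spec; the pod name, container name, and image are placeholders, and the resource/network names mirror the examples above:

    from kubernetes.client.models import (
        V1Capabilities,
        V1Container,
        V1ObjectMeta,
        V1Pod,
        V1PodSpec,
        V1ResourceRequirements,
        V1SecurityContext,
    )

    pod = V1Pod(
        metadata=V1ObjectMeta(
            name="rdma-test-pod",
            # the annotation selecting the secondary network to attach to
            annotations={"k8s.v1.cni.cncf.io/networks": "rdma-net-ipam"},
        ),
        spec=V1PodSpec(
            containers=[
                V1Container(
                    name="trainer",
                    image="my/train:latest",  # placeholder image
                    # the RDMA resource being requested (shared-device flavor here)
                    resources=V1ResourceRequirements(
                        limits={"rdma/rdma_shared_device_a": "1"},
                    ),
                    # the IPC_LOCK capability needed to pin memory for RDMA
                    security_context=V1SecurityContext(
                        capabilities=V1Capabilities(add=["IPC_LOCK"]),
                    ),
                )
            ]
        ),
    )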
Detailed Proposal
Before detailing a specific proposal, I'd like to hear from the team about how feasible this sounds so far and whether any existing facilities might already help with some of this.
Thanks for the proposal. I think this is something worth adding. Happy to take a look at the PR if you're up to contribute!
the resource being requested (e.g., rdma/rdma_shared_device_a: 1 or nvidia.com/hostdev: '1')
How coupled are these to host types?
One way to achieve this today is by defining your custom named resource (see docs) and adding the resources/devices to the torchx.specs.api.Resource#devices map. For instance:
    from torchx.specs.api import Resource

    # Custom named resource: fill in cpu/memMB/gpu for your node type and list the
    # RDMA device(s) to request.
    def my_custom_machine_type() -> Resource:
        return Resource(
            cpu=...,
            memMB=...,
            gpu=...,
            devices={
                "rdma/rdma_shared_device_a": "1",
                "nvidia.com/hostdev": "1",
            },
        )
Take a look at the aws named resources as an example.
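Registering the factory so it can be selected by name would go through the torchx.named_resources entry point group, roughly along these lines (the project, module, and resource names below are placeholders; see the named resources docs for the exact mechanism):

    # setup.py of your own project
    from setuptools import setup

    setup(
        name="my_project",
        packages=["my_project"],
        entry_points={
            "torchx.named_resources": [
                "my_custom_machine_type = my_project.resources:my_custom_machine_type",
            ],
        },
    )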
The kubernetes_scheduler will translate the Resource.devices field into the container's resource limit in the k8s pod spec (see code). Note that currently we only set resource.limit and not resource.request (as shown in the example you mentioned), so this wouldn't work if you were stacking containers on a single host. That said, since RDMA is involved I'm assuming you're interested in distributed training/inference, and for distributed workloads (especially on GPUs) I haven't seen folks want to stack containers (e.g. one per GPU); instead they'd have one container per host and use torchrun as the entrypoint to create one process per GPU (within the container).
the IPC_LOCK security context capability
The kubernetes scheduler in torchx runs with privileged = True by default (see code). In this case, is this still needed?
the annotation that specifies the particular network to attach to (e.g., k8s.v1.cni.cncf.io/networks: rdma-net-ipam)
How coupled are these to host types? (e.g. how often would a user need to configure this?)
If the answer is "per-job" then we'd want to add this as part of the scheduler arguments.
Otherwise, we can add the capability to take specific keys from torchx.specs.api.Resource.capabilities, for example:
    Resource(
        ...,
        capabilities={
            "k8s.pod.metadata.annotations": {
                "k8s.v1.cni.cncf.io/networks": "rdma-net-ipam"
            },
        },
    )
And in kubernetes_scheduler.py:role_to_pod we can read role.resource.capabilities["k8s.pod.metadata.annotations"] and stick it into the Pod spec's metadata.annotations.
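A rough sketch of what that could look like (hypothetical code, not what's in kubernetes_scheduler.py today; the capability key is just the convention proposed above):

    from kubernetes.client.models import V1ObjectMeta, V1Pod
    from torchx.specs.api import Resource

    ANNOTATIONS_KEY = "k8s.pod.metadata.annotations"

    def apply_capability_annotations(pod: V1Pod, resource: Resource) -> V1Pod:
        # Copy any annotations supplied via Resource.capabilities onto the generated pod.
        annotations = resource.capabilities.get(ANNOTATIONS_KEY, {})
        if annotations:
            pod.metadata = pod.metadata or V1ObjectMeta()
            pod.metadata.annotations = {**(pod.metadata.annotations or {}), **annotations}
        return pod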
Great, thanks for the feedback and pointers. I'd be open to putting together a PR if this seems the right thing to do.
the resource being requested (e.g., rdma/rdma_shared_device_a: 1 or nvidia.com/hostdev: '1')
How coupled are these to host types?
I'm not sure what you mean by host types in this context, but the use-cases I'm interested in would always involve multiple hosts (K8s nodes) that are networked together via RDMA. The two network types, HostDeviceNetwork and MacvlanNetwork, are both CRDs that the Network Operator provides; they would need to be created before launching the TorchX job, and either one should be usable on any RDMA-capable host. The main difference I can see between the two is that attaching to a HostDeviceNetwork (and requesting an nvidia.com/hostdev) allocates complete access to the RDMA device to the pod, whereas attaching to a MacvlanNetwork (and requesting an rdma/my_shared_rdma_device) does not exclude other pods from subsequently accessing and sharing that device.
One way to achieve this today is by defining your custom named resource
Great, I'll try adding a custom resource in this way.
Note that currently we only set resource.limit and not resource.request (as shown in the example you mentioned) so this wouldn't work if you were stacking containers on a single host.
We've been using one container per node as well, so I think that assumption is OK to continue with.
the IPC_LOCK security context capability
The kubernetes scheduler in torchx runs with privileged = True by default (see code). In this case, is this still needed?
OK, then I don't think the IPC_LOCK capability would be needed for K8s, but has OCP (OpenShift) ever been tested/supported? That's one platform supported by the GPU Operator/Network Operator that's restricted security-wise.
the annotation that specifies the particular network to attach to (e.g., k8s.v1.cni.cncf.io/networks: rdma-net-ipam)
How coupled are these to host types? (e.g. how often would a user need to configure this?)
If the answer is "per-job" then we'd want to add this as part of the scheduler arguments. Otherwise, we can add the capability to take specific keys from torchx.specs.api.Resource.capabilities
It wouldn't change from job to job. The value is the name of the network you'd like the pod to attach to, e.g., what you see from kubectl get hostdevicenetworks. And that entity would have been created previously and would remain after the job is done. So I'll give this method a try as well.
Great, looking forward to the PR!
Since you're going to be registering custom named resources (this would be in your project), the only thing that would need to be done in the PR is to add the metadata.annotations handling to kubernetes_scheduler.py.
Follow-ups below:
I'm not sure what you mean by host types in this context
I was referring to a physical machine type.
Looks like these are semi-static "resource" configurations (e.g. once you set up a couple of resource definitions, you can reuse them for the jobs you launch onto the cluster until the cluster's resources change - new host types added, deprecated ones removed, etc). So defining a few "named resources" that the user can select when launching the job would work nicely here.
Note that while torchx currently requires the named resource static factory methods to be no-argument, if you had a valid use-case for a dynamic resource parameter, you could set this in the custom component (e.g. write your own version of dist.ddp).
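For example, a custom component could wrap dist.ddp and expose the device count as an argument. A hypothetical sketch (ddp_rdma and rdma_devices are made-up names, and the cpu/gpu/memMB values are placeholders):

    import torchx.specs as specs
    from torchx.components import dist

    def ddp_rdma(
        *script_args: str,
        script: str,
        j: str = "2x8",          # nodes x procs-per-node
        rdma_devices: int = 1,
    ) -> specs.AppDef:
        # Start from the stock ddp component, then request the RDMA device on each role;
        # Resource.devices becomes the container's resource limits in the k8s pod spec.
        app = dist.ddp(*script_args, script=script, j=j, cpu=16, gpu=8, memMB=512 * 1024)
        for role in app.roles:
            role.resource.devices["rdma/rdma_shared_device_a"] = rdma_devices
        return app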
In the context of k8s, a torchx named resource would map to one of an enumeration of the most commonly used container.resource specs (one for each). Since most users run a single container per machine, the physical machine type in the k8s cluster dictates the container.resource spec, so there is effectively a 1:1 mapping between torchx named resources and machine types in the cluster.
But, as in your case, there are valid use-cases where the user might want to define more than one named resource for a specific machine type in the cluster. And this is what I meant by "host type".
but has OCP (OpenShift) ever been tested/supported?
AFAIK no. I'm not too familiar with OCP but skimming over their docs it seems like we'd have to implement a torchx.schedulers.open_shift_scheduler.py (similar to the kubernetes_mcad_scheduler.py) since the kubernetes_scheduler in torchx today uses Volcano to do gang scheduling.
but has OCP (OpenShift) ever been tested/supported?
AFAIK no. I'm not too familiar with OCP but skimming over their docs it seems like we'd have to implement a torchx.schedulers.open_shift_scheduler.py (similar to the kubernetes_mcad_scheduler.py) since the kubernetes_scheduler in torchx today uses Volcano to do gang scheduling.
OK. If we get there in the future, the Volcano FAQ mentions some modifications needed to run on OCP.
With the network operator, when we configure a secondary network like this MacvlanNetwork, we include IPAM info:
    apiVersion: mellanox.com/v1alpha1
    kind: MacvlanNetwork
    metadata:
      name: macvlannetwork
    spec:
      networkNamespace: "default"
      master: "enp141s0f0np0"
      mode: "bridge"
      mtu: 1500
      ipam: |
        {
          "type": "whereabouts",
          "datastore": "kubernetes",
          "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
          },
          "range": "192.168.2.225/28",
          "log_file": "/var/log/whereabouts.log",
          "log_level": "info",
          "gateway": "192.168.2.1"
        }
The attached pods are then configured with an additional interface called net1 whose IP address corresponds to the range defined by the secondary network:
    [root@pod2 /]# ip -4 addr show
    . . .
    4: eth0@if800: < . . . >
        inet 192.168.32.54/32 scope global eth0
    5: net1@if26: < . . . >
        inet 192.168.2.226/28 brd 192.168.2.239 scope global net1
However, it seems that Volcano defines its service endpoints with addresses of the primary interfaces (ones that are not on the RDMA network). And TorchX then gives the MASTER_ADDR as an FQDN like bert-v56fmjp7qkbzpc-bert-0-0.bert-v56fmjp7qkbzpc.default.svc.cluster.local, which matches up with the defined endpoint.
Does that problem description make sense? What I'd like is for the service endpoints to use the IP addresses of the secondary RDMA-based network, not the primary one.
I will also ask the Network Operator team about this use-case.
It looks like TorchX facilitates communication between nodes by spinning up a K8s service and passing the name of the master pod's endpoint in MASTER_ADDR. However, according to the Network Operator team:
Currently, service name resolution for secondary networks is not supported by the network operator.
I believe this means that we'll have to pass in the IP address of the master pod a different way. How can we manually set MASTER_ADDR?
I believe this means that we'll have to pass in the IP address of the master pod a different way. How can we manually set MASTER_ADDR?
It would be more in line with Kubernetes design principles to make the change in the network operator to support name resolution. However, in case you still want to pass it in manually until that change is made, rdzv_endpoint is defined at the component level in dist.py, so when creating your own component you could specify a parameter for a static IP and pass it there.
You can see torchx/components/dist.py for an example of how the rdzv endpoint is set up there.
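A hypothetical sketch of such a component, building the torchrun arguments with a static rendezvous endpoint instead of the service-name-based MASTER_ADDR (the component name, master_ip parameter, image, port, and resource values are all placeholders, not existing torchx code):

    import torchx.specs as specs

    def ddp_static_rdzv(
        *script_args: str,
        script: str,
        master_ip: str,                    # e.g. the rank-0 pod's IP on the RDMA network
        nnodes: int = 2,
        nproc_per_node: int = 8,
        image: str = "my/train:latest",    # placeholder image
    ) -> specs.AppDef:
        return specs.AppDef(
            name="ddp-static-rdzv",
            roles=[
                specs.Role(
                    name="trainer",
                    image=image,
                    entrypoint="torchrun",
                    args=[
                        f"--nnodes={nnodes}",
                        f"--nproc_per_node={nproc_per_node}",
                        "--rdzv_backend=c10d",
                        # static endpoint instead of the generated service FQDN
                        f"--rdzv_endpoint={master_ip}:29500",
                        script,
                        *script_args,
                    ],
                    num_replicas=nnodes,
                    # request RDMA devices via a custom named resource or an inline Resource
                    resource=specs.Resource(
                        cpu=16, gpu=8, memMB=512 * 1024,
                        devices={"nvidia.com/hostdev": 1},
                    ),
                )
            ],
        )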