# Proposal: Network Devices
The spec describes Devices that are container based, but there is another class of devices, Network Devices, which are defined per network namespace. Quoting "Linux Device Drivers, Second Edition", Chapter 14, "Network Drivers":
> **Chapter 14. Network Drivers**
>
> We are now through discussing char and block drivers and are ready to move on to the fascinating world of networking. Network interfaces are the third standard class of Linux devices, and this chapter describes how they interact with the rest of the kernel.
>
> The role of a network interface within the system is similar to that of a mounted block device. A block device registers its features in the blk_dev array and other kernel structures, and it then “transmits” and “receives” blocks on request, by means of its request function. Similarly, a network interface must register itself in specific data structures in order to be invoked when packets are exchanged with the outside world.
>
> There are a few important differences between mounted disks and packet-delivery interfaces. To begin with, a disk exists as a special file in the /dev directory, whereas a network interface has no such entry point. The normal file operations (read, write, and so on) do not make sense when applied to network interfaces, so it is not possible to apply the Unix “everything is a file” approach to them. Thus, network interfaces exist in their own namespace and export a different set of operations.
>
> Although you may object that applications use the read and write system calls when using sockets, those calls act on a software object that is distinct from the interface. Several hundred sockets can be multiplexed on the same physical interface.
>
> But the most important difference between the two is that block drivers operate only in response to requests from the kernel, whereas network drivers receive packets asynchronously from the outside. Thus, while a block driver is asked to send a buffer toward the kernel, the network device asks to push incoming packets toward the kernel. The kernel interface for network drivers is designed for this different mode of operation.
>
> Network drivers also have to be prepared to support a number of administrative tasks, such as setting addresses, modifying transmission parameters, and maintaining traffic and error statistics. The API for network drivers reflects this need, and thus looks somewhat different from the interfaces we have seen so far.
Network Devices are also used to provide connectivity to network namespaces, and container runtimes commonly use the CNI specification to add a network device to the namespace and configure its networking parameters.
Runc already has the concept of a network device and how to configure it, in addition to the CNI specification: https://github.com/opencontainers/runc/tree/main/libcontainer
```go
package configs

// Network defines configuration for a container's networking stack
//
// The network configuration can be omitted from a container causing the
// container to be setup with the host's networking stack
type Network struct {
	// ...
}
```

https://github.com/opencontainers/runc/blob/main/libcontainer/configs/network.go#L3-L51
The spec already has a reference to the network in https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#network, which mentions network devices, but it does not allow specifying the network devices that will be part of the namespace.
However, there are cases in which a Kubernetes Pod or container may want to add existing Network Devices to the namespace in a declarative way. It is important to mention that Network Device configuration or creation is a non-goal and is left out of the spec on purpose.
The use cases for adding network devices to namespaces have become more common lately with the new AI accelerator devices that are presented to the system as network devices, but are not really used as ordinary network devices. Ref: https://lwn.net/Articles/955001/ (available Jan 4th without subscription).
The proposal is to be able to add existing network devices (https://docs.kernel.org/networking/netdevices.html) to a Linux namespace by referencing them, in a similar way to the existing definition of Devices.
Linux defines a structure like this one in https://man7.org/linux/man-pages/man7/netdevice.7.html:
> This man page describes the sockets interface which is used to configure network devices.
>
> Linux supports some standard ioctls to configure network devices. They can be used on any socket's file descriptor regardless of the family or type. Most of them pass an ifreq structure:
```c
struct ifreq {
    char ifr_name[IFNAMSIZ]; /* Interface name */
    union {
        struct sockaddr ifr_addr;
        struct sockaddr ifr_dstaddr;
        struct sockaddr ifr_broadaddr;
        struct sockaddr ifr_netmask;
        struct sockaddr ifr_hwaddr;
        short           ifr_flags;
        int             ifr_ifindex;
        int             ifr_metric;
        int             ifr_mtu;
        struct ifmap    ifr_map;
        char            ifr_slave[IFNAMSIZ];
        char            ifr_newname[IFNAMSIZ];
        char           *ifr_data;
    };
};
```
though we only need the index or the name to reference one interface:
> Normally, the user specifies which device to affect by setting ifr_name to the name of the interface or ifr6_ifindex to the index of the interface. All other members of the structure may share memory.
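To illustrate that a name or an index is all that is needed to reference an interface, here is a minimal Go sketch using only the standard library; the interface name `eth0` is just a placeholder:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Resolve a host network device purely by its name; the kernel-assigned
	// ifindex returned here is the other stable way to refer to it.
	iface, err := net.InterfaceByName("eth0") // placeholder name
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Printf("name=%s index=%d mtu=%d\n", iface.Name, iface.Index, iface.MTU)
}
```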
## <a name="configLinuxNetDevices" />NetDevices
**`netDevices`** (array of objects, OPTIONAL) list of network devices that MUST be available in the container network namespace.

Each entry has the following structure:

* **`name`** *(string, REQUIRED)* - name of the network device on the host.
* **`properties`** *(object, OPTIONAL)* - properties of the network device, per https://man7.org/linux/man-pages/man7/netdevice.7.html, applied in the container network namespace. It has the following structure:
    * **`name`** *(string, OPTIONAL)* - name of the network device in the network namespace.
    * **`address`** *(string, OPTIONAL)* - address of the network device in the network namespace.
    * **`mask`** *(string, OPTIONAL)* - mask of the network device in the network namespace.
    * **`mtu`** *(uint16, OPTIONAL)* - MTU size of the network device in the network namespace.
### Example
```json
"netDevices": [
    {
        "name": "eth0",
        "properties": {
            "name": "ns1",
            "address": "192.168.0.1",
            "mask": "255.255.255.0",
            "mtu": 1500
        }
    },
    {
        "name": "ens4"
    }
]
```
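For runtimes written in Go, the field could map to Go bindings along the following lines. This is only an illustrative sketch: the type and field names are hypothetical and are not taken from runtime-spec PR #1240 or the runc prototype.

```go
// Package placement alongside the existing Go bindings is assumed, not prescribed.
package specs

// LinuxNetDevice is a hypothetical binding for one entry of "netDevices":
// an existing host network device to be moved into the container's
// network namespace.
type LinuxNetDevice struct {
	// Name of the network device on the host (REQUIRED).
	Name string `json:"name"`
	// Properties of the device inside the container network namespace (OPTIONAL).
	Properties *LinuxNetDeviceProperties `json:"properties,omitempty"`
}

// LinuxNetDeviceProperties mirrors the optional per-device properties.
type LinuxNetDeviceProperties struct {
	// Name of the device inside the network namespace.
	Name string `json:"name,omitempty"`
	// Address of the device inside the network namespace.
	Address string `json:"address,omitempty"`
	// Mask of the device inside the network namespace.
	Mask string `json:"mask,omitempty"`
	// MTU of the device inside the network namespace.
	MTU uint16 `json:"mtu,omitempty"`
}
```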
Proposal: https://github.com/opencontainers/runtime-spec/pull/1240

runc prototype: https://github.com/opencontainers/runc/compare/main...aojea:runc:netdevices?expand=1
References:
- https://docs.kernel.org/networking/netdevices.html
- https://docs.kernel.org/networking/devlink/index.html#interface-documentation
- https://sstar1314.github.io/Linux%20Networking%20Internals3/
- https://www.cs.bilkent.edu.tr/~korpe/courses/cs342-spring2004/linux-kernel/dd/drivers.html
/cc @samuelkarp
Could you explain why CNI can't be extended to support your use case?
I also wonder if OCI hooks can be used.
> Could you explain why CNI can't be extended to support your use case?
CNI is about network interface creation and configuration (https://github.com/containernetworking/cni/blob/main/SPEC.md#cni-operations):
> ADD: Add container to network, or apply modifications
>
> A CNI plugin, upon receiving an ADD command, should either create the interface defined by CNI_IFNAME inside the container at CNI_NETNS, or adjust the configuration of the interface defined by CNI_IFNAME inside the container at CNI_NETNS.
CNI is also an implementation detail of container runtimes and has some limitations; in Kubernetes, projects use annotations and different out-of-band methods to pass this additional information for other interfaces. More on https://github.com/containernetworking/cni/issues/891.
In Kubernetes, Pods use `devices` at the container level, and that maps to the OCI specification.

I think that most of the problems in this area come from trying to conflate network devices and network configuration. My proposal is to decouple this: add a new `netDevice` field to Pods at the Pod level that maps to the OCI specification. IMHO this will solve the Pod and container multi-interface problem elegantly, leaving the configuration of these netDevices to CNI, the user application, or the network plugins.
> I also wonder if OCI hooks can be used.
I'm not well versed in this area. I had this conversation with @samuelkarp, and he thought it was worth at least opening this debate.
> I also wonder if OCI hooks can be used.
I think they can. But it moves control from a declarative model (like the rest of the OCI spec) to imperative via the hook implementation. If the goal for the runtime spec is to allow a bundle author to specify the attributes of the container and for a runtime (such as runc) to implement, I do think it'd be nice to include some aspects of networking in that as well.
However, networking is fairly complex. @aojea I'm still not entirely clear on exactly what you'd like to see here (e.g., just interface moves? veth creation? setting up routes? etc). Can you elaborate a bit more?
> @aojea I'm still not entirely clear on exactly what you'd like to see here (e.g., just interface moves? veth creation? setting up routes? etc). Can you elaborate a bit more?
Just interface moves: being able to reference any netdevice on the host and move it into the container network namespace.
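A rough sketch of what such an interface move amounts to for a runtime, assuming the github.com/vishvananda/netlink library (already used by runc and CNI plugins); the function name and the way the namespace path is obtained are illustrative, not part of the proposal:

```go
package netdev

import (
	"fmt"
	"os"

	"github.com/vishvananda/netlink"
)

// moveNetDevice moves an existing host interface, referenced by name, into
// the network namespace referenced by nsPath (e.g. /proc/<pid>/ns/net or a
// bind-mounted netns file). Renaming and addressing are left to a later step
// performed inside the target namespace.
func moveNetDevice(hostName, nsPath string) error {
	link, err := netlink.LinkByName(hostName)
	if err != nil {
		return fmt.Errorf("lookup %q: %w", hostName, err)
	}
	nsFile, err := os.Open(nsPath)
	if err != nil {
		return fmt.Errorf("open netns %q: %w", nsPath, err)
	}
	defer nsFile.Close()
	// Equivalent of `ip link set dev <hostName> netns <nsPath>`.
	return netlink.LinkSetNsFd(link, int(nsFile.Fd()))
}
```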
After spending a few weeks exploring different options, I can see how all the new patterns enabled by CDI (https://github.com/cncf-tags/container-device-interface) can benefit Kubernetes and all container environments by instructing runtimes to move a specific netdevice, referenced by name, into the runtime's network namespace. @elezar WDYT?
Right now you have to do an exotic dance between annotations and out-of-band operations just to get the information to the CNI plugin so it can move one interface into the network namespace. If container runtimes can declaratively move the netdevice specified by name into the network namespace, everything becomes much simpler.
My main use case is to model GPUs and their relation to the high-speed NICs used for GPUDirect:
```
        GPU0   GPU1   mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity  NUMA Affinity
GPU0     X     SYS    NODE   NODE   SYS    SYS    0,2,4,6,8,10  0
GPU1    SYS     X     SYS    SYS    PHB    PHB    1,3,5,7,9,11  1
mlx5_0  NODE   SYS     X     PIX    SYS    SYS
mlx5_1  NODE   SYS    PIX     X     SYS    SYS
mlx5_2  SYS    PHB    SYS    SYS     X     PIX
mlx5_3  SYS    PHB    SYS    SYS    PIX     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
It is complex to model this relation in systems like Kubernetes, since traditionally NICs are treated as part of CNI; but in this case the NICs are just netdevices associated with the GPUs, consumed directly by the GPUs rather than by the Kubernetes cluster or its users.
If the OCI spec supports "netdevices", it becomes possible to use mechanisms like CDI (https://github.com/cncf-tags/container-device-interface) to mutate the OCI spec and add this bundle declaratively to the Pod. A user can create a Pod or a Container requesting one or multiple GPUs, and the CDI driver can mutate the OCI spec to add the associated NICs/netdevices, without the user having to do error-prone manual plumbing; device drivers can always check the node topology and assign the best NIC or NICs for each case.
cc: @klueska
/cc
/cc
> Runc already has the concept of a network device and how to configure it, in addition to the CNI specification
runc's Network type (part of libcontainer) does not seem to be used by the code related to bundle parsing; it appears (from git blame) to be from January & February 2015, before the OCI was established (and possibly before libcontainer was even factored out of Docker). It does have a fairly decent number of parameters, though it appears to be focused on interface creation (new loopback or veth pair) rather than moves.
Are you proposing that we add libcontainer's Network type to the OCI bundle, or that we add a new structure defining existing host interfaces that are expected to be moved (and possibly renamed) to a container's network namespace?
runc Network type:

```go
type Network struct {
	// Type sets the networks type, commonly veth and loopback
	Type string `json:"type"`
	// Name of the network interface
	Name string `json:"name"`
	// The bridge to use.
	Bridge string `json:"bridge"`
	// MacAddress contains the MAC address to set on the network interface
	MacAddress string `json:"mac_address"`
	// Address contains the IPv4 and mask to set on the network interface
	Address string `json:"address"`
	// Gateway sets the gateway address that is used as the default for the interface
	Gateway string `json:"gateway"`
	// IPv6Address contains the IPv6 and mask to set on the network interface
	IPv6Address string `json:"ipv6_address"`
	// IPv6Gateway sets the ipv6 gateway address that is used as the default for the interface
	IPv6Gateway string `json:"ipv6_gateway"`
	// Mtu sets the mtu value for the interface and will be mirrored on both the host and
	// container's interfaces if a pair is created, specifically in the case of type veth
	// Note: This does not apply to loopback interfaces.
	Mtu int `json:"mtu"`
	// TxQueueLen sets the tx_queuelen value for the interface and will be mirrored on both the host and
	// container's interfaces if a pair is created, specifically in the case of type veth
	// Note: This does not apply to loopback interfaces.
	TxQueueLen int `json:"txqueuelen"`
	// HostInterfaceName is a unique name of a veth pair that resides on in the host interface of the
	// container.
	HostInterfaceName string `json:"host_interface_name"`
	// HairpinMode specifies if hairpin NAT should be enabled on the virtual interface
	// bridge port in the case of type veth
	// Note: This is unsupported on some systems.
	// Note: This does not apply to loopback interfaces.
	HairpinMode bool `json:"hairpin_mode"`
}
```
> Are you proposing that we add libcontainer's Network type to the OCI bundle, or that we add a new structure defining existing host interfaces that are expected to be moved (and possibly renamed) to a container's network namespace?
The latter. The Network type and network configuration are exactly what I want to avoid; they are unbounded and contentious... On the other side, moving host interfaces to container namespaces is IMHO well defined and solves important use cases very easily. My reasoning is that, in the same way block devices are moved into the container namespace, network devices can be moved "declaratively" too. It should be possible to define some of the properties of `struct ifreq`, such as the name and address, but I really would like to avoid any dynamic configuration à la CNI... that should be solved at another layer.
Especially interesting is the case where some devices have both an RDMA device and a netdevice; this will solve that problem really well, instead of having to split the responsibility for the RDMA device to the OCI runtime and for the netdevice to CNI, which is always going to be racy.
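To make the distinction between static properties and dynamic CNI-style configuration concrete, here is a sketch of how the optional properties (name, address, mask, mtu) could be applied from inside the target namespace after the move, again assuming github.com/vishvananda/netlink and github.com/vishvananda/netns; the function and parameter names are illustrative and the address handling is IPv4-only for brevity:

```go
package netdev

import (
	"net"

	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

// applyProperties opens a netlink handle scoped to the network namespace at
// nsPath and applies the declared properties to the already-moved device.
func applyProperties(nsPath, curName, newName, address, mask string, mtu int) error {
	nsHandle, err := netns.GetFromPath(nsPath)
	if err != nil {
		return err
	}
	defer nsHandle.Close()

	nl, err := netlink.NewHandleAt(nsHandle)
	if err != nil {
		return err
	}
	defer nl.Close()

	link, err := nl.LinkByName(curName)
	if err != nil {
		return err
	}
	// A device arrives in the new namespace in the down state, so it can be
	// renamed before being brought up.
	if newName != "" && newName != curName {
		if err := nl.LinkSetName(link, newName); err != nil {
			return err
		}
	}
	if address != "" {
		addr := &netlink.Addr{IPNet: &net.IPNet{
			IP:   net.ParseIP(address),
			Mask: net.IPMask(net.ParseIP(mask).To4()),
		}}
		if err := nl.AddrAdd(link, addr); err != nil {
			return err
		}
	}
	if mtu > 0 {
		if err := nl.LinkSetMTU(link, mtu); err != nil {
			return err
		}
	}
	return nl.LinkSetUp(link)
}
```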
@aojea it might be worth running this by Kata or the other virtualized runtimes.
> @aojea it might be worth running this by Kata or the other virtualized runtimes.
Are those implementing the OCI runtime spec?
> @aojea it might be worth running this by Kata or the other virtualized runtimes.
>
> Are those implementing the OCI runtime spec?
Yes, the communication between the runtimes is via OCI. The CreateTask API in containerd uses the runtime OCI spec to communicate with the lower-level runtimes. Unless something has changed :-P @mikebrow keep me honest ha!