antrea
Bugs when a Pod has multiple SR-IOV network annotations
I noticed three issues when deploying two Pods, each with two SR-IOV interfaces, using the manifest below. The two extra interfaces are created correctly, but the order in which the VFs are bound may differ between the two Pods.
- Issue 1
Suppose two VFs are attached to each K8s Node: ens64 with VLAN tag 1167 and ens65 with VLAN tag 1168. In Pod pod-a-2nics, Antrea may create eth1 and bind VF ens64 to it, while in Pod pod-b-2nics it may create eth1 and bind VF ens65 to it. The two Pods then cannot communicate with each other over that interface because the VLAN tags do not match.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net-1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: broadcom.com/brcm_sriov_bnxt
spec:
  config: '{
    "cniVersion": "0.3.0",
    "type": "antrea",
    "networkType": "sriov",
    "vlan": 1168,
    "ipam": {
      "type": "antrea",
      "ippools": ["pool1"]
    }
  }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net-2
  annotations:
    k8s.v1.cni.cncf.io/resourceName: broadcom.com/brcm_sriov_bnxt
spec:
  config: '{
    "cniVersion": "0.3.0",
    "type": "antrea",
    "networkType": "sriov",
    "vlan": 1167,
    "ipam": {
      "type": "antrea",
      "ippools": ["pool2"]
    }
  }'
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-a-2nics
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        { "name" : "sriov-net-1", "interface": "eth1" },
        { "name" : "sriov-net-2", "interface": "eth2" }
      ]
spec:
  nodeName: tea-workers-1
  containers:
  - name: netshoot
    image: "nicolaka/netshoot"
    command: ['tail', '-f', '/dev/null']
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        broadcom.com/brcm_sriov_bnxt: "2"
      limits:
        broadcom.com/brcm_sriov_bnxt: "2"
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-b-2nics
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        { "name" : "sriov-net-1", "interface": "eth1" },
        { "name" : "sriov-net-2", "interface": "eth2" }
      ]
spec:
  nodeName: tea-workers-2
  containers:
  - name: netshoot
    image: "nicolaka/netshoot"
    command: ['tail', '-f', '/dev/null']
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        broadcom.com/brcm_sriov_bnxt: "2"
      limits:
        broadcom.com/brcm_sriov_bnxt: "2"
- Issue 2
When these two Pods are deleted, the two VFs are released back to the Node's interface pool, but their interface names are no longer the same as before.
Before the Pods are created:
3: ens64: <BROADCAST,MULTICAST> mtu 1500 qdisc prio state DOWN group default qlen 1000
    link/ether 00:50:56:9d:48:43 brd ff:ff:ff:ff:ff:ff
    altname enp3s0
4: ens65: <BROADCAST,MULTICAST> mtu 1500 qdisc prio state DOWN group default qlen 1000
    link/ether 00:50:56:9d:87:ef brd ff:ff:ff:ff:ff:ff
    altname enp3s1
After the Pods are deleted:
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc prio state DOWN group default qlen 1000
    link/ether 00:50:56:9d:48:43 brd ff:ff:ff:ff:ff:ff
    altname enp3s0
    altname ens64
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc prio state DOWN group default qlen 1000
    link/ether 00:50:56:9d:87:ef brd ff:ff:ff:ff:ff:ff
    altname enp3s1
    altname ens65
- Issue 3
Once the VF interfaces have been renamed to eth1 and eth2 as above, VF allocation fails when users create Pods with two SR-IOV interfaces. The default interface names in a Pod are eth1, eth2, etc., so after Antrea successfully allocates one VF to the Pod and then tries to allocate a second one, the second VF's host name may conflict with an interface that already exists in the Pod:
E0417 05:41:08.764366 1 controller.go:462] "Secondary interface configuration failed" err="SRIOV Interface creation failed: failed to move VF eth1 to container netns /host/var/run/netns/cni-f1ac076d-060e-01e5-6b4f-72c297886168: failed to move VF device eth1 to netns: \"file exists\"" Pod="kube-system/pod-a-2nics" interface="eth2" networkType="sriov"
cc @tnqn @antoninbas @jianjuns
@tnqn I suppose the first issue might be resolved if the SR-IOV interfaces are in trunk mode and antrea-agent does the VLAN tagging? https://github.com/antrea-io/antrea/issues/7110
Issue 1 may be invalid usage if I understand it correctly. First, providing the vlan value here doesn't do anything currently (as you pointed out this is dependent on #7110). Then, if you want a specific VF to be used, because the 2 matching VFs are on different networks, different resource names should be used (cannot be broadcom.com/brcm_sriov_bnxt for both). You would have to define 2 different resources and use the deviceID selector in the resource definition. I am not aware of any standard alternative.
> I suppose the first issue might be resolved if the SR-IOV interfaces are in trunk mode and antrea-agent does the VLAN tagging?
That sounds correct to me, but here you are relying on the fact that it doesn't matter anymore which VF is mapped to which Pod interface. I think you can have a situation where ens64 and ens65 map to different PFs and / or are on different isolated networks.
Issue 2 (and therefore issue 3) looks like a bug on our side. A good reference is the host-device plugin which is essentially doing what we are doing in this case. When moving the interface back to the host netns, they restore the name based on the alias: https://github.com/containernetworking/plugins/blob/d0d20a9e2203ba462e6d8251072ff54595b3b469/plugins/main/host-device/host-device.go#L424-L429
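The alias-based approach can be sketched as follows. This is a simplified model that assumes the host name is saved in the device alias when the device is handed to the Pod and read back when it returns; it is not the host-device plugin's or Antrea's actual code. The `link` type and the `moveIn` and `moveOut` helpers are hypothetical, and clearing the alias on return is a choice made for this sketch only.

```go
package main

import "fmt"

// link is a toy stand-in for a network device, with only the fields this sketch needs.
type link struct {
	Name  string
	Alias string
}

// moveIn models handing the device to a Pod: remember the host name in the
// alias, then take the Pod-side interface name.
func moveIn(l *link, podIfName string) {
	l.Alias = l.Name
	l.Name = podIfName
}

// moveOut models returning the device to the host netns: restore the name
// that was saved in the alias.
func moveOut(l *link) error {
	if l.Alias == "" {
		return fmt.Errorf("device %s has no saved host name", l.Name)
	}
	l.Name = l.Alias
	l.Alias = ""
	return nil
}

func main() {
	vf := &link{Name: "ens64"}
	moveIn(vf, "eth1")
	fmt.Printf("in Pod: name=%s alias=%s\n", vf.Name, vf.Alias)
	if err := moveOut(vf); err != nil {
		fmt.Println("restore failed:", err)
		return
	}
	fmt.Printf("on host: name=%s\n", vf.Name)
}
```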
VLAN ID is for VLAN network type only. Even if you specify VLAN ID with SRIOV network type, Antrea will not do anything with it.
Agreed we should fix issue 2 and 3, and restore the original interface name (just like what we do when restoring OVS uplink interfaces back).
> VLAN ID is for VLAN network type only. Even if you specify VLAN ID with SRIOV network type, Antrea will not do anything with it.
Quan opened #7110 to configure the VLAN for the VF, presumably by calling https://pkg.go.dev/github.com/vishvananda/netlink#LinkSetVfVlan from the Antrea Agent.
Oh ok. I did not know that. Sure, tagging can be a useful feature.
@antoninbas yes, there are some limitations when the SR-IOV device plugin detects VFs on a VM (see https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/tree/master?tab=readme-ov-file#configure-device-plugin-extended-selectors-in-virtual-environments). We need to use pciAddress as a selector to create different resources when the interfaces are in different VLANs.
The first issue is considered invalid; the other two are resolved by PR 7144.