
still need help install sriov-network-operator

Open hymgg opened this issue 1 year ago • 57 comments

Continuing from issue #584,

@adrianchiris Sorry for the late followup.

Installing with helm was much easier than following the quick start steps. However, it only brought up the sriov-network-operator pod. According to the quick start guide, shouldn't there be a sriov-network-config-daemon too?

$ ls
Chart.yaml  crds  README.md  templates  values.yaml

$ helm3 install -n sriov-network-operator --create-namespace --wait sriov-network-operator ./

$ kubectl get all -n sriov-network-operator
NAME                                          READY   STATUS    RESTARTS   AGE
pod/sriov-network-operator-845dc5dffc-4hvsb   1/1     Running   0          20m

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sriov-network-operator   1/1     1            1           20m

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/sriov-network-operator-845dc5dffc   1         1         1       20m

$ kubectl logs deployment.apps/sriov-network-operator -n sriov-network-operator|tail -5
2024-03-29T05:02:53.668128868Z INFO controller/controller.go:119 default SriovOperatorConfig object not found, cannot reconcile SriovNetworkNodePolicies. Requeue. {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "ed902977-3a07-4cea-bb20-0cefbff5ea9e"}
2024-03-29T05:02:58.668612364Z INFO controller/controller.go:119 Reconciling {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "98591413-4718-4d3c-abaf-14d3dcf1c43c"}
2024-03-29T05:02:58.668676704Z INFO controller/controller.go:119 default SriovOperatorConfig object not found, cannot reconcile SriovNetworkNodePolicies. Requeue. {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "98591413-4718-4d3c-abaf-14d3dcf1c43c"}
2024-03-29T05:03:03.669236989Z INFO controller/controller.go:119 Reconciling {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "2a0835ad-a117-4caa-8ace-9afc525b6d70"}
2024-03-29T05:03:03.669309844Z INFO controller/controller.go:119 default SriovOperatorConfig object not found, cannot reconcile SriovNetworkNodePolicies. Requeue. {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "2a0835ad-a117-4caa-8ace-9afc525b6d70"}

Additional info, may not be relevant.

$ kubectl label ns sriov-network-operator pod-security.kubernetes.io/enforce=privileged
$ kubectl get node -l node-role.kubernetes.io/worker
NAME                                STATUS   ROLES    AGE    VERSION
mtx-dell4-bld01.dc1.matrixxsw.com   Ready    worker   264d   v1.26.6
mtx-dell4-bld02.dc1.matrixxsw.com   Ready    worker   264d   v1.26.6
mtx-dell4-bld03.dc1.matrixxsw.com   Ready    worker   264d   v1.26.6

Shall we / how do we get sriov-network-config-daemon installed? Thanks. -Jessica

Originally posted by @hymgg in https://github.com/k8snetworkplumbingwg/sriov-network-operator/issues/584#issuecomment-2026657454

hymgg avatar Apr 02 '24 06:04 hymgg

Hello, Can somebody help complete the sriov-network-operator installation? Is there another way?

hymgg avatar Apr 08 '24 19:04 hymgg

Hey, seems that you don't have the required SriovOperatorConfig named default.

It can be created with helm using the following parameters: https://github.com/k8snetworkplumbingwg/sriov-network-operator/tree/master/deployment/sriov-network-operator#sr-iov-operator-configuration-parameters

rollandf avatar Apr 09 '24 08:04 rollandf

@rollandf Thank you. I set sriovOperatorConfig.deploy to true in the default values.yaml and ran helm upgrade; the config daemon is up.
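For anyone following along, here is a minimal sketch of the same change done with an override file instead of editing the chart's default values.yaml. The key sriovOperatorConfig.deploy is the one named in this comment; the file name values-override.yaml is an arbitrary, hypothetical choice, and the release name, namespace and chart path in the commented helm line simply mirror the install command from the first post.

```shell
# Write a small override file that asks the chart to create the default
# SriovOperatorConfig. The file name is hypothetical, not required by the chart.
cat > values-override.yaml <<'EOF'
sriovOperatorConfig:
  deploy: true
EOF

# Then, from the chart directory (needs cluster access, so left commented here):
# helm3 upgrade -n sriov-network-operator sriov-network-operator ./ -f values-override.yaml
# kubectl get pods -n sriov-network-operator
```

Keeping the override in its own file leaves the chart's values.yaml untouched, which makes later chart updates easier to diff.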

Compared to the example in the quick-start, we're still missing the service object. Is that expected? Shall we create the svc manually?

$ kubectl --context dell4 get all -n sriov-network-operator
NAME                                          READY   STATUS    RESTARTS   AGE
pod/sriov-network-config-daemon-sxf4b         1/1     Running   0          25s
pod/sriov-network-config-daemon-vzzg2         1/1     Running   0          25s
pod/sriov-network-config-daemon-xn9rq         1/1     Running   0          25s
pod/sriov-network-operator-845dc5dffc-4hvsb   1/1     Running   0          11d

NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                             AGE
daemonset.apps/sriov-network-config-daemon   3         3         3       3            3           kubernetes.io/os=linux,node-role.kubernetes.io/worker=   25s

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sriov-network-operator   1/1     1            1           11d

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/sriov-network-operator-845dc5dffc   1         1         1       11d

hymgg avatar Apr 09 '24 23:04 hymgg

I don't think that the service is needed. Seems an issue in doc actually.

rollandf avatar Apr 10 '24 05:04 rollandf

@rollandf Thank you.

Next, with the initial sriovnetworknodestates.sriovnetwork.openshift.io as:

spec:
  dpConfigVersion: 2ea02bc305b6b7849ae7535c713eeb8e
status:
  interfaces:
  - deviceID: 158a
    driver: i40e
    linkSpeed: 25000 Mb/s
    linkType: ETH
    mac: "12:21:04:20:01:02"
    mtu: 1500
    name: p1p1
    pciAddress: 0000:3b:00.0
    vendor: "8086"
  - deviceID: 158a
    driver: i40e
    linkSpeed: 25000 Mb/s
    linkType: ETH
    mac: "12:21:04:20:01:03"
    mtu: 1500
    name: ens1f1
    pciAddress: 0000:3b:00.1
    vendor: "8086"

I created a SriovNetworkNodePolicy,

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-ens1f1
  namespace: sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/worker:
    #feature.node.kubernetes.io/network-sriov.capable: "true"
  resourceName: ens1f1
  priority: 99
  #mtu: 9000
  numVfs: 8
  nicSelector:
    deviceID: "158a"
    rootDevices:
    - 0000:3b:00.1
    vendor: "8086"
  deviceType: netdevice

It triggered creation of sriov-device-plugin, but the operator pod went into CrashLoopBackOff state, logs reported "panic: runtime error: invalid memory address or nil pointer dereference" and "[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a7004d]"

How to fix this?

[mtx@mtx-dell4-bld08 sriov-network-operator]$ kubectl get all -n sriov-network-operator
NAME                                          READY   STATUS             RESTARTS        AGE
pod/sriov-device-plugin-mqr84                 1/1     Running            0               13m
pod/sriov-device-plugin-rc5jh                 1/1     Running            0               13m
pod/sriov-device-plugin-zl5m6                 1/1     Running            0               13m
pod/sriov-network-config-daemon-sxf4b         1/1     Running            0               27h
pod/sriov-network-config-daemon-vzzg2         1/1     Running            0               27h
pod/sriov-network-config-daemon-xn9rq         1/1     Running            0               27h
pod/sriov-network-operator-845dc5dffc-4hvsb   0/1     CrashLoopBackOff   8 (3m37s ago)   12d

NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                             AGE
daemonset.apps/sriov-device-plugin           3         3         3       3            3           kubernetes.io/os=linux,node-role.kubernetes.io/worker=   13m
daemonset.apps/sriov-network-config-daemon   3         3         3       3            3           kubernetes.io/os=linux,node-role.kubernetes.io/worker=   27h

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sriov-network-operator   0/1     1            0           12d

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/sriov-network-operator-845dc5dffc   1         1         0       12d

New state of sriovnetworknodestates.sriovnetwork.openshift.io:

spec:
  interfaces:
  - name: ens1f1
    numVfs: 8
    pciAddress: 0000:3b:00.1
    vfGroups:
    - deviceType: netdevice
      policyName: policy-ens1f1
      resourceName: ens1f1
      vfRange: 0-7
status:
  interfaces:
  - deviceID: 158a
    driver: i40e
    linkSpeed: 25000 Mb/s
    linkType: ETH
    mac: "12:21:04:20:01:02"
    mtu: 1500
    name: p1p1
    pciAddress: 0000:3b:00.0
    vendor: "8086"
  - deviceID: 158a
    driver: i40e
    linkSpeed: 25000 Mb/s
    linkType: ETH
    mac: "12:21:04:20:01:03"
    mtu: 1500
    name: ens1f1
    pciAddress: 0000:3b:00.1
    vendor: "8086"
  lastSyncError: cannot configure sriov interfaces
  syncStatus: Failed

sriov-network-operator-845dc5dffc-4hvsb.log

Thanks. -Jessica

hymgg avatar Apr 11 '24 01:04 hymgg

Hi @hymgg, there is a bug here; you can see the PR https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/679

In general, please check the nodeSelector in your SriovNetworkNodePolicy. Can you run

kubectl -n sriov-network-operator get sriovnetworknodepolicy -oyaml

I think your node selector is not right: it ends up as an empty selector, which triggers the bug.

SchSeba avatar Apr 11 '24 04:04 SchSeba

@SchSeba The SriovNetworkNodePolicy spec was pasted in my last comment. The original nodeSelector from the quickstart example was feature.node.kubernetes.io/network-sriov.capable: "true", but I didn't find any node with that label, so I changed it to node-role.kubernetes.io/worker:. After that, daemonset.apps/sriov-device-plugin found 3 nodes and started 3 pods. Should I instead keep the original nodeSelector and label the nodes accordingly?

Thanks. -Jessica policy-ens1f1.yaml.txt

hymgg avatar Apr 11 '24 06:04 hymgg

The yaml you shared is from a local file. I want you to show me the one that is in the k8s API server.

Please run kubectl -n sriov-network-operator get sriovnetworknodepolicy -oyaml and show me the output.

SchSeba avatar Apr 11 '24 06:04 SchSeba

sriovnetworknodepolicy.yaml.txt

nodeSelector is empty in attached output

hymgg avatar Apr 11 '24 06:04 hymgg

yep that was my expectation

I think the label you wanted is something like:

nodeSelector:
  node-role.kubernetes.io/worker: ""
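Some background on why the selector came out empty: in YAML, a key with nothing after the colon has a null value rather than an empty string, and the attached API server output showed that the entry did not survive into the stored object. The sketch below writes out the policy posted earlier in this thread with only the nodeSelector value changed to an explicit empty string. The file name matches the earlier attachment; the commented kubectl lines need cluster access and are shown only as a suggestion.

```shell
# Corrected policy: same fields as posted earlier in the thread,
# but the label value is an explicit empty string instead of YAML null.
cat > policy-ens1f1.yaml <<'EOF'
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-ens1f1
  namespace: sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  resourceName: ens1f1
  priority: 99
  numVfs: 8
  nicSelector:
    deviceID: "158a"
    rootDevices:
    - 0000:3b:00.1
    vendor: "8086"
  deviceType: netdevice
EOF

# Apply it and confirm what the API server actually stored:
# kubectl apply -f policy-ens1f1.yaml
# kubectl -n sriov-network-operator get sriovnetworknodepolicy policy-ens1f1 -o jsonpath='{.spec.nodeSelector}'
```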

SchSeba avatar Apr 11 '24 07:04 SchSeba

@SchSeba Thank you. I corrected the nodeSelector in the sriovnetworknodepolicy and the operator pod is back in Running state, but the sriovnetworknodestates still report "cannot configure sriov interfaces" and there are no VFs. Is there anything to check on the hardware side?

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
  annotations:
    sriovnetwork.openshift.io/current-state: Idle
    sriovnetwork.openshift.io/desired-state: Idle
  creationTimestamp: "2024-04-09T22:46:21Z"
  generation: 7
  name: mtx-dell4-bld01.dc1.matrixxsw.com
  namespace: sriov-network-operator
  ownerReferences:
  - apiVersion: sriovnetwork.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: SriovOperatorConfig
    name: default
    uid: ea87a10e-5906-4d3b-a8b0-f67783bc36b6
  resourceVersion: "172487551"
  uid: 52ebc9f9-d110-4915-ba9c-65b53b79c4b0
spec:
  interfaces:
  - name: ens1f1
    numVfs: 8
    pciAddress: 0000:3b:00.1
    vfGroups:
    - deviceType: netdevice
      policyName: policy-ens1f1
      resourceName: ens1f1
      vfRange: 0-7
status:
  interfaces:
  - deviceID: 158a
    driver: i40e
    linkSpeed: 25000 Mb/s
    linkType: ETH
    mac: "12:21:04:20:01:02"
    mtu: 1500
    name: p1p1
    pciAddress: 0000:3b:00.0
    vendor: "8086"
  - deviceID: 158a
    driver: i40e
    linkSpeed: 25000 Mb/s
    linkType: ETH
    mac: "12:21:04:20:01:03"
    mtu: 1500
    name: ens1f1
    pciAddress: 0000:3b:00.1
    vendor: "8086"
  lastSyncError: cannot configure sriov interfaces
  syncStatus: Failed

Log from one of the sriov-device-plugin pods (the other 2 are similar):

I0411 19:59:15.007473 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0411 19:59:15.007514 1 main.go:46] resource manager reading configs
I0411 19:59:15.007539 1 manager.go:86] raw ResourceList: {"resourceList":[{"resourceName":"ens1f1","selectors":{"vendors":["8086"],"devices":["154c"],"rootDevices":["0000:3b:00.1"],"IsRdma":false,"NeedVhostNet":false},"SelectorObj":null}]}
I0411 19:59:15.007632 1 factory.go:211] *types.NetDeviceSelectors for resource ens1f1 is [0xc0004ad7a0]
I0411 19:59:15.007641 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:ens1f1 DeviceType:netDevice ExcludeTopology:false Selectors:0xc0004b3350 AdditionalInfo:map[] SelectorObjs:[0xc0004ad7a0]}]
I0411 19:59:15.007677 1 manager.go:217] validating resource name "openshift.io/ens1f1"
I0411 19:59:15.007682 1 main.go:62] Discovering host devices
I0411 19:59:15.087670 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:3b:00.0 02 Intel Corporation Ethernet Controller XXV710 for 25GbE ...
I0411 19:59:15.088179 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:3b:00.1 02 Intel Corporation Ethernet Controller XXV710 for 25GbE ...
I0411 19:59:15.088509 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:3b:00.0 02 Intel Corporation Ethernet Controller XXV710 for 25GbE ...
I0411 19:59:15.088540 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:3b:00.1 02 Intel Corporation Ethernet Controller XXV710 for 25GbE ...
I0411 19:59:15.088549 1 main.go:68] Initializing resource servers
I0411 19:59:15.088555 1 manager.go:117] number of config: 1
I0411 19:59:15.088560 1 manager.go:121] Creating new ResourcePool: ens1f1
I0411 19:59:15.088565 1 manager.go:122] DeviceType: netDevice
I0411 19:59:15.088807 1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0411 19:59:15.088820 1 manager.go:142] no devices in device pool, skipping creating resource server for ens1f1
I0411 19:59:15.088826 1 main.go:74] Starting all servers...
I0411 19:59:15.088832 1 main.go:79] All servers started.
I0411 19:59:15.088839 1 main.go:80] Listening for term signals

hymgg avatar Apr 11 '24 22:04 hymgg

sriov-network-operator-845dc5dffc-4hvsb (2).log

operator pod log.

hymgg avatar Apr 11 '24 22:04 hymgg

Is there a way to debug this issue (failed to configure sriov on interface)? The worker nodes are running k8s 1.26 on RH8.6.

hymgg avatar Apr 15 '24 22:04 hymgg

Hi, as you can see, it's an Intel NIC, and the status of the sriovNetworkNodeState has no maxVf. That suggests to me that SR-IOV is not enabled in the BIOS of the machine.
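This can also be cross-checked directly on the node, independent of the operator, by reading the kernel's sriov_totalvfs attribute for the PF. A rough sketch follows; the function name total_vfs and its optional second argument are made up for illustration (the argument exists only so the function can be pointed at a test directory instead of the real sysfs). A missing file or a value of 0 usually means the SR-IOV capability is not exposed to the OS, though the exact behaviour may vary by platform.

```shell
# total_vfs PCI_ADDR [SYSFS_PCI_ROOT]
# Prints the maximum number of VFs the PF advertises, or "none" when the
# sriov_totalvfs attribute is absent (SR-IOV capability not exposed).
total_vfs() {
  addr=$1
  root=${2:-/sys/bus/pci/devices}
  f="$root/$addr/sriov_totalvfs"
  if [ -r "$f" ]; then
    cat "$f"
  else
    echo "none"
  fi
}

# Example on a worker node, using the PF address from this thread:
# total_vfs 0000:3b:00.1
```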

SchSeba avatar Apr 16 '24 14:04 SchSeba

@SchSeba Thanks! checking with lab on this.

hymgg avatar Apr 16 '24 22:04 hymgg

The lab team enabled SR-IOV on the NICs. SriovNetworkNodeState now reports totalvfs: 64, but still "cannot configure sriov interfaces". I tried deleting and re-applying the same SriovNetworkNodePolicy; that didn't help.

$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -n sriov-network-operator mtx-dell4-bld01.dc1.matrixxsw.com -o yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
  annotations:
    sriovnetwork.openshift.io/current-state: Idle
    sriovnetwork.openshift.io/desired-state: Idle
  creationTimestamp: "2024-04-09T22:46:21Z"
  generation: 9
  name: mtx-dell4-bld01.dc1.matrixxsw.com
  namespace: sriov-network-operator
  ownerReferences:
  - apiVersion: sriovnetwork.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: SriovOperatorConfig
    name: default
    uid: ea87a10e-5906-4d3b-a8b0-f67783bc36b6
  resourceVersion: "178734156"
  uid: 52ebc9f9-d110-4915-ba9c-65b53b79c4b0
spec:
  interfaces:
  - name: ens1f1
    numVfs: 8
    pciAddress: 0000:3b:00.1
    vfGroups:
    - deviceType: netdevice
      policyName: policy-ens1f1
      resourceName: ens1f1
      vfRange: 0-7
status:
  interfaces:
  - deviceID: 158a
    driver: i40e
    eSwitchMode: legacy
    linkSpeed: 25000 Mb/s
    linkType: ETH
    mac: "12:21:04:20:01:02"
    mtu: 1500
    name: p1p1
    pciAddress: 0000:3b:00.0
    totalvfs: 64
    vendor: "8086"
  - deviceID: 158a
    driver: i40e
    eSwitchMode: legacy
    linkSpeed: 25000 Mb/s
    linkType: ETH
    mac: "12:21:04:20:01:03"
    mtu: 1500
    name: ens1f1
    pciAddress: 0000:3b:00.1
    totalvfs: 64
    vendor: "8086"
  lastSyncError: cannot configure sriov interfaces
  syncStatus: Failed

hymgg avatar Apr 17 '24 18:04 hymgg

Anything else we should check?

hymgg avatar Apr 19 '24 00:04 hymgg

Can you provide new logs from config daemon?

rollandf avatar Apr 21 '24 06:04 rollandf

Config daemon says cannot allocate memory. Uploading logs from config daemon, device plugin and operator.

2024-04-22T20:43:36.502722528Z ERROR sriov/sriov.go:992 SetSriovNumVfs(): fail to set NumVfs file {"path": "/sys/bus/pci/devices/0000:3b:00.1/sriov_numvfs", "error": "write /sys/bus/pci/devices/0000:3b:00.1/sriov_numvfs: cannot allocate memory"}
2024-04-22T20:43:36.502748646Z ERROR sriov/sriov.go:545 configSriovPFDevice(): fail to set NumVfs for device {"device": "0000:3b:00.1", "error": "write /sys/bus/pci/devices/0000:3b:00.1/sriov_numvfs: cannot allocate memory"}
2024-04-22T20:43:36.502758263Z ERROR sriov/sriov.go:594 configSriovInterfaces(): fail to configure sriov interface. resetting interface. {"address": "0000:3b:00.1", "error": "write /sys/bus/pci/devices/0000:3b:00.1/sriov_numvfs: cannot allocate memory"}
2024-04-22T20:43:36.503045188Z ERROR generic/generic_plugin.go:183 cannot configure sriov interfaces {"error": "write /sys/bus/pci/devices/0000:3b:00.1/sriov_numvfs: cannot allocate memory"}
2024-04-22T20:43:36.503061016Z ERROR daemon/daemon.go:259 nodeStateSyncHandler(): generic plugin fail to apply {"error": "cannot configure sriov interfaces"}

sriov-network-operator-845dc5dffc-4hvsb.log
sriov-device-plugin-pnld9.log
sriov-network-config-daemon-sxf4b.log

hymgg avatar Apr 22 '24 20:04 hymgg

Seems that you need to add the following kernel arg: pci=realloc to your server.
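A small sketch for checking whether the argument is already active on a node before and after the change. The helper name has_kernel_arg and its optional file argument are hypothetical (the file argument only makes the function testable away from a real /proc/cmdline). The grubby line in the comments is the usual way to add a kernel argument on RHEL-family systems; treat it as an assumption about this environment and adapt it to however the nodes manage their boot loader.

```shell
# has_kernel_arg ARG [CMDLINE_FILE]
# Succeeds if ARG appears as a whole word on the kernel command line.
has_kernel_arg() {
  arg=$1
  file=${2:-/proc/cmdline}
  # One argument per line, then an exact fixed-string match on a full line.
  tr -s ' ' '\n' < "$file" | grep -qxF -- "$arg"
}

# Usage on a worker node:
# if has_kernel_arg pci=realloc; then echo "pci=realloc is active"; fi
#
# Commonly used on RHEL-family nodes to add it (a reboot is needed afterwards):
# grubby --update-kernel=ALL --args="pci=realloc"
```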

rollandf avatar Apr 24 '24 06:04 rollandf

@hymgg can you please check the BIOS? I think there is a setting there called 4M memory or something like that.

SchSeba avatar Apr 24 '24 10:04 SchSeba

@rollandf @SchSeba The VFs showed up after adding pci=realloc to the kernel arguments. Thanks! But according to the quickstart guide, the VFs should next be reported as node allocatable resources, and that didn't happen. The device-plugin pods also take turns terminating and being recreated. Their logs report "error creating new device".

$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -n sriov-network-operator mtx-dell4-bld01.dc1.matrixxsw.com -o yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
  annotations:
    sriovnetwork.openshift.io/current-state: Idle
    sriovnetwork.openshift.io/desired-state: Drain_Required
  creationTimestamp: "2024-04-09T22:46:21Z"
  generation: 9
  name: mtx-dell4-bld01.dc1.matrixxsw.com
  namespace: sriov-network-operator
  ownerReferences:
  - apiVersion: sriovnetwork.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: SriovOperatorConfig
    name: default
    uid: ea87a10e-5906-4d3b-a8b0-f67783bc36b6
  resourceVersion: "186573701"
  uid: 52ebc9f9-d110-4915-ba9c-65b53b79c4b0
spec:
  interfaces:
  - name: ens1f1
    numVfs: 8
    pciAddress: 0000:3b:00.1
    vfGroups:
    - deviceType: netdevice
      policyName: policy-ens1f1
      resourceName: ens1f1
      vfRange: 0-7
status:
  interfaces:
  - deviceID: 158a
    driver: i40e
    eSwitchMode: legacy
    linkSpeed: 25000 Mb/s
    linkType: ETH
    mac: "12:21:04:20:01:02"
    mtu: 1500
    name: p1p1
    pciAddress: 0000:3b:00.0
    totalvfs: 64
    vendor: "8086"
  - Vfs:
    - deviceID: 154c
      pciAddress: 0000:3b:0a.0
      vendor: "8086"
      vfID: 0
    - deviceID: 154c
      pciAddress: 0000:3b:0a.1
      vendor: "8086"
      vfID: 1
    - deviceID: 154c
      pciAddress: 0000:3b:0a.2
      vendor: "8086"
      vfID: 2
    - deviceID: 154c
      pciAddress: 0000:3b:0a.3
      vendor: "8086"
      vfID: 3
    - deviceID: 154c
      pciAddress: 0000:3b:0a.4
      vendor: "8086"
      vfID: 4
    - deviceID: 154c
      pciAddress: 0000:3b:0a.5
      vendor: "8086"
      vfID: 5
    - deviceID: 154c
      pciAddress: 0000:3b:0a.6
      vendor: "8086"
      vfID: 6
    - deviceID: 154c
      pciAddress: 0000:3b:0a.7
      vendor: "8086"
      vfID: 7
    deviceID: 158a
    driver: i40e
    eSwitchMode: legacy
    linkSpeed: 25000 Mb/s
    linkType: ETH
    mac: "12:21:04:20:01:03"
    mtu: 1500
    name: ens1f1
    numVfs: 8
    pciAddress: 0000:3b:00.1
    totalvfs: 64
    vendor: "8086"
  syncStatus: InProgress

$ kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]'
[
  {
    "name": "mtx-dell4-bld01.dc1.matrixxsw.com",
    "allocable": {
      "cpu": "64",
      "ephemeral-storage": "213255452729",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "394453236Ki",
      "pods": "110"
    }
  },
  ...

$ kubectl get all -n sriov-network-operator
NAME                            READY   STATUS        RESTARTS   AGE
pod/sriov-device-plugin-kczqc   1/1     Terminating   0          11s
pod/sriov-device-plugin-pq2xz   1/1     Running       0          10m
pod/sriov-device-plugin-txb5j   1/1     Running       0          1s
...

$ kubectl logs sriov-device-plugin-4mdbw -n sriov-network-operator
I0425 02:48:15.236595 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0425 02:48:15.236654 1 main.go:46] resource manager reading configs
I0425 02:48:15.236679 1 manager.go:86] raw ResourceList: {"resourceList":[{"resourceName":"ens1f1","selectors":{"vendors":["8086"],"devices":["154c"],"rootDevices":["0000:3b:00.1"],"IsRdma":false,"NeedVhostNet":false},"SelectorObj":null}]}
I0425 02:48:15.236752 1 factory.go:211] *types.NetDeviceSelectors for resource ens1f1 is [0xc0001eaa20]
I0425 02:48:15.236770 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:ens1f1 DeviceType:netDevice ExcludeTopology:false Selectors:0xc00053a180 AdditionalInfo:map[] SelectorObjs:[0xc0001eaa20]}]
I0425 02:48:15.236799 1 manager.go:217] validating resource name "openshift.io/ens1f1"
I0425 02:48:15.236807 1 main.go:62] Discovering host devices
I0425 02:48:15.312076 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:3b:00.0 02 Intel Corporation Ethernet Controller XXV710 for 25GbE ...
I0425 02:48:15.312341 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:3b:00.1 02 Intel Corporation Ethernet Controller XXV710 for 25GbE ...
I0425 02:48:15.312536 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:3b:0a.0 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312560 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:3b:0a.1 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312574 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:3b:0a.2 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312589 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:3b:0a.3 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312602 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:3b:0a.4 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312615 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:3b:0a.5 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312628 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:3b:0a.6 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312642 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:3b:0a.7 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312673 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:3b:00.0 02 Intel Corporation Ethernet Controller XXV710 for 25GbE ...
I0425 02:48:15.312690 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:3b:00.1 02 Intel Corporation Ethernet Controller XXV710 for 25GbE ...
I0425 02:48:15.312695 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:3b:0a.0 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312699 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:3b:0a.1 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312704 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:3b:0a.2 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312707 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:3b:0a.3 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312711 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:3b:0a.4 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312714 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:3b:0a.5 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312717 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:3b:0a.6 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312722 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:3b:0a.7 02 Intel Corporation Ethernet Virtual Function 700 Series
I0425 02:48:15.312727 1 main.go:68] Initializing resource servers
I0425 02:48:15.312734 1 manager.go:117] number of config: 1
I0425 02:48:15.312739 1 manager.go:121] Creating new ResourcePool: ens1f1
I0425 02:48:15.312742 1 manager.go:122] DeviceType: netDevice
E0425 02:48:15.312843 1 netDeviceProvider.go:50] netdevice GetDevices(): error creating new device: "error getting driver info for device 0000:3b:0a.0 readlink /sys/bus/pci/devices/0000:3b:0a.0/driver: no such file or directory"
E0425 02:48:15.312854 1 netDeviceProvider.go:50] netdevice GetDevices(): error creating new device: "error getting driver info for device 0000:3b:0a.1 readlink /sys/bus/pci/devices/0000:3b:0a.1/driver: no such file or directory"
E0425 02:48:15.312863 1 netDeviceProvider.go:50] netdevice GetDevices(): error creating new device: "error getting driver info for device 0000:3b:0a.2 readlink /sys/bus/pci/devices/0000:3b:0a.2/driver: no such file or directory"
E0425 02:48:15.312870 1 netDeviceProvider.go:50] netdevice GetDevices(): error creating new device: "error getting driver info for device 0000:3b:0a.3 readlink /sys/bus/pci/devices/0000:3b:0a.3/driver: no such file or directory"
E0425 02:48:15.312878 1 netDeviceProvider.go:50] netdevice GetDevices(): error creating new device: "error getting driver info for device 0000:3b:0a.4 readlink /sys/bus/pci/devices/0000:3b:0a.4/driver: no such file or directory"
E0425 02:48:15.312885 1 netDeviceProvider.go:50] netdevice GetDevices(): error creating new device: "error getting driver info for device 0000:3b:0a.5 readlink /sys/bus/pci/devices/0000:3b:0a.5/driver: no such file or directory"
E0425 02:48:15.312893 1 netDeviceProvider.go:50] netdevice GetDevices(): error creating new device: "error getting driver info for device 0000:3b:0a.6 readlink /sys/bus/pci/devices/0000:3b:0a.6/driver: no such file or directory"
E0425 02:48:15.312902 1 netDeviceProvider.go:50] netdevice GetDevices(): error creating new device: "error getting driver info for device 0000:3b:0a.7 readlink /sys/bus/pci/devices/0000:3b:0a.7/driver: no such file or directory"
I0425 02:48:15.312910 1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0425 02:48:15.312916 1 manager.go:142] no devices in device pool, skipping creating resource server for ens1f1
I0425 02:48:15.312931 1 main.go:74] Starting all servers...
I0425 02:48:15.312935 1 main.go:79] All servers started.
I0425 02:48:15.312939 1 main.go:80] Listening for term signals

hymgg avatar Apr 25 '24 03:04 hymgg

Forgot to mention: while the device plugin pods terminate and get recreated, the nodes take turns going into SchedulingDisabled state too.

$ kubectl get node
NAME                                STATUS                     ROLES    AGE    VERSION
mtx-dell4-bld01.dc1.matrixxsw.com   Ready                      worker   291d   v1.26.6
mtx-dell4-bld02.dc1.matrixxsw.com   Ready,SchedulingDisabled   worker   291d   v1.26.6
mtx-dell4-bld03.dc1.matrixxsw.com   Ready                      worker   291d   v1.26.6

hymgg avatar Apr 26 '24 05:04 hymgg

Hello, any other ideas to investigate?

hymgg avatar Apr 29 '24 21:04 hymgg

still looking for remedy to the situation...

hymgg avatar May 02 '24 04:05 hymgg

@hymgg I ran into this almost 8 months ago. Almost everything in your post. On a single-node, clean test cluster this thing works. But our nodes have hundreds of labels.

Are your clusters rke?

Generally this project did not seem great. Just even that the labels are hard coded is absolutely terrible.

I'm already not looking forward to trudging down this path again.

ns-rlewkowicz avatar May 03 '24 19:05 ns-rlewkowicz

@ns-rlewkowicz Thanks for sharing your experience. Is there an alternative that works better?

This is a vanilla k8s on bare metal rh8 nodes, installed with kubeadm. Just the essentials, nothing fancy.

hymgg avatar May 03 '24 23:05 hymgg

GetDevices(): error creating new device: "error getting driver info for device 0000:3b:0a.0 readlink /sys/bus/pci/devices/0000:3b:0a.0/driver: no such file or directory"

Are the created SR-IOV virtual functions bound to the Intel driver? From the logs it doesn't seem so. Is the VF driver installed in your OS? Does cat /sys/bus/pci/devices/0000\:3b\:00.0/sriov_drivers_autoprobe return 1?
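To answer the binding question for all VFs at once, a sketch along these lines can be run on the node. It walks the virtfnN links under the PF's sysfs directory and prints which driver, if any, each VF is bound to. The function name vf_driver_report is hypothetical, and the PF directory is passed as an argument so the function can also be exercised against a test tree. For Intel 700-series VFs the expected driver is normally iavf, but treat that as an assumption and check what the OS actually ships.

```shell
# vf_driver_report PF_SYSFS_DIR
# Prints one line per VF: "<pci address> <driver name or (unbound)>".
vf_driver_report() {
  pf_dir=$1
  for vf in "$pf_dir"/virtfn*; do
    # Skip the literal pattern when the PF has no VFs at all.
    [ -e "$vf" ] || continue
    vf_addr=$(basename "$(readlink -f "$vf")")
    if [ -L "$vf/driver" ]; then
      drv=$(basename "$(readlink "$vf/driver")")
    else
      drv="(unbound)"
    fi
    printf '%s %s\n' "$vf_addr" "$drv"
  done
}

# On the worker node from this thread:
# vf_driver_report /sys/bus/pci/devices/0000:3b:00.1
# cat /sys/bus/pci/devices/0000:3b:00.1/sriov_drivers_autoprobe
```

If every VF shows as unbound, that matches the "readlink .../driver: no such file or directory" errors in the device plugin log.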

adrianchiris avatar May 05 '24 06:05 adrianchiris

@hymgg can you please run

lspci -v -nn -mm -k -s 0000:3b:00.0
lspci -vvv -s 0000:3b:00.0

and also please check again in the bios configuration about

Memory Mapped I/O above 4GB : enable

SchSeba avatar May 06 '24 12:05 SchSeba

Hi @ns-rlewkowicz, is there any specific issue that the community can help with? We have a large number of users running the operator on large clusters.

SchSeba avatar May 06 '24 12:05 SchSeba