sriov-network-device-plugin

update the capacity to zero on shutdown/reset

Open SchSeba opened this issue 2 years ago • 12 comments

When the device plugin is restarted, kubelet marks the resource as unhealthy but still reports it as existing for a grace period (5 minutes). If a pod is scheduled to the node before the device plugin comes back up, pod creation fails, without a retry loop, with the error message Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices <DEVICE_NAME>, which is unexpected.

This commit allows the device plugin to send an empty list of devices before a reset or shutdown.
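
A minimal sketch of the idea, assuming the standard device plugin API from k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1 (the resourceServer type and its fields here are illustrative stand-ins, not this repo's actual implementation):

```go
package resources

import (
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// resourceServer is an illustrative stand-in for the plugin's real server type.
type resourceServer struct {
	devices []*pluginapi.Device // devices discovered on this node
	stopCh  chan struct{}       // closed when the plugin is shutting down or resetting
}

// ListAndWatch streams the device list to kubelet and, on shutdown, sends one
// final update with an empty list so kubelet drops the advertised capacity to
// zero instead of keeping stale devices around for the grace period.
func (rs *resourceServer) ListAndWatch(_ *pluginapi.Empty, stream pluginapi.DevicePlugin_ListAndWatchServer) error {
	// Report the currently discovered devices as usual.
	if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: rs.devices}); err != nil {
		return err
	}
	// Block until the plugin is asked to shut down or reset.
	<-rs.stopCh
	// Final update: an empty device list zeroes the resource capacity.
	return stream.Send(&pluginapi.ListAndWatchResponse{Devices: []*pluginapi.Device{}})
}
```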

SchSeba avatar Oct 02 '23 07:10 SchSeba

@zeeke please give it another look :)

SchSeba avatar Oct 02 '23 08:10 SchSeba

Hmm, I'm not sure that's what kubelet expects (suddenly being told there are no devices when the device plugin shuts down or restarts).

Will it remove entries from the checkpoint file? What if there are pods already using the resources?

adrianchiris avatar Oct 05 '23 09:10 adrianchiris

> Hmm, I'm not sure that's what kubelet expects (suddenly being told there are no devices when the device plugin shuts down or restarts). Will it remove entries from the checkpoint file? What if there are pods already using the resources?

From my tests it doesn't remove the checkpoint file, and running pods continue to run (I will try to leave it down for a longer period to be sure there is no reconcile or anything else that kills the running pods).

Without this change we have an issue: when we take down the device plugin pod and a pod is allocated to that node, the pod will not be able to start.

SchSeba avatar Oct 21 '23 15:10 SchSeba

related to https://github.com/openshift/sriov-network-operator/issues/812

SchSeba avatar Oct 21 '23 15:10 SchSeba

@SchSeba I've been digging a bit into the kubelet code.

kubelet will report updated node status to the API server every 10 seconds[1] (every NodeStatusUpdateFrequency[2]).

Once the device plugin exits (its endpoint is no longer valid), all its devices are deemed unhealthy[3].

[1] https://github.com/kubernetes/kubernetes/blob/d61cbac69aae97db1839bd2e0e86d68f26b353a7/pkg/kubelet/kubelet.go#L1637
[2] https://github.com/kubernetes/kubernetes/blob/d61cbac69aae97db1839bd2e0e86d68f26b353a7/staging/src/k8s.io/kubelet/config/v1beta1/types.go#L267
[3] https://github.com/kubernetes/kubernetes/blob/d61cbac69aae97db1839bd2e0e86d68f26b353a7/pkg/kubelet/nodestatus/setters.go#L342

So, after at most 10 seconds, the node will report zero allocatable resources if the plugin has exited. If any pod is scheduled to the node before that, it will hit an admission error. Setting devices explicitly as unhealthy will not help, since unhealthy devices are only reported to the kube API during that sync loop.
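
For contrast, a hypothetical sketch of the alternative being ruled out here: marking every device unhealthy over the same stream (reusing the pluginapi import from the sketch above). Kubelet records the change immediately, but the scheduler only observes it after the next node status sync:

```go
// markAllUnhealthy flags every device as unhealthy in one ListAndWatch update.
// Kubelet stores this right away, but the scheduler only sees the change once
// the node status is synced to the API server, so the admission-error window
// is the same as simply exiting.
func markAllUnhealthy(stream pluginapi.DevicePlugin_ListAndWatchServer, devices []*pluginapi.Device) error {
	unhealthy := make([]*pluginapi.Device, 0, len(devices))
	for _, d := range devices {
		unhealthy = append(unhealthy, &pluginapi.Device{ID: d.ID, Health: pluginapi.Unhealthy})
	}
	return stream.Send(&pluginapi.ListAndWatchResponse{Devices: unhealthy})
}
```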

What I suggest: in sriov-network-operator, after we remove the device plugin, spin up a goroutine that keeps deleting pods in admission error if they consume device plugin resources, until the new device plugin is up. Alternatively (an even better option IMO), once the device plugin is up again, clean up such pods once; see the sketch below.

LMK what you think.
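
A rough client-go sketch of the one-shot cleanup (the function name, resource name, and selectors are assumptions for illustration, not operator code):

```go
package cleanup

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupAdmissionErrorPods deletes failed pods on nodeName that requested
// resourceName (e.g. "openshift.io/sriov_netdevice"), i.e. pods stuck in
// UnexpectedAdmissionError because the device plugin was down.
func cleanupAdmissionErrorPods(ctx context.Context, cs kubernetes.Interface, nodeName string, resourceName corev1.ResourceName) error {
	pods, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("spec.nodeName=%s,status.phase=Failed", nodeName),
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.Status.Reason != "UnexpectedAdmissionError" {
			continue
		}
		for _, c := range pod.Spec.Containers {
			if _, ok := c.Resources.Requests[resourceName]; !ok {
				continue
			}
			// Delete once per pod; the owning controller will recreate it.
			if err := cs.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
				return err
			}
			break
		}
	}
	return nil
}
```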

adrianchiris avatar Nov 23 '23 14:11 adrianchiris

Interesting. I will try to implement a POC in the operator, but before that, a question: do you see any issue with changing the number to 0 when we reboot? Can that affect something?

SchSeba avatar Dec 05 '23 16:12 SchSeba

  1. I do not know what the original authors of the (gRPC) API intended.
  2. Technically, we don't achieve more reliability if we do it, IMO.
  3. When the device plugin exits, kubelet will mark the reported resources as invalid (internally) IIRC, which is effectively the same as reporting a count of 0. In both cases it only affects the scheduler once the node status is updated in the apiserver.

adrianchiris avatar Dec 05 '23 16:12 adrianchiris

IIUC, if the device plugin exits and leaves the resources unhealthy (for a while, I guess), any Pod deployed there will get an admission error. If the Pod comes from a Deployment or a ReplicaSet, the kube controller spawns a lot of subsequent Pods, creating a pile of junk:

NAME                                                 READY   STATUS                        RESTARTS   AGE    IP               NODE       NOMINATED NODE   READINESS GATES
deploy1-5686945dcc-2x9j5                              0/6     UnexpectedAdmissionError      0          43h    <none>           worker-2   <none>           <none>
deploy1-5686945dcc-4bdpz                              0/6     UnexpectedAdmissionError      0          43h    <none>           worker-2   <none>           <none>
deploy1-5686945dcc-4ht8f                              0/6     Init:ContainerStatusUnknown   0          43h    <none>           worker-2   <none>           <none>
deploy1-5686945dcc-4xv28                              0/6     UnexpectedAdmissionError      0          43h    <none>           worker-2   <none>           <none>

Setting the device count to 0 will instead prevent the scheduler from selecting the node, making it retry the same Pod with exponential backoff.

zeeke avatar Dec 06 '23 12:12 zeeke

Pull Request Test Coverage Report for Build 6377485447

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 31 of 39 (79.49%) changed or added relevant lines in 1 file are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 75.876%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/resources/server.go | 31 | 39 | 79.49% |
| Total: | 31 | 39 | |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/resources/server.go | 2 | 78.52% |
| Total: | 2 | |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 6371847942: | -0.2% |
| Covered Lines: | 2013 |
| Relevant Lines: | 2653 |

💛 - Coveralls

coveralls avatar May 24 '24 13:05 coveralls

Pull Request Test Coverage Report for Build 6377734797

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 31 of 40 (77.5%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 75.867%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/resources/server.go | 31 | 40 | 77.5% |
| Total: | 31 | 40 | |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/resources/server.go | 1 | 78.45% |
| Total: | 1 | |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 6371847942: | -0.2% |
| Covered Lines: | 2012 |
| Relevant Lines: | 2652 |

💛 - Coveralls

coveralls avatar May 24 '24 13:05 coveralls

Pull Request Test Coverage Report for Build 6377419645

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 31 of 41 (75.61%) changed or added relevant lines in 1 file are covered.
  • 19 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 75.876%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/resources/server.go | 31 | 41 | 75.61% |
| Total: | 31 | 41 | |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/resources/server.go | 19 | 78.52% |
| Total: | 19 | |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 6371847942: | -0.2% |
| Covered Lines: | 2013 |
| Relevant Lines: | 2653 |

💛 - Coveralls

coveralls avatar May 24 '24 13:05 coveralls
