sriov-network-device-plugin
sriov-network-device-plugin copied to clipboard
update the capacity to zero on shutdown/reset
When the device plugin is restarted, kubelet marks the resource as unhealthy, but still reports the resource as existing for a grace period (5 mins). If a pod is scheduled before the device plugin comes up, the pod create fails without a retryloop with an error message Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices <DEVICE_NAME>, which is unexpected.
This commit allow the device plugin to send an empty list of devices before the reset or shutdown
@zeeke please give it another look :)
hmm im not sure that thats what kubelet expects (being suddenly reported there are no devices when device plugin shuts down or restarts)
will it remove entries from checkpoint file ? what if there are pods already using resources ?
hmm im not sure that thats what kubelet expects (being suddenly reported there are no devices when device plugin shuts down or restarts) will it remove entries from checkpoint file ? what if there are pods already using resources ?
From my tests it doesn't remove the checkpoint file and running pods continue to run (I will try to leave it down for a longer period to be sure there is no reconcile or something that will kill the running pods)
without this change, we have an issue when we take down the pod and the pod is allocated to that node the pod will not be able to start.
related to https://github.com/openshift/sriov-network-operator/issues/812
@SchSeba been digging a bit into kubelet code
kubelet will report updated status to kubelet every 10 seconds[1] (every NodeStatusUpdateFrequency[2])
once device plugin plugin exits (its endpoint no longer valid), all devices are deemed unhealty[3]
[1] https://github.com/kubernetes/kubernetes/blob/d61cbac69aae97db1839bd2e0e86d68f26b353a7/pkg/kubelet/kubelet.go#L1637 [2] https://github.com/kubernetes/kubernetes/blob/d61cbac69aae97db1839bd2e0e86d68f26b353a7/staging/src/k8s.io/kubelet/config/v1beta1/types.go#L267 [3] https://github.com/kubernetes/kubernetes/blob/d61cbac69aae97db1839bd2e0e86d68f26b353a7/pkg/kubelet/nodestatus/setters.go#L342
so after at most 10 seconds node will report zero allocatable resources if plugin has exited. if any pod is scheduled to the node before that, it will enter admission error. setting devices explicitly as un-healthy will not help as unhealty devices is only reported to kube api during that sync loop.
what i suggest, is in sriov-network-operator after we remove device plugin to spin up a goroutine to keep deleting pods in admission error if they consume resources from device plugin until new device plugin is up or alternatively (even better option imo) once device plugin is up again clean up any pods which consume dp resources in admission error once.
LMK what you think.
Interesting I will try to implement a POC in the operator but before that a question do you see any issue changing the number to 0 when we reboot? can that effect something?
- i do not know what the original authors of the API (grpc) intended.
- technically we dont achieve more reliability if we do it IMO.
- when device plugin exits, it will mark the resources reported as invalid (internally) IIRC. which is the same as if plugin exited. in both cases it only affects scheduler when node status is updated in apiserver.
IIUC, if the device plugin exits and leaves the resources unhealthy (for a while I guess), any deployed Pod will get an admission error. If the Pod comes from a Deployment or a Replicaset, the kube controller spawns a lot of subsequent Pods, creating a little junk:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
deploy1-5686945dcc-2x9j5 0/6 UnexpectedAdmissionError 0 43h <none> worker-2 <none> <none>
deploy1-5686945dcc-4bdpz 0/6 UnexpectedAdmissionError 0 43h <none> worker-2 <none> <none>
deploy1-5686945dcc-4ht8f 0/6 Init:ContainerStatusUnknown 0 43h <none> worker-2 <none> <none>
deploy1-5686945dcc-4xv28 0/6 UnexpectedAdmissionError 0 43h <none> worker-2 <none> <none>
Setting the device count to 0 will prevent the scheduler from selecting the node, making it retry the same Pod in an exponential backoff fashion.
Pull Request Test Coverage Report for Build 6377485447
Warning: This coverage report may be inaccurate.
This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
- For more information on this, see Tracking coverage changes with pull request builds.
- To avoid this issue with future PRs, see these Recommended CI Configurations.
- For a quick fix, rebase this PR at GitHub. Your next report should be accurate.
Details
- 31 of 39 (79.49%) changed or added relevant lines in 1 file are covered.
- 2 unchanged lines in 1 file lost coverage.
- Overall coverage decreased (-0.2%) to 75.876%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/resources/server.go | 31 | 39 | 79.49% |
| <!-- | Total: | 31 | 39 |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/resources/server.go | 2 | 78.52% |
| <!-- | Total: | 2 |
| Totals | |
|---|---|
| Change from base Build 6371847942: | -0.2% |
| Covered Lines: | 2013 |
| Relevant Lines: | 2653 |
💛 - Coveralls
Pull Request Test Coverage Report for Build 6377734797
Warning: This coverage report may be inaccurate.
This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
- For more information on this, see Tracking coverage changes with pull request builds.
- To avoid this issue with future PRs, see these Recommended CI Configurations.
- For a quick fix, rebase this PR at GitHub. Your next report should be accurate.
Details
- 31 of 40 (77.5%) changed or added relevant lines in 1 file are covered.
- 1 unchanged line in 1 file lost coverage.
- Overall coverage decreased (-0.2%) to 75.867%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/resources/server.go | 31 | 40 | 77.5% |
| <!-- | Total: | 31 | 40 |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/resources/server.go | 1 | 78.45% |
| <!-- | Total: | 1 |
| Totals | |
|---|---|
| Change from base Build 6371847942: | -0.2% |
| Covered Lines: | 2012 |
| Relevant Lines: | 2652 |
💛 - Coveralls
Pull Request Test Coverage Report for Build 6377419645
Warning: This coverage report may be inaccurate.
This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
- For more information on this, see Tracking coverage changes with pull request builds.
- To avoid this issue with future PRs, see these Recommended CI Configurations.
- For a quick fix, rebase this PR at GitHub. Your next report should be accurate.
Details
- 31 of 41 (75.61%) changed or added relevant lines in 1 file are covered.
- 19 unchanged lines in 1 file lost coverage.
- Overall coverage decreased (-0.2%) to 75.876%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/resources/server.go | 31 | 41 | 75.61% |
| <!-- | Total: | 31 | 41 |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/resources/server.go | 19 | 78.52% |
| <!-- | Total: | 19 |
| Totals | |
|---|---|
| Change from base Build 6371847942: | -0.2% |
| Covered Lines: | 2013 |
| Relevant Lines: | 2653 |
💛 - Coveralls
Pull Request Test Coverage Report for Build 6377419645
Warning: This coverage report may be inaccurate.
This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
- For more information on this, see Tracking coverage changes with pull request builds.
- To avoid this issue with future PRs, see these Recommended CI Configurations.
- For a quick fix, rebase this PR at GitHub. Your next report should be accurate.
Details
- 31 of 41 (75.61%) changed or added relevant lines in 1 file are covered.
- 19 unchanged lines in 1 file lost coverage.
- Overall coverage decreased (-0.2%) to 75.876%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/resources/server.go | 31 | 41 | 75.61% |
| <!-- | Total: | 31 | 41 |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/resources/server.go | 19 | 78.52% |
| <!-- | Total: | 19 |
| Totals | |
|---|---|
| Change from base Build 6371847942: | -0.2% |
| Covered Lines: | 2013 |
| Relevant Lines: | 2653 |