Failed to sync with `ReplicaFailure` in ksvc creation sometimes
The problem happens in release 0.17, but it does not appear to be a regression introduced in 0.17.
When creating a Knative application in a namespace that has a LimitRange with min/max specified (e.g. a LimitRange min of 10m CPU), sometimes I get the expected 'pod creation forbidden' error message, but sometimes I don't, and the Knative application creation just fails with ProgressDeadlineExceeded.
This is the output for the expected behaviour:
$ kn service create test3 --image docker.io/cdlliuy/kn-helloworld -n ca482111-7675 --request cpu=1m --limit cpu=1m --force
Replacing service 'test3' in namespace 'ca482111-7675':
0.363s Configuration "test3" is waiting for a Revision to become ready.
2.084s Revision "test3-bqdlg-1" failed with message: pods "test3-bqdlg-1-deployment-7dcfc469f6-658tj" is forbidden: minimum cpu usage per Container is 10m, but request is 1m.
2.121s Configuration "test3" does not have any ready Revision.
2.315s ...
2.356s Configuration "test3" is waiting for a Revision to become ready.
Error: RevisionFailed: Revision "test3-bqdlg-1" failed with message: pods "test3-bqdlg-1-deployment-7dcfc469f6-658tj" is forbidden: minimum cpu usage per Container is 10m, but request is 1m.
Run 'kn --help' for usage
But with a similar command (just another ksvc name), it hangs:
$ kn service create test4 --image docker.io/cdlliuy/kn-helloworld -n ca482111-7675 --request cpu=1m --limit cpu=1m --force
Creating service 'test4' in namespace 'ca482111-7675':
0.219s The Route is still working to reflect the latest desired specification.
0.291s Configuration "test4" is waiting for a Revision to become ready.
^C
Checking the deployment status of the latter service, the ReplicaFailure condition is present:
- lastTransitionTime: "2020-10-19T05:33:57Z"
  lastUpdateTime: "2020-10-19T05:33:57Z"
  message: 'pods "test4-hdqhd-1-deployment-576b96bc76-rb6tj" is forbidden: minimum
    cpu usage per Container is 10m, but request is 1m'
  reason: FailedCreate
  status: "True"
  type: ReplicaFailure
But for the revision:
- lastTransitionTime: "2020-10-19T05:33:57Z"
  reason: Deploying
  status: Unknown
  type: ContainerHealthy
- lastTransitionTime: "2020-10-19T05:36:28Z"
  message: Initial scale was never achieved
  reason: ProgressDeadlineExceeded
  status: "False"
  type: Ready
- lastTransitionTime: "2020-10-19T05:36:28Z"
  message: Initial scale was never achieved
  reason: ProgressDeadlineExceeded
  status: "False"
  type: ResourcesAvailable
In the Knative controller log output, given that there is not enough logging exposed in https://github.com/knative/serving/blob/release-0.17/pkg/reconciler/revision/reconcile_resources.go#L62-L78, it is hard to tell whether the deployment status change triggered a revision reconcile in the unexpected case.
I think it is some kind of race condition. Any insight?
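For reference, here is a minimal sketch of the deployment-condition check whose result seems not to reach the revision. This is not the actual knative/serving code; getReplicaFailure is a hypothetical helper name.

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// getReplicaFailure (hypothetical name) scans a Deployment's status
// conditions for ReplicaFailure=True, which carries the real cause of the
// failure, e.g. reason=FailedCreate with the LimitRange violation message.
func getReplicaFailure(d *appsv1.Deployment) *appsv1.DeploymentCondition {
	for i := range d.Status.Conditions {
		c := &d.Status.Conditions[i]
		if c.Type == appsv1.DeploymentReplicaFailure && c.Status == corev1.ConditionTrue {
			return c
		}
	}
	return nil
}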
@cdlliuy thanks for the report - I'll take a look tomorrow (I'm on EST) and see if I can reproduce the issue
When creating a Knative application in a namespace that has a LimitRange with min/max specified,
Are you setting a ResourceQuota or LimitRange on the namespace? Do you have the example yaml?
Also, what version of K8s are you running?
@dprotaso, I am running on k8s v0.17 with this limit range:
spec:
  limits:
  - default:
      cpu: 100m
    defaultRequest:
      cpu: 100m
    max:
      cpu: "8"
    min:
      cpu: 10m
    type: Container
A resource quota is also set, but it doesn't take effect in this case, since I am requesting a very small amount of CPU.
Great thanks - I'll take a look later today
So I was able to repro on Kind with this script. Looking at the code, we don't propagate the deployment status to the revision unless it's 'active', so it times out. I don't recall what triggers the revision becoming active.
I wasn't able to see your first error, where the status was propagated correctly, i.e. forbidden: minimum cpu usage per Container is 10m, but request is 1m.
I do see that the revision becomes 'Ready: True' after the autoscaler scales the deployment to zero, but since we never reached our initial scale, this is misleading.
I'm going to throw this into the current release (v0.19) for someone to pick up. Otherwise I'll pick it up for v0.20
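To make the propagation gap concrete, here is a hedged sketch of the fix direction described above: copy a ReplicaFailure condition onto the revision as soon as it appears, rather than only once the revision is active. The revisionStatus type and propagateDeploymentStatus function are hypothetical stand-ins, not the actual knative/serving API.

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// revisionStatus is a hypothetical stand-in for the Revision's status.
type revisionStatus struct {
	Reason, Message string
}

// propagateDeploymentStatus copies a ReplicaFailure condition (e.g.
// FailedCreate from a LimitRange violation) onto the revision, so the user
// sees the real cause instead of a later, generic ProgressDeadlineExceeded.
func propagateDeploymentStatus(rev *revisionStatus, d *appsv1.Deployment) {
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentReplicaFailure && c.Status == corev1.ConditionTrue {
			rev.Reason, rev.Message = c.Reason, c.Message
			return
		}
	}
}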
Potentially Related: https://github.com/knative/serving/issues/8540
@dprotaso, I think a similar issue also happens when the namespace has a resource quota or the cluster's resources are exhausted.
Sometimes the quota-breached or insufficient-resources error is surfaced at the ksvc CR layer, so the end user can be aware of it.
But sometimes it just gets stuck with RevisionMissing, and the end user has to dig all the way down to the ReplicaSet layer to find out the failure reason.
Can you share what you think the root cause of the issue is? Maybe we can also contribute some effort to getting this issue fixed. Anyway, currently I don't have any idea why it fails.
@dprotaso, can you share a little bit of your idea for fixing this?
/good-first-issue /area API
Given that there's a repro script and @dprotaso wants to get this into a release, I'm guessing that this is something that a pair of hands could pick up and manage towards a successful completion.
/triage accepted.
@evankanderson: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.
In response to this:
/good-first-issue /area API
Given that there's a repro script and @dprotaso wants to get this into a release, I'm guessing that this is something that a pair of hands could pick up and manage towards a successful completion.
/triage accepted.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@evankanderson: The label(s) triage/accepted. cannot be applied, because the repository doesn't have them.
In response to this:
/good-first-issue /area API
Given that there's a repro script and @dprotaso wants to get this into a release, I'm guessing that this is something that a pair of hands could pick up and manage towards a successful completion.
/triage accepted.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/triage accepted
/assign @dprotaso
/assign