ResourceQuota error isn't reflected in kservice status
In what area(s)?
/area API
What version of Knative?
HEAD
Expected Behavior
The default namespace contains a LimitRange that sets the default CPU request (defaultRequest) to 100m. I created a ResourceQuota in the same namespace with a CPU quota of 50m, then tried to serve requests to an app deployed in that namespace. I expected `kubectl get kservice` or `kubectl get pods` to show an error saying that pod creation failed because the ResourceQuota was exceeded.
Actual Behavior
Cannot hit the service (loading is stuck). `kubectl get kservice` shows the app as Ready, with no mention of the quota error or of the pod creation failure in the status. Only digging further down and looking at the YAML of the deployment reveals the error.
Status of kservice:
```yaml
status:
  address:
    hostname: testapp.default.svc.cluster.local
    url: http://testapp.default.svc.cluster.local
  conditions:
  - lastTransitionTime: 2019-05-09T23:14:18Z
    status: "True"
    type: ConfigurationsReady
  - lastTransitionTime: 2019-06-17T17:12:57Z
    message: build cannot be migrated forward.
    reason: build
    severity: Warning
    status: "False"
    type: Convertible
  - lastTransitionTime: 2019-05-09T23:14:19Z
    status: "True"
    type: Ready
  - lastTransitionTime: 2019-05-09T23:14:19Z
    status: "True"
    type: RoutesReady
  domain: testapp.default.example.com
  domainInternal: testapp.default.svc.cluster.local
  latestCreatedRevisionName: testapp-ncngm
  latestReadyRevisionName: testapp-ncngm
  observedGeneration: 1
  traffic:
  - latestRevision: true
    percent: 100
    revisionName: testapp-ncngm
  url: http://testapp.default.example.com
```
Status of deployment:
```yaml
status:
  conditions:
  - lastTransitionTime: 2019-05-09T23:14:08Z
    lastUpdateTime: 2019-06-17T17:13:02Z
    message: ReplicaSet "testapp-ncngm-deployment-6cbf59d7b9" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: 2019-06-18T22:06:36Z
    lastUpdateTime: 2019-06-18T22:06:36Z
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: 2019-06-18T22:06:36Z
    lastUpdateTime: 2019-06-18T22:06:36Z
    message: 'pods "testapp-ncngm-deployment-6cbf59d7b9-cjrjl" is forbidden: exceeded
      quota: new-cpu-quota, requested: cpu=225m, used: cpu=200m, limited: cpu=50m'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  observedGeneration: 6
  unavailableReplicas: 1
```
Steps to Reproduce the Problem
- Create a LimitRange in a namespace, setting the default request for CPU (or any resource) to some value.
- Create a ResourceQuota in the same namespace, setting the quota for that resource to a smaller value than the default.
- Try to serve requests from an app in the same namespace (example manifests below).
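For reference, a minimal pair of manifests matching the setup described above (names and values mirror this report and are otherwise illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-limit-range
  namespace: default
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m    # default request larger than the quota below
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: new-cpu-quota
  namespace: default
spec:
  hard:
    cpu: 50m       # smaller than the defaulted request, so pods get rejected
```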
I believe the issue might be related to how the deployment is being reconciled. It looks like an "Error getting pods" message is logged, but the status of the revision/kservice does not get updated. Also, the logic checks that `deployment.Status.AvailableReplicas == 0`, which might not cover all cases where pod creation has failed (for example, if 2 replicas have already been created and the 3rd replica exceeds the ResourceQuota limit). Would it be possible to use the deployment's `UnavailableReplicas` value instead?
Code for reference: https://github.com/knative/serving/blob/master//pkg/reconciler/revision/reconcile_resources.go#L36:22
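For illustration, a minimal sketch of that suggestion (the helper name is hypothetical; this is not the actual reconciler code):

```go
package revision

import appsv1 "k8s.io/api/apps/v1"

// deploymentHasFailedReplicas is a hypothetical helper sketching the idea
// above: UnavailableReplicas > 0 also catches partial failures (e.g. two
// replicas already running and a third rejected by a ResourceQuota),
// which an AvailableReplicas == 0 check misses.
func deploymentHasFailedReplicas(d *appsv1.Deployment) bool {
	return d.Status.UnavailableReplicas > 0
}
```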
Thanks for the bug report. This looks like something that we should bubble up into the Service status.
Added API label and moved into Serving 0.8.
/reopen due to this comment https://knative.slack.com/archives/CA4DNJ9A4/p1595251772209500
@dprotaso: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/unassign
/good-first-issue
It seems like this should be fairly easy to write a test for:
- Enable ResourceQuota on a namespace
- Deploy a Knative Service without resource requests
- Check the status on the created Revision, which should fail (possibly after 2 minutes); see the sketch below
/triage accepted
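A rough sketch of that scenario using kubectl and kn (namespace, quota, and image names are illustrative):

```shell
# Quota on a fresh namespace (any hard cpu/memory limit will do)
kubectl create namespace rq-test
kubectl create quota rq-e2e-test --hard=cpu=50m -n rq-test

# Deploy a Knative Service without resource requests; pod creation
# should be rejected by the quota admission check
kn service create hello --image=gcr.io/knative-samples/helloworld-go -n rq-test

# Watch the Revision: Ready should eventually become False
kubectl get revision -n rq-test -w
```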
@evankanderson: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.
In response to this:
/good-first-issue
It seems like this should be fairly easy to write a test for:
- Enable ResourceQuota on a namespace
- Deploy a Knative Service without resource requests
- Check the status on the created Revision, which should fail (possibly after 2 minutes)
/triage accepted
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @dprotaso
@dprotaso need any help on this?
@dprotaso are you still working on this one?
I started looking at this (and other related items) yesterday
@dprotaso Is this fixed in 1.4? I tested on Knative 1.4 and it does not seem to be working.
I can see the error on the deployment but not on the ksvc:
Deployment status:
```yaml
status:
  conditions:
  - lastTransitionTime: "2022-06-19T12:18:00Z"
    lastUpdateTime: "2022-06-19T12:18:00Z"
    message: Created new replica set "torchserve-predictor-default-00001-deployment-75cd59b575"
    reason: NewReplicaSetCreated
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-06-19T12:18:00Z"
    lastUpdateTime: "2022-06-19T12:18:00Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2022-06-19T12:18:00Z"
    lastUpdateTime: "2022-06-19T12:18:00Z"
    message: 'pods "torchserve-predictor-default-00001-deployment-75cd59b575-kpxgj"
      is forbidden: failed quota: pods-high: must specify memory'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
```
ksvc status:
```yaml
status:
  conditions:
  - lastTransitionTime: "2022-06-19T12:17:59Z"
    status: Unknown
    type: ConfigurationsReady
  - lastTransitionTime: "2022-06-19T12:17:59Z"
    message: Configuration "torchserve-predictor-default" is waiting for a Revision
      to become ready.
    reason: RevisionMissing
    status: Unknown
    type: Ready
  - lastTransitionTime: "2022-06-19T12:17:59Z"
    message: Configuration "torchserve-predictor-default" is waiting for a Revision
      to become ready.
    reason: RevisionMissing
    status: Unknown
    type: RoutesReady
```
Yeah - odd - didn't mean to close this
One thing to note is that the failure won't show until the pod's progress deadline is exceeded. The default value is 10 minutes, so it'll take some time for it to fail (though the progress deadline can be configured lower either globally or on a per-revision basis).
That said, the error message still doesn't reference the resource quota issue. Instead, it will be something like Revision "hello-00001" failed with message: Initial scale was never achieved.
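For reference, a sketch of lowering the deadline on a per-Revision basis via the `serving.knative.dev/progress-deadline` annotation (service name and value are illustrative; the global default can be changed in the `config-deployment` ConfigMap):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    metadata:
      annotations:
        # Fail faster than the 10-minute default (illustrative value)
        serving.knative.dev/progress-deadline: "120s"
    spec:
      containers:
      - image: gcr.io/knative-samples/helloworld-go
```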
@psschwei any advice on how to fix this? We'd like to contribute the fix if possible.
It seems the deployment status is only propagated if the revision is active?
https://github.com/knative/serving/blob/main/pkg/reconciler/revision/reconcile_resources.go#L75
@psschwei any advice on how to fix this? We'd like to contribute the fix if possible.
To get that info into the error message, we'd need to somehow propagate the deployment info into the initial scale error message (which is created here). Off the top of my head, I'm not sure what the best way to do that would be...
Would it be a good first step to mirror that Deployment status in the associated Revision? We could then still think about how to propagate that back to a Service (which, btw, can be associated with multiple Revisions via traffic split, so I don't necessarily think that deployment-related errors should bubble up to the Service, except if we were to collect them in a list).
Would it be a good first step to mirror that Deployment status in the associated Revision?
I went back and looked at this on v1.6, and it looks like the quota errors are showing on both the revision and the service.
Revision:
```
$ k get revision -n rq-test hello-00001 -o json | jq .status.conditions
[
  {
    "lastTransitionTime": "2022-08-04T13:46:07Z",
    "message": "The target is not receiving traffic.",
    "reason": "NoTraffic",
    "severity": "Info",
    "status": "False",
    "type": "Active"
  },
  {
    "lastTransitionTime": "2022-08-04T13:45:27Z",
    "status": "Unknown",
    "type": "ContainerHealthy"
  },
  {
    "lastTransitionTime": "2022-08-04T13:45:27Z",
    "message": "pods \"hello-00001-deployment-54bf4b6774-g8l79\" is forbidden: exceeded quota: rq-e2e-test, requested: cpu=525m, used: cpu=0, limited: cpu=50m",
    "reason": "FailedCreate",
    "status": "False",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2022-08-04T13:45:27Z",
    "message": "pods \"hello-00001-deployment-54bf4b6774-g8l79\" is forbidden: exceeded quota: rq-e2e-test, requested: cpu=525m, used: cpu=0, limited: cpu=50m",
    "reason": "FailedCreate",
    "status": "False",
    "type": "ResourcesAvailable"
  }
]
```
Service:
```
$ k get ksvc -n rq-test hello -o json | jq .status.conditions
[
  {
    "lastTransitionTime": "2022-08-04T13:45:27Z",
    "message": "Revision \"hello-00001\" failed with message: pods \"hello-00001-deployment-54bf4b6774-g8l79\" is forbidden: exceeded quota: rq-e2e-test, requested: cpu=525m, used: cpu=0, limited: cpu=50m.",
    "reason": "RevisionFailed",
    "status": "False",
    "type": "ConfigurationsReady"
  },
  {
    "lastTransitionTime": "2022-08-04T13:45:27Z",
    "message": "Configuration \"hello\" does not have any ready Revision.",
    "reason": "RevisionMissing",
    "status": "False",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2022-08-04T13:45:27Z",
    "message": "Configuration \"hello\" does not have any ready Revision.",
    "reason": "RevisionMissing",
    "status": "False",
    "type": "RoutesReady"
  }
]
```
Off the top of my head, I'm not sure what exactly changed between 1.4 and 1.6 to get these in there, but in there they are :smile:
Ah, great. So I guess we could close this issue then? It would be great to find out when the fix went in, though ;-)
Checking on what the exact fix was... it's not showing in v1.5, so it was something in the last release.
@dprotaso Is this issue still valid? Is there anyone working on it?
/unassign @dprotaso
I'm currently not working on this - it is up for grabs
/assign
/assign @xiangpingjiang Are you going to take over this issue?
@houshengbo yes, I want to have a try
/assign
/assign
Probably fixed here: https://github.com/knative/serving/pull/14453. I'm testing to see if it's true.
It does bubble up the quota limit errors, too. I'm adding this as a fixed issue to the PR.
@gabo1208 were you able to follow up on whether your changes fix this issue?
Let me test this exact case between today and Friday, but it should be fixed. I'll update the issue with the results @dprotaso