Failed to sync with `ReplicaFailure` in ksvc creation sometimes

Open cdlliuy opened this issue 4 years ago • 14 comments

The problem happens in release 0.17, but is probably not a regression introduced in 0.17.

When creating a Knative application in a namespace that has a LimitRange with min/max specified (e.g. a minimum of 10m for CPU), I sometimes get the expected 'pod creation forbidden' error message, but at other times the application creation just fails with ProgressDeadlineExceeded.

This is the output for the expected behaviour:

$ kn service create test3 --image docker.io/cdlliuy/kn-helloworld -n ca482111-7675 --request cpu=1m --limit cpu=1m  --force
Replacing service 'test3' in namespace 'ca482111-7675':
  0.363s Configuration "test3" is waiting for a Revision to become ready.
  2.084s Revision "test3-bqdlg-1" failed with message: pods "test3-bqdlg-1-deployment-7dcfc469f6-658tj" is forbidden: minimum cpu usage per Container is 10m, but request is 1m.
  2.121s Configuration "test3" does not have any ready Revision.
  2.315s ...
  2.356s Configuration "test3" is waiting for a Revision to become ready.
Error: RevisionFailed: Revision "test3-bqdlg-1" failed with message: pods "test3-bqdlg-1-deployment-7dcfc469f6-658tj" is forbidden: minimum cpu usage per Container is 10m, but request is 1m.
Run 'kn --help' for usage

But with a similar command (just another ksvc name), it hangs:

$ kn service create test4 --image docker.io/cdlliuy/kn-helloworld -n ca482111-7675 --request cpu=1m --limit cpu=1m  --force
Creating service 'test4' in namespace 'ca482111-7675':
  0.219s The Route is still working to reflect the latest desired specification.
  0.291s Configuration "test4" is waiting for a Revision to become ready.
^C

Checking the deployment status of the latter, the ReplicaFailure condition is reported:

  - lastTransitionTime: "2020-10-19T05:33:57Z"
    lastUpdateTime: "2020-10-19T05:33:57Z"
    message: 'pods "test4-hdqhd-1-deployment-576b96bc76-rb6tj" is forbidden: minimum
      cpu usage per Container is 10m, but request is 1m'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure

But for the revision:

  - lastTransitionTime: "2020-10-19T05:33:57Z"
    reason: Deploying
    status: Unknown
    type: ContainerHealthy
  - lastTransitionTime: "2020-10-19T05:36:28Z"
    message: Initial scale was never achieved
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Ready
  - lastTransitionTime: "2020-10-19T05:36:28Z"
    message: Initial scale was never achieved
    reason: ProgressDeadlineExceeded
    status: "False"
    type: ResourcesAvailable

In the Knative controller log output, since not enough logging is exposed in https://github.com/knative/serving/blob/release-0.17/pkg/reconciler/revision/reconcile_resources.go#L62-L78, it is hard to say whether the deployment status change triggered the revision reconcile in the unexpected case.

I think it is a kind of race condition. Any insight?

cdlliuy avatar Oct 19 '20 05:10 cdlliuy

@cdlliuy thanks for the report - I'll take a look tomorrow (I'm on EST) and see if I can reproduce the issue

dprotaso avatar Oct 20 '20 02:10 dprotaso

When creating knative application in a namespace in which limit range min/max specified,

Are you setting a ResourceQuota or LimitRange on the namespace? Do you have the example yaml?

dprotaso avatar Oct 20 '20 19:10 dprotaso

Also, what version of K8s are you running?

dprotaso avatar Oct 20 '20 19:10 dprotaso

@dprotaso, I am running on k8s v0.17 with this limit range:

spec:
  limits:
  - default:
      cpu: 100m
    defaultRequest:
      cpu: 100m
    max:
      cpu: "8"
    min:
      cpu: 10m
    type: Container

The resource quota is set but does not come into play in this case, since I am using a very small CPU request.
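For readers following along, the LimitRange rejection quoted earlier in this thread can be illustrated with a minimal Go sketch. It is not the Kubernetes admission code: `parseMilliCPU` and `checkCPU` are hypothetical helpers, only the plain-integer and `m`-suffixed CPU forms seen in this thread are handled, and the "maximum" message wording is illustrative rather than copied from Kubernetes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMilliCPU converts a CPU quantity such as "10m" or "8" to millicores.
// Assumption: only the integer and "m"-suffixed forms are supported, not the
// full Kubernetes quantity grammar.
func parseMilliCPU(q string) (int64, error) {
	if strings.HasSuffix(q, "m") {
		return strconv.ParseInt(strings.TrimSuffix(q, "m"), 10, 64)
	}
	cores, err := strconv.ParseInt(q, 10, 64)
	if err != nil {
		return 0, err
	}
	return cores * 1000, nil
}

// checkCPU is a hypothetical helper that compares a container's CPU request
// against the LimitRange minimum and its CPU limit against the maximum.
func checkCPU(request, limit, minCPU, maxCPU string) error {
	vals := make([]int64, 0, 4)
	for _, q := range []string{request, limit, minCPU, maxCPU} {
		v, err := parseMilliCPU(q)
		if err != nil {
			return fmt.Errorf("invalid cpu quantity %q: %w", q, err)
		}
		vals = append(vals, v)
	}
	req, lim, lo, hi := vals[0], vals[1], vals[2], vals[3]
	if req < lo {
		return fmt.Errorf("minimum cpu usage per Container is %s, but request is %s", minCPU, request)
	}
	if lim > hi {
		// Illustrative wording, not taken from this thread.
		return fmt.Errorf("maximum cpu usage per Container is %s, but limit is %s", maxCPU, limit)
	}
	return nil
}

func main() {
	// The values from this thread: request/limit of 1m against min 10m, max 8.
	fmt.Println(checkCPU("1m", "1m", "10m", "8"))
	// The LimitRange defaults (100m) fall inside the allowed range.
	fmt.Println(checkCPU("100m", "100m", "10m", "8"))
}
```

With the values above, the 1m request falls below the 10m minimum, which is the condition the pod creation error reports.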

cdlliuy avatar Oct 21 '20 12:10 cdlliuy

Great thanks - I'll take a look later today

dprotaso avatar Oct 21 '20 12:10 dprotaso

So I was able to repro using Kind and this script. Looking at the code, we don't propagate the deployment status to the revision unless it's 'active', so it'll time out. I don't recall what triggers the revision becoming active.

I wasn't able to see your first error where the status was propagated correctly, i.e. forbidden: minimum cpu usage per Container is 10m, but request is 1m.

I do see the revision becomes 'Ready: True' after the autoscaler scales the deployment to zero. But since we never reached our initial scale this is misleading.
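The gating described above can be sketched roughly as follows. This is a simplified illustration, not the code in reconcile_resources.go: `Condition`, `findReplicaFailure`, `resourcesAvailable`, and the `gateOnActive` flag are all hypothetical stand-ins, and the real reconciler works with Kubernetes and Knative API types.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes/Knative status condition.
type Condition struct {
	Type, Status, Reason, Message string
}

// findReplicaFailure returns the deployment's ReplicaFailure condition
// when it is present with status "True".
func findReplicaFailure(conds []Condition) (Condition, bool) {
	for _, c := range conds {
		if c.Type == "ReplicaFailure" && c.Status == "True" {
			return c, true
		}
	}
	return Condition{}, false
}

// resourcesAvailable derives a hypothetical revision condition from the
// deployment conditions. With gateOnActive set, the failure is ignored
// unless the revision is active, which mimics the behaviour described
// in this comment. Without the gate, the failure is always propagated.
func resourcesAvailable(deploy []Condition, active, gateOnActive bool) Condition {
	unknown := Condition{Type: "ResourcesAvailable", Status: "Unknown", Reason: "Deploying"}
	if gateOnActive && !active {
		return unknown
	}
	if f, ok := findReplicaFailure(deploy); ok {
		return Condition{
			Type:    "ResourcesAvailable",
			Status:  "False",
			Reason:  f.Reason,
			Message: f.Message,
		}
	}
	return unknown
}

func main() {
	deploy := []Condition{{
		Type:    "ReplicaFailure",
		Status:  "True",
		Reason:  "FailedCreate",
		Message: "minimum cpu usage per Container is 10m, but request is 1m",
	}}
	// Gated and inactive: the failure stays hidden until a deadline expires.
	fmt.Printf("%+v\n", resourcesAvailable(deploy, false, true))
	// Ungated: the FailedCreate reason reaches the revision immediately.
	fmt.Printf("%+v\n", resourcesAvailable(deploy, false, false))
}
```

In the gated case the revision stays in an Unknown state, which is consistent with the ProgressDeadlineExceeded timeout reported earlier in the thread.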

I'm going to throw this into the current release (v0.19) for someone to pick up. Otherwise I'll pick it up for v0.20

dprotaso avatar Oct 22 '20 15:10 dprotaso

Potentially Related: https://github.com/knative/serving/issues/8540

dprotaso avatar Oct 22 '20 16:10 dprotaso

@dprotaso, I think a similar issue also happens when the namespace has a resource quota or the cluster's resources are exhausted.
Sometimes the 'resource quota breached' or 'insufficient resource' error is surfaced at the ksvc CR layer, so that the end user is aware of it. But sometimes it just gets stuck with revisionMissing, and the end user has to dig down to the ReplicaSet layer to find the failure reason.

Can you share what you think the root cause of the issue is? Maybe we can also contribute some effort to get this issue fixed. At the moment, I don't have any idea why it fails.

cdlliuy avatar Nov 05 '20 12:11 cdlliuy

@dprotaso, can you share a little of your idea for fixing this?

cdlliuy avatar Dec 07 '20 07:12 cdlliuy

/good-first-issue /area API

Given that there's a repro script and @dprotaso wants to get this into a release, I'm guessing that this is something that a pair of hands could pick up and manage towards a successful completion.

/triage accepted.

evankanderson avatar Mar 22 '21 01:03 evankanderson

@evankanderson: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

knative-prow-robot avatar Mar 22 '21 01:03 knative-prow-robot

@evankanderson: The label(s) triage/accepted. cannot be applied, because the repository doesn't have them.

knative-prow-robot avatar Mar 22 '21 01:03 knative-prow-robot

/triage accepted

evankanderson avatar Mar 22 '21 01:03 evankanderson

/assign @dprotaso

dprotaso avatar Apr 07 '21 17:04 dprotaso

/assign

gabo1208 avatar Sep 19 '23 16:09 gabo1208