
Knative reconcile fails mid-way, but the service eventually becomes `ready` during creation

Open ShiningTian opened this issue 4 years ago • 22 comments

In what area(s)?

/area API

What version of Knative?

0.18.x 0.19.x

Expected Behavior

What we observed

We observed multiple intermittent Knative reconcile failures: the controller hits a problem mid-reconcile and marks the Knative service status as "Failed". But since the controllers retry, the Knative service itself becomes "ready=true" in the end.

These intermittent errors cause a lot of confusion for end-users of the CLI tool with its default "wait" behavior, since the CLI reports a failure whenever "ready=false" is detected (as reported in client issue https://github.com/knative/client/issues/1023 as well).

Below are the mid-reconcile failure cases. Let's look at the details and then discuss possible solutions:

Intermittent error 1
  • Error Msg from kn:
>> kn service create <ksvc-name> --image <image-name> ....
Creating service '<ksvc-name>' in namespace ...:

  1.014s The Configuration is still working to reflect the latest desired specification.
 30.665s Revision creation failed with message: Post "https://172.21.0.1:443/apis/serving.knative.dev/v1/namespaces/48sz6i7t4tc/revisions": http2: server sent GOAWAY and closed the connection; LastStreamID=355, ErrCode=NO_ERROR, debug="".
 30.759s Configuration "<ksvc-name>" does not have any ready Revision.
 31.539s Configuration "<ksvc-name>" is waiting for a Revision to become ready.
Error: RevisionFailed: Revision creation failed with message: Post "https://172.21.0.1:443/apis/serving.knative.dev/v1/namespaces/48sz6i7t4tc/revisions": http2: server sent GOAWAY and closed the connection; LastStreamID=355, ErrCode=NO_ERROR, debug="".
Run 'kn --help' for usage
  • The key error detected in the reconcile is a failure when posting to the webhook:
'CreationFailed' Failed to create Revision: Internal error occurred: failed calling webhook \"webhook.serving.knative.dev\": Post https://webhook.knative-serving.svc:443/defaulting?timeout=10s: context deadline exceeded"
  • After kn reported the error, waiting a while shows the final status is ready:
kubectl get ksvc <ksvc-name> -o json
{
  "address": {
    "url": "http://<ksvc-name>.48sz6i7t4tc.svc.cluster.local"
  },
  "conditions": [
    {
      "lastTransitionTime": "2021-01-03T16:08:41Z",
      "status": "True",
      "type": "ConfigurationsReady"
    },
    {
      "lastTransitionTime": "2021-01-03T16:08:43Z",
      "status": "True",
      "type": "Ready"
    },
    {
      "lastTransitionTime": "2021-01-03T16:08:43Z",
      "status": "True",
      "type": "RoutesReady"
    }
  ],
  "latestCreatedRevisionName": "<ksvc-name>-vxgph-1",
  "latestReadyRevisionName": "<ksvc-name>-vxgph-1",
  "observedGeneration": 1,
  "traffic": [
    {
      "latestRevision": true,
      "percent": 100,
      "revisionName": "<ksvc-name>-vxgph-1"
    }
  ],
  "url": "http://<ksvc-name>.48sz6i7t4tc.test.xxx.xxx"
}
Intermittent error 2
  • Error Msg:
kn service create <ksvc-name> --image <image-name>  
Creating service '<ksvc-name>' in namespace ...:

  1.405s The Configuration is still working to reflect the latest desired specification.
  6.503s The Route is still working to reflect the latest desired specification.
 12.083s ...
 15.248s Configuration "<ksvc-name>" is waiting for a Revision to become ready.
  6.504s ...
  9.279s Ingress has not yet been reconciled.
 58.913s Ingress reconciliation failed
Error: ReconcileIngressFailed: Ingress reconciliation failed
Run 'kn --help' for usage
  • As in case 1, we observed the webhook connection failure in the logs as well:
failed to create VirtualService: Internal error occurred: failed to call webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: context deadline exceeded
  • After kn reported the above error, waiting a while shows the final status is ready.
Intermittent error 3
  • Error Msg:
kn service create <ksvc-name>  --image <image-name> 
Creating service '<ksvc-name>' in namespace ...:

  0.234s The Configuration is still working to reflect the latest desired specification.
  0.235s Revision creation failed with message: request declared a Content-Length of 1546 but only wrote 0 bytes.
  0.289s The Route is still working to reflect the latest desired specification.
  0.654s Configuration "<ksvc-name>" does not have any ready Revision.
  1.011s Configuration "<ksvc-name>" is waiting for a Revision to become ready.
Error: RevisionFailed: Revision creation failed with message: request declared a Content-Length of 1546 but only wrote 0 bytes.
Run 'kn --help' for usage
  • Below are the related logs. I can't tell the exact error, but it still looks like a network issue:
Jan 1 22:33:01 controller-64b86bcdd4-dm72t controller debug {"level":"debug","ts":"2021-01-01T14:33:00.116Z","logger":"controller.service-controller.knative.dev-serving-pkg-reconciler-service.Reconciler","caller":"service/reconciler.go:330","msg":"Updating status with:   v1.ServiceStatus{\n  \tStatus: v1.Status{\n- \t\tObservedGeneration: 0,\n+ \t\tObservedGeneration: 1,\n- \t\tConditions:         nil,\n+ \t\tConditions: v1.Conditions{\n+ \t\t\t{\n+ \t\t\t\tType:               \"ConfigurationsReady\",\n+ \t\t\t\tStatus:             \"Unknown\",\n+ \t\t\t\tLastTransitionTime: apis.VolatileTime{Inner: s\"2021-01-01 14:33:00.115982123 +0\"...},\n+ \t\t\t\tReason:             \"OutOfDate\",\n+ \t\t\t\tMessage:            \"The Configuration is still working to reflect the latest desired\"...,\n+ \t\t\t},\n+ \t\t\t{\n+ \t\t\t\tType:               \"Ready\",\n+ \t\t\t\tStatus:             \"Unknown\",\n+ \t\t\t\tLastTransitionTime: apis.VolatileTime{Inner: s\"2021-01-01 14:33:00.115982123 +0\"...},\n+ \t\t\t\tReason:             \"OutOfDate\",\n+ \t\t\t\tMessage:            \"The Configuration is still working to reflect the latest desired\"...,\n+ \t\t\t},\n+ \t\t\t{\n+ \t\t\t\tType:               \"RoutesReady\",\n+ \t\t\t\tStatus:             \"Unknown\",\n+ \t\t\t\tLastTransitionTime: apis.VolatileTime{Inner: s\"2021-01-01 14:33:00.115982123 +0\"...},\n+ \t\t\t},\n+ \t\t},\n  \t\tAnnotations: nil,\n  \t},\n  \tConfigurationStatusFields: {},\n  \tRouteStatusFields:         {},\n  }\n","commit":"0e65ea7","knative.dev/pod":"controller-64b86bcdd4-dm72t","knative.dev/controller":"service-controller","knative.dev/traceid":"fd00bc08-8f07-4793-af94-4813199901ff","knative.dev/key":"2y8lduk2d75/<ksvc-name>"}
Jan 1 22:33:01 controller-64b86bcdd4-dm72t controller info {"level":"info","ts":"2021-01-01T14:33:00.117Z","logger":"controller.service-controller.event-broadcaster","caller":"record/event.go:278","msg":"Event(v1.ObjectReference{Kind:\"Service\", Namespace:\"2y8lduk2d75\", Name:\"<ksvc-name>\", UID:\"433d081b-75d3-415b-94aa-ba277c7fe7e3\", APIVersion:\"serving.knative.dev/v1\", ResourceVersion:\"362985101\", FieldPath:\"\"}): type: 'Normal' reason: 'Created' Created Configuration \"<ksvc-name>\"","commit":"0e65ea7","knative.dev/pod":"controller-64b86bcdd4-dm72t","knative.dev/controller":"service-controller"}
Jan 1 22:33:01 controller-64b86bcdd4-dm72t controller debug {"level":"debug","ts":"2021-01-01T14:33:00.135Z","logger":"controller.configuration-controller.knative.dev-serving-pkg-reconciler-configuration.Reconciler","caller":"configuration/reconciler.go:330","msg":"Updating status with:   v1.ConfigurationStatus{\n  \tStatus: v1.Status{\n- \t\tObservedGeneration: 0,\n+ \t\tObservedGeneration: 1,\n- \t\tConditions:         nil,\n+ \t\tConditions: v1.Conditions{\n+ \t\t\t{\n+ \t\t\t\tType:               \"Ready\",\n+ \t\t\t\tStatus:             \"False\",\n+ \t\t\t\tLastTransitionTime: apis.VolatileTime{Inner: s\"2021-01-01 14:33:00.134904845 +0\"...},\n+ \t\t\t\tReason:             \"RevisionFailed\",\n+ \t\t\t\tMessage:            \"Revision creation failed with message: request declared a Conten\"...,\n+ \t\t\t},\n+ \t\t},\n  \t\tAnnotations: nil,\n  \t},\n  \tConfigurationStatusFields: {},\n  }\n","commit":"0e65ea7","knative.dev/pod":"controller-64b86bcdd4-dm72t","knative.dev/controller":"configuration-controller","knative.dev/traceid":"794d6e85-0b19-49e1-9957-41634f6276d5","knative.dev/key":"2y8lduk2d75/<ksvc-name>"}
Jan 1 22:33:01 controller-64b86bcdd4-dm72t controller info {"level":"info","ts":"2021-01-01T14:33:00.135Z","logger":"controller.configuration-controller.event-broadcaster","caller":"record/event.go:278","msg":"Event(v1.ObjectReference{Kind:\"Configuration\", Namespace:\"2y8lduk2d75\", Name:\"<ksvc-name>\", UID:\"9a9913af-fee1-42f4-9769-b57f2b29c384\", APIVersion:\"serving.knative.dev/v1\", ResourceVersion:\"362985103\", FieldPath:\"\"}): type: 'Warning' reason: 'CreationFailed' Failed to create Revision: request declared a Content-Length of 1546 but only wrote 0 bytes","commit":"0e65ea7","knative.dev/pod":"controller-64b86bcdd4-dm72t","knative.dev/controller":"configuration-controller"} 
Jan 1 22:33:01 controller-64b86bcdd4-dm72t controller debug {"level":"debug","ts":"2021-01-01T14:33:00.176Z","logger":"controller.service-controller.knative.dev-serving-pkg-reconciler-service.Reconciler","caller":"service/service.go:79","msg":"Configuration Conditions = v1.Conditions{apis.Condition{Type:\"Ready\", Status:\"False\", Severity:\"\", LastTransitionTime:apis.VolatileTime{Inner:v1.Time{Time:time.Time{wall:0x0, ext:63745108380, loc:(*time.Location)(0x3217da0)}}}, Reason:\"RevisionFailed\", Message:\"Revision creation failed with message: request declared a Content-Length of 1546 but only wrote 0 bytes.\"}}","commit":"0e65ea7","knative.dev/pod":"controller-64b86bcdd4-dm72t","knative.dev/controller":"service-controller","knative.dev/traceid":"6a78d5c7-ac6c-44a8-b48e-02e1170194a9","knative.dev/key":"2y8lduk2d75/<ksvc-name>"}
Jan 1 22:33:01 controller-64b86bcdd4-dm72t controller error {"level":"error","ts":"2021-01-01T14:33:00.178Z","logger":"controller.configuration-controller.knative.dev-serving-pkg-reconciler-configuration.Reconciler","caller":"configuration/reconciler.go:302","msg":"Returned an error","commit":"0e65ea7","knative.dev/pod":"controller-64b86bcdd4-dm72t","knative.dev/controller":"configuration-controller","knative.dev/traceid":"794d6e85-0b19-49e1-9957-41634f6276d5","knative.dev/key":"2y8lduk2d75/<ksvc-name>","targetMethod":"ReconcileKind","targetMethod":"ReconcileKind","error":"failed to create Revision: request declared a Content-Length of 1546 but only wrote 0 bytes","stacktrace":"knative.dev/serving/pkg/client/injection/reconciler/serving/v1/configuration.(*reconcilerImpl).Reconcile\n\tknative.dev/serving/pkg/client/injection/reconciler/serving/v1/configuration/reconciler.go:302\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/[email protected]/controller/controller.go:513\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/[email protected]/controller/controller.go:451"} 
  • After kn reported the above error, waiting a while shows the final status is ready.

Summary

All three of these errors are related to intermittent network connection failures. In a cloud environment, it is in the nature of things that the network is not reliable all the time.

Given that Knative retries to tolerate these network issues, can we come up with a better approach, on the serving side and in the client CLI, so that these transient mid-creation failures stay invisible to the end-user?

ShiningTian avatar Jan 11 '21 02:01 ShiningTian

@ShiningTian and I work on the same team, so we have similar observations of these Knative creation failures.

Besides the mid-reconcile failures due to network issues mentioned above, I also reported

  • https://github.com/knative/serving/issues/10076 for a pod creation failure due to scheduler preemption/eviction, where the ksvc also eventually becomes ready.

On the other hand, I also reported

  • https://github.com/knative/serving/issues/9857, which describes a lower-level pod creation failure: the limitrange is sometimes not picked up by the revision controller, so the revision keeps waiting for its pod to become ready until it fails with "Initial scale was never achieved".

I would like to propose working on a solution that considers all three issues together to improve the user experience.

Below is my proposal for the serving side changes:

On the serving side, given that reconciliation is always retried after a failure, I propose tolerating mid-reconcile errors and not exposing them in the ksvc CR's status.ready.

That means, for the network errors mentioned above and the pod scheduling issue in #10076, the ksvc controller would record the reconcile failure and surface the error on the Revision/Route CRs,
BUT the ksvc controller would not set the ksvc CR's status.ready to false.
As a result, the ksvc CR would only ever be True, or Unknown if the error cannot recover before the deadline is exceeded. If the final state is Unknown, the end-user can inspect the Route or Revision CRs to understand what exactly went wrong, which would cover the situation in #9857 as well.
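
A minimal sketch of that policy (the helpers below and their transient-error matching are hypothetical, not Knative's actual condition-manager code). The point is that the details would still be recorded on the Revision/Route CRs; only the top-level Ready signal stops flapping to False:

// Sketch only: hypothetical handling of the ksvc-level Ready condition.
package main

import (
	"fmt"
	"strings"
)

// isTransient guesses whether an error matches the intermittent
// failures from this issue (webhook timeouts, GOAWAY, half-written
// requests). The substrings are illustrative, not exhaustive.
func isTransient(err error) bool {
	for _, s := range []string{
		"context deadline exceeded",
		"server sent GOAWAY",
		"but only wrote 0 bytes",
	} {
		if strings.Contains(err.Error(), s) {
			return true
		}
	}
	return false
}

// markServiceReady applies the proposed policy: transient errors keep
// Ready at Unknown (the controller will retry), and only errors we
// believe are permanent flip Ready to False.
func markServiceReady(err error) string {
	if err == nil {
		return "True"
	}
	if isTransient(err) {
		return "Unknown"
	}
	return "False"
}

func main() {
	err := fmt.Errorf(`failed calling webhook "webhook.serving.knative.dev": context deadline exceeded`)
	fmt.Println("ksvc Ready =", markServiceReady(err)) // prints "Unknown", not "False"
}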

If this is not acceptable on the serving side, then we need to discuss options on the kn side. On the Knative client side, I reported

  • https://github.com/knative/client/issues/1023 to discuss whether it is possible to add an error-toleration window configuration, so that the Knative client waits another XX seconds (configurable per kn command options) whenever ksvc ready==false is detected (a rough sketch of such a wait loop follows below).
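
For illustration, here is a rough sketch of such a toleration window in a kn-style wait loop (waitForReady and its polling callback are hypothetical; kn's real wait machinery in knative/client is more involved):

// Sketch only: a wait loop that tolerates transient ready=false.
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollReady is a stand-in for fetching the ksvc Ready condition:
// "True", "False", or "Unknown".
type pollReady func() string

// waitForReady fails only if Ready stays False longer than the
// toleration window, instead of failing on the first False seen.
func waitForReady(poll pollReady, timeout, tolerate time.Duration) error {
	deadline := time.Now().Add(timeout)
	var firstFalse time.Time
	for time.Now().Before(deadline) {
		switch poll() {
		case "True":
			return nil
		case "False":
			if firstFalse.IsZero() {
				firstFalse = time.Now()
			}
			if time.Since(firstFalse) > tolerate {
				return errors.New("service failed and did not recover within the toleration window")
			}
		default: // "Unknown": reset the window and keep waiting
			firstFalse = time.Time{}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("timed out after %v", timeout)
}

func main() {
	start := time.Now()
	// Simulate a service that reports False for 5s before becoming True,
	// as in the intermittent cases above.
	poll := func() string {
		if time.Since(start) < 5*time.Second {
			return "False"
		}
		return "True"
	}
	fmt.Println(waitForReady(poll, 60*time.Second, 10*time.Second)) // <nil>
}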

cdlliuy avatar Jan 11 '21 02:01 cdlliuy

@julz @markusthoemmes @mattmoor ^^ :-)

cdlliuy avatar Jan 13 '21 05:01 cdlliuy

I think the assumption here is that it is an unrecoverable failure: https://github.com/knative/serving/blob/89cbb09674d51c229a555ad920585fd34d36c260/pkg/reconciler/configuration/configuration.go#L76

cc @dprotaso

mattmoor avatar Jan 14 '21 22:01 mattmoor

/area API

since the controllers retry, the Knative service itself becomes "ready=true" in the end.

There are currently some gaps in our failure modes, as you pointed out (https://github.com/knative/serving/issues/9857, https://github.com/knative/serving/issues/10076), where we don't report certain failures and revisions end up being marked Ready=True only when they are scaled to zero. I had been poking at this problem prior to the break.

The scenarios you highlighted are a bit different

'CreationFailed' Failed to create Revision: Internal error occurred: failed calling webhook "webhook.serving.knative.dev": Post https://webhook.knative-serving.svc:443/defaulting?timeout=10s: context deadline exceeded

failed to create VirtualService: Internal error occurred: failed to call webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: context deadline exceeded

Some of the errors you highlighted are failures of the K8s API server invoking webhooks. Was there anything abnormal with your Knative & Istio webhook deployments?

On the serving side, given that reconciliation is always retried after a failure, I propose tolerating mid-reconcile errors and not exposing them in the ksvc CR's status.ready. ... I think the assumption here is that it is an unrecoverable failure:

We don't annotate creation errors as permanent, but we should, depending on the status: i.e. mark them permanent if the API returns a 4xx, and potentially retry without marking the failure when a 5xx occurs.
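
A sketch of that classification (classifyCreateError is hypothetical; controller.NewPermanentError is knative.dev/pkg's real marker for errors the workqueue should stop retrying):

// Sketch only: classify creation errors by HTTP status class.
package reconciler

import (
	apierrs "k8s.io/apimachinery/pkg/api/errors"
	"knative.dev/pkg/controller"
)

// classifyCreateError marks client errors (4xx) as permanent so the
// condition can go Ready=False immediately, while server errors (5xx)
// and transport failures are returned as-is so the workqueue retries
// them with backoff instead of surfacing a transient failure.
func classifyCreateError(err error) error {
	if err == nil {
		return nil
	}
	if status, ok := err.(apierrs.APIStatus); ok {
		if code := status.Status().Code; code >= 400 && code < 500 {
			return controller.NewPermanentError(err)
		}
	}
	return err
}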

dprotaso avatar Jan 18 '21 21:01 dprotaso

@dprotaso, yes, I agree that there are retries, and I like the retries in the current design, which make the ksvc eventually become ready.

I just wonder whether we can avoid reporting the reconcile failure, since these are not permanent errors.

cdlliuy avatar Jan 19 '21 07:01 cdlliuy

I think the assumption here is that it is an unrecoverable failure:

https://github.com/knative/serving/blob/89cbb09674d51c229a555ad920585fd34d36c260/pkg/reconciler/configuration/configuration.go#L76

cc @dprotaso

It looks like the request here would be to support a few rounds of retries before failing this (?)

/good-first-issue /triage accepted

evankanderson avatar Mar 16 '21 01:03 evankanderson

@evankanderson: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

I think the assumption here is that it is an unrecoverable failure:

https://github.com/knative/serving/blob/89cbb09674d51c229a555ad920585fd34d36c260/pkg/reconciler/configuration/configuration.go#L76

cc @dprotaso

It looks like the request here would be to support a few rounds of retries before failing this (?)

/good-first-issue /triage accepted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow-robot avatar Mar 16 '21 01:03 knative-prow-robot

I would like to contribute to this. I will read through more of the code logic and discuss the solution before going further.

cdlliuy avatar Mar 16 '21 08:03 cdlliuy

When you're ready, you can /assign to assign the issue to yourself to let people know you're working on it.

evankanderson avatar Mar 19 '21 06:03 evankanderson

Hello, we are experiencing the same kind of issue on Google Cloud Run. Do you have a workaround until a fix is found? I would be glad to help but I'm not sure where to start or what to do :/ Thank you for your help 🙏🏻

benja-M-1 avatar Jun 03 '21 13:06 benja-M-1

@cdlliuy Hi! are you planning to work on this?

skonto avatar Jul 28 '21 14:07 skonto

Hi @ShiningTian @evankanderson! Can I claim this issue if no one else is working on it?

yuvaldolev avatar Oct 18 '21 06:10 yuvaldolev

The /assign command on its own line in a comment will assign the bug to the person making the comment. (You can also /assign <user> if you want.)

This bug looks open for you to tackle. 😉

evankanderson avatar Oct 22 '21 14:10 evankanderson

/assign

yuvaldolev avatar Oct 22 '21 14:10 yuvaldolev

Thanks @evankanderson!

yuvaldolev avatar Oct 22 '21 14:10 yuvaldolev

@yuvaldolev gentle ping, any progress on this?

skonto avatar Nov 16 '21 10:11 skonto

@yuvaldolev need any help to push this one forward?

skonto avatar Nov 22 '21 13:11 skonto

/unassign @yuvaldolev due to inactivity

dprotaso avatar Dec 06 '22 15:12 dprotaso

/assign

vishal-chdhry avatar Jan 26 '23 04:01 vishal-chdhry

I can't figure out what I have to do, so I am unassigning myself

vishal-chdhry avatar Feb 08 '23 04:02 vishal-chdhry

/unassign

vishal-chdhry avatar Feb 08 '23 04:02 vishal-chdhry

/assign

maryfrances01 avatar Mar 03 '23 22:03 maryfrances01