Knative reconcile fails mid-way during service creation, but the service eventually becomes `ready`
In what area(s)?
/area API
What version of Knative?
0.18.x 0.19.x
Expected Behavior
What we observed
We observed multiple intermittent Knative reconcile failures: the controller hits a problem mid-reconcile and marks the Knative Service status as "Failed". Because the controllers retry, the Knative Service itself still becomes "Ready=True" in the end.
However, these intermittent errors are very confusing for end users of the CLI tool with the default "wait" behaviour, since the CLI reports a failure as soon as "Ready=False" is detected (as reported in the CLI issue https://github.com/knative/client/issues/1023 as well).
Below are the mid-reconcile failures we hit. Let's look at the details and then discuss possible solutions:
Intermittent error 1
- Error Msg from `kn`:
>> kn service create <ksvc-name> --image <image-name> ....
Creating service '<ksvc-name>' in namespace ...:
1.014s The Configuration is still working to reflect the latest desired specification.
30.665s Revision creation failed with message: Post "https://172.21.0.1:443/apis/serving.knative.dev/v1/namespaces/48sz6i7t4tc/revisions": http2: server sent GOAWAY and closed the connection; LastStreamID=355, ErrCode=NO_ERROR, debug="".
30.759s Configuration "<ksvc-name>" does not have any ready Revision.
31.539s Configuration "<ksvc-name>" is waiting for a Revision to become ready.
Error: RevisionFailed: Revision creation failed with message: Post "https://172.21.0.1:443/apis/serving.knative.dev/v1/namespaces/48sz6i7t4tc/revisions": http2: server sent GOAWAY and closed the connection; LastStreamID=355, ErrCode=NO_ERROR, debug="".
Run 'kn --help' for usage
- The key error detected during reconcile is a failure when posting to the webhook:
'CreationFailed' Failed to create Revision: Internal error occurred: failed calling webhook \"webhook.serving.knative.dev\": Post https://webhook.knative-serving.svc:443/defaulting?timeout=10s: context deadline exceeded"
- After we got the error from `kn`, if we wait for a while, the final status is ready:
kubectl get ksvc <ksvc-name> -o json
{
"address": {
"url": "http://<ksvc-name>.48sz6i7t4tc.svc.cluster.local"
},
"conditions": [
{
"lastTransitionTime": "2021-01-03T16:08:41Z",
"status": "True",
"type": "ConfigurationsReady"
},
{
"lastTransitionTime": "2021-01-03T16:08:43Z",
"status": "True",
"type": "Ready"
},
{
"lastTransitionTime": "2021-01-03T16:08:43Z",
"status": "True",
"type": "RoutesReady"
}
],
"latestCreatedRevisionName": "<ksvc-name>-vxgph-1",
"latestReadyRevisionName": "<ksvc-name>-vxgph-1",
"observedGeneration": 1,
"traffic": [
{
"latestRevision": true,
"percent": 100,
"revisionName": "<ksvc-name>-vxgph-1"
}
],
"url": "http://<ksvc-name>.48sz6i7t4tc.test.xxx.xxx"
}
Intermittent error 2
- Error Msg:
kn service create <ksvc-name> --image <image-name>
Creating service '<ksvc-name>' in namespace ...:
1.405s The Configuration is still working to reflect the latest desired specification.
6.503s The Route is still working to reflect the latest desired specification.
12.083s ...
15.248s Configuration "<ksvc-name>" is waiting for a Revision to become ready.
6.504s ...
9.279s Ingress has not yet been reconciled.
58.913s Ingress reconciliation failed
Error: ReconcileIngressFailed: Ingress reconciliation failed
Run 'kn --help' for usage
- As in case 1, we observed the webhook connection failure in the logs as well:
failed to create VirtualService: Internal error occurred: failed to call webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: context deadline exceeded
- After we got the above error from `kn`, if we wait for a while, the final status is ready.
Intermittent error 3
- Error Msg:
kn service create <ksvc-name> --image <image-name>
Creating service '<ksvc-name>' in namespace ...:
0.234s The Configuration is still working to reflect the latest desired specification.
0.235s Revision creation failed with message: request declared a Content-Length of 1546 but only wrote 0 bytes.
0.289s The Route is still working to reflect the latest desired specification.
0.654s Configuration "<ksvc-name>" does not have any ready Revision.
1.011s Configuration "<ksvc-name>" is waiting for a Revision to become ready.
Error: RevisionFailed: Revision creation failed with message: request declared a Content-Length of 1546 but only wrote 0 bytes.
Run 'kn --help' for usage
- Below are the related logs. I can't tell the exact error, but it still looks like a network issue:
Jan 1 22:33:01 controller-64b86bcdd4-dm72t controller debug {"level":"debug","ts":"2021-01-01T14:33:00.116Z","logger":"controller.service-controller.knative.dev-serving-pkg-reconciler-service.Reconciler","caller":"service/reconciler.go:330","msg":"Updating status with: v1.ServiceStatus{\n \tStatus: v1.Status{\n- \t\tObservedGeneration: 0,\n+ \t\tObservedGeneration: 1,\n- \t\tConditions: nil,\n+ \t\tConditions: v1.Conditions{\n+ \t\t\t{\n+ \t\t\t\tType: \"ConfigurationsReady\",\n+ \t\t\t\tStatus: \"Unknown\",\n+ \t\t\t\tLastTransitionTime: apis.VolatileTime{Inner: s\"2021-01-01 14:33:00.115982123 +0\"...},\n+ \t\t\t\tReason: \"OutOfDate\",\n+ \t\t\t\tMessage: \"The Configuration is still working to reflect the latest desired\"...,\n+ \t\t\t},\n+ \t\t\t{\n+ \t\t\t\tType: \"Ready\",\n+ \t\t\t\tStatus: \"Unknown\",\n+ \t\t\t\tLastTransitionTime: apis.VolatileTime{Inner: s\"2021-01-01 14:33:00.115982123 +0\"...},\n+ \t\t\t\tReason: \"OutOfDate\",\n+ \t\t\t\tMessage: \"The Configuration is still working to reflect the latest desired\"...,\n+ \t\t\t},\n+ \t\t\t{\n+ \t\t\t\tType: \"RoutesReady\",\n+ \t\t\t\tStatus: \"Unknown\",\n+ \t\t\t\tLastTransitionTime: apis.VolatileTime{Inner: s\"2021-01-01 14:33:00.115982123 +0\"...},\n+ \t\t\t},\n+ \t\t},\n \t\tAnnotations: nil,\n \t},\n \tConfigurationStatusFields: {},\n \tRouteStatusFields: {},\n }\n","commit":"0e65ea7","knative.dev/pod":"controller-64b86bcdd4-dm72t","knative.dev/controller":"service-controller","knative.dev/traceid":"fd00bc08-8f07-4793-af94-4813199901ff","knative.dev/key":"2y8lduk2d75/<ksvc-name>"}
Jan 1 22:33:01 controller-64b86bcdd4-dm72t controller info {"level":"info","ts":"2021-01-01T14:33:00.117Z","logger":"controller.service-controller.event-broadcaster","caller":"record/event.go:278","msg":"Event(v1.ObjectReference{Kind:\"Service\", Namespace:\"2y8lduk2d75\", Name:\"<ksvc-name>\", UID:\"433d081b-75d3-415b-94aa-ba277c7fe7e3\", APIVersion:\"serving.knative.dev/v1\", ResourceVersion:\"362985101\", FieldPath:\"\"}): type: 'Normal' reason: 'Created' Created Configuration \"<ksvc-name>\"","commit":"0e65ea7","knative.dev/pod":"controller-64b86bcdd4-dm72t","knative.dev/controller":"service-controller"}
Jan 1 22:33:01 controller-64b86bcdd4-dm72t controller debug {"level":"debug","ts":"2021-01-01T14:33:00.135Z","logger":"controller.configuration-controller.knative.dev-serving-pkg-reconciler-configuration.Reconciler","caller":"configuration/reconciler.go:330","msg":"Updating status with: v1.ConfigurationStatus{\n \tStatus: v1.Status{\n- \t\tObservedGeneration: 0,\n+ \t\tObservedGeneration: 1,\n- \t\tConditions: nil,\n+ \t\tConditions: v1.Conditions{\n+ \t\t\t{\n+ \t\t\t\tType: \"Ready\",\n+ \t\t\t\tStatus: \"False\",\n+ \t\t\t\tLastTransitionTime: apis.VolatileTime{Inner: s\"2021-01-01 14:33:00.134904845 +0\"...},\n+ \t\t\t\tReason: \"RevisionFailed\",\n+ \t\t\t\tMessage: \"Revision creation failed with message: request declared a Conten\"...,\n+ \t\t\t},\n+ \t\t},\n \t\tAnnotations: nil,\n \t},\n \tConfigurationStatusFields: {},\n }\n","commit":"0e65ea7","knative.dev/pod":"controller-64b86bcdd4-dm72t","knative.dev/controller":"configuration-controller","knative.dev/traceid":"794d6e85-0b19-49e1-9957-41634f6276d5","knative.dev/key":"2y8lduk2d75/<ksvc-name>"}
Jan 1 22:33:01 controller-64b86bcdd4-dm72t controller info {"level":"info","ts":"2021-01-01T14:33:00.135Z","logger":"controller.configuration-controller.event-broadcaster","caller":"record/event.go:278","msg":"Event(v1.ObjectReference{Kind:\"Configuration\", Namespace:\"2y8lduk2d75\", Name:\"<ksvc-name>\", UID:\"9a9913af-fee1-42f4-9769-b57f2b29c384\", APIVersion:\"serving.knative.dev/v1\", ResourceVersion:\"362985103\", FieldPath:\"\"}): type: 'Warning' reason: 'CreationFailed' Failed to create Revision: request declared a Content-Length of 1546 but only wrote 0 bytes","commit":"0e65ea7","knative.dev/pod":"controller-64b86bcdd4-dm72t","knative.dev/controller":"configuration-controller"}
Jan 1 22:33:01 controller-64b86bcdd4-dm72t controller debug {"level":"debug","ts":"2021-01-01T14:33:00.176Z","logger":"controller.service-controller.knative.dev-serving-pkg-reconciler-service.Reconciler","caller":"service/service.go:79","msg":"Configuration Conditions = v1.Conditions{apis.Condition{Type:\"Ready\", Status:\"False\", Severity:\"\", LastTransitionTime:apis.VolatileTime{Inner:v1.Time{Time:time.Time{wall:0x0, ext:63745108380, loc:(*time.Location)(0x3217da0)}}}, Reason:\"RevisionFailed\", Message:\"Revision creation failed with message: request declared a Content-Length of 1546 but only wrote 0 bytes.\"}}","commit":"0e65ea7","knative.dev/pod":"controller-64b86bcdd4-dm72t","knative.dev/controller":"service-controller","knative.dev/traceid":"6a78d5c7-ac6c-44a8-b48e-02e1170194a9","knative.dev/key":"2y8lduk2d75/<ksvc-name>"}
Jan 1 22:33:01 controller-64b86bcdd4-dm72t controller error {"level":"error","ts":"2021-01-01T14:33:00.178Z","logger":"controller.configuration-controller.knative.dev-serving-pkg-reconciler-configuration.Reconciler","caller":"configuration/reconciler.go:302","msg":"Returned an error","commit":"0e65ea7","knative.dev/pod":"controller-64b86bcdd4-dm72t","knative.dev/controller":"configuration-controller","knative.dev/traceid":"794d6e85-0b19-49e1-9957-41634f6276d5","knative.dev/key":"2y8lduk2d75/<ksvc-name>","targetMethod":"ReconcileKind","targetMethod":"ReconcileKind","error":"failed to create Revision: request declared a Content-Length of 1546 but only wrote 0 bytes","stacktrace":"knative.dev/serving/pkg/client/injection/reconciler/serving/v1/configuration.(*reconcilerImpl).Reconcile\n\tknative.dev/serving/pkg/client/injection/reconciler/serving/v1/configuration/reconciler.go:302\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/[email protected]/controller/controller.go:513\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/[email protected]/controller/controller.go:451"}
- After we got the above error from `kn`, if we wait for a while, the final status is ready.
Summary
All three of these errors are related to intermittent network connection failures. In a cloud environment it is in the nature of things that the network is not reliable all the time.
Given that Knative retries to tolerate these network issues, can we come up with a better approach, on both the serving side and the client CLI, so that these mid-reconcile creation failures stay invisible to the end user?
@ShiningTian and I work on the same team, so we have similar observations on the Knative creation failures.
Besides the in-middle reconcile failures due to network issues mentioned above, I also reported
- https://github.com/knative/serving/issues/10076 for a pod creation failure due to scheduler preemption/eviction, where the ksvc finally gets ready as well.
On the other hand, I also reported
- https://github.com/knative/serving/issues/9857
which talks about a lower-level pod creation failure where the `limitrange` is sometimes not picked up by the `revision` controller, so it keeps waiting for the pod to become ready until "Initial scale was never achieved".
I would like to propose working on a solution that considers all three of these issues together to improve the user experience.
Below is my proposal for the serving-side changes:
On the serving side, given that reconciliation is always retried when a failure happens, I propose tolerating the in-middle errors and not exposing them via the ksvc CR's Ready condition.
That means that for the network errors mentioned above and the pod scheduling issue in #10076, the ksvc controller would still collect the reconcile failure and update the Revision/Route CRs with the error,
BUT the ksvc controller would NOT set the ksvc CR's Ready condition to False.
As a result, the ksvc CR's Ready condition would only ever be True, or remain Unknown if the error cannot recover even after the deadline is exceeded (see the sketch below).
If the final state is Unknown, the end user can check the Route or Revision CRs to understand what exactly went wrong, which would cover the situation in #9857 as well.
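To make the idea concrete, here is a minimal sketch of the proposed propagation rule, assuming a hypothetical allow-list of transient failure reasons (the real controllers use condition sets from knative.dev/pkg/apis; the helper and the reason list below are only illustrative): when an underlying Configuration or Route reports Ready=False for a reason we know the controller retries, the Service would expose Unknown until a deadline is exceeded, instead of flipping to False.

```go
package proposal

import (
	corev1 "k8s.io/api/core/v1"
	"knative.dev/pkg/apis"
)

// transientReasons is a hypothetical allow-list of failure reasons that the
// controller is known to retry (webhook timeouts, API server disruptions, ...).
var transientReasons = map[string]bool{
	"RevisionFailed":         true, // e.g. webhook "context deadline exceeded"
	"ReconcileIngressFailed": true, // e.g. istiod validation webhook unreachable
}

// effectiveStatus maps a child resource's Ready condition onto the status the
// ksvc would expose: transient failures stay Unknown until the deadline is
// exceeded; everything else is propagated unchanged.
func effectiveStatus(cond *apis.Condition, deadlineExceeded bool) corev1.ConditionStatus {
	if cond == nil {
		return corev1.ConditionUnknown
	}
	if cond.Status == corev1.ConditionFalse && transientReasons[cond.Reason] && !deadlineExceeded {
		return corev1.ConditionUnknown
	}
	return cond.Status
}
```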
If this is not acceptable on the serving side, then we need to discuss it on the kn side.
Also, on the Knative client side, I reported
- https://github.com/knative/client/issues/1023
to discuss whether it is possible to add an error-toleration window configuration, so that the Knative client waits another XX seconds (via a kn command option) whenever `ksvc ready==false` is detected (a sketch of this idea follows below).
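For illustration, here is a minimal sketch of that client-side idea, assuming a hypothetical poll function that reads the ksvc's Ready condition (the real kn wait machinery lives in knative/client and differs in detail): instead of failing on the first Ready=False, the client keeps polling for an extra error window to see whether the controller's retries recover.

```go
package waitsketch

import (
	"context"
	"fmt"
	"time"
)

// readyState is what a hypothetical poll helper would return after reading
// the ksvc's Ready condition.
type readyState struct {
	Ready   bool   // Ready condition is True
	Failed  bool   // Ready condition is False
	Message string // last failure message, if any
}

// waitWithErrorWindow reports an error only if the service has been failing
// for longer than errorWindow; a Ready=False that recovers within the window
// is tolerated silently.
func waitWithErrorWindow(ctx context.Context, poll func(context.Context) (readyState, error), errorWindow time.Duration) error {
	var firstFailure time.Time
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			st, err := poll(ctx)
			if err != nil {
				return err
			}
			switch {
			case st.Ready:
				return nil // ready, possibly after a tolerated transient failure
			case st.Failed && firstFailure.IsZero():
				firstFailure = time.Now()
			case st.Failed && time.Since(firstFailure) > errorWindow:
				return fmt.Errorf("service not ready: %s", st.Message)
			}
		}
	}
}
```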
@julz @markusthoemmes @mattmoor ^^ :-)
I think the assumption here is that it is an unrecoverable failure: https://github.com/knative/serving/blob/89cbb09674d51c229a555ad920585fd34d36c260/pkg/reconciler/configuration/configuration.go#L76
cc @dprotaso
/area API
Because the controllers retry, the Knative Service itself still becomes "Ready=True" in the end.
There are currently some gaps in our failure modes, as you pointed out (https://github.com/knative/serving/issues/9857, https://github.com/knative/serving/issues/10076), where we don't report certain failures and revisions end up being marked "Ready=True" only when they are scaled to 0. I'd been poking at this problem prior to the break.
The scenarios you highlighted are a bit different:
'CreationFailed' Failed to create Revision: Internal error occurred: failed calling webhook "webhook.serving.knative.dev": Post https://webhook.knative-serving.svc:443/defaulting?timeout=10s: context deadline exceeded" failed to create VirtualService: Internal error occurred: failed to call webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: context deadline exceeded
Some of the errors you highlighted are failures of the K8s API server invoking webhooks. Was there anything abnormal with your Knative and Istio webhook deployments?
On the serving side, given that reconciliation is always retried when a failure happens, I propose tolerating the in-middle errors and not exposing them via the ksvc CR's Ready condition. ... I think the assumption here is that it is an unrecoverable failure:
We don't annotate creation errors as permanent, but we should, depending on the status code - i.e. mark the error as permanent if the API returns a 4xx, and potentially keep retrying without marking the failure when a 5xx occurs.
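A minimal sketch of that classification, assuming knative.dev/pkg/controller.NewPermanentError and the APIStatus error interface from k8s.io/apimachinery (where exactly this would hook into the Configuration reconciler is not decided here): 4xx responses become permanent errors so the failure is surfaced immediately, while 5xx and transport-level errors are returned unchanged and keep the default requeue-with-backoff behaviour.

```go
package classify

import (
	"errors"
	"net/http"

	apierrs "k8s.io/apimachinery/pkg/api/errors"
	"knative.dev/pkg/controller"
)

// classifyCreateError wraps 4xx API errors as permanent so the reconciler
// stops retrying and surfaces the failure; everything else (5xx, webhook
// timeouts, GOAWAY, connection resets) is returned as-is so the workqueue
// retries it with backoff without reporting a terminal failure.
func classifyCreateError(err error) error {
	if err == nil {
		return nil
	}
	var statusErr apierrs.APIStatus
	if errors.As(err, &statusErr) {
		code := int(statusErr.Status().Code)
		if code >= http.StatusBadRequest && code < http.StatusInternalServerError {
			// The request itself is invalid; retrying cannot succeed.
			return controller.NewPermanentError(err)
		}
	}
	// Likely transient: let the normal requeue/backoff handle it.
	return err
}
```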
@dprotaso, yes, I agree that there are retries, and I like the retries in the current design, which make the ksvc eventually become ready.
I just wonder whether we can avoid reporting these reconcile failures on the ksvc, since they are not permanent errors.
I think the assumption here is that it is an unrecoverable failure:
https://github.com/knative/serving/blob/89cbb09674d51c229a555ad920585fd34d36c260/pkg/reconciler/configuration/configuration.go#L76
cc @dprotaso
It looks like the request here would be to support a few rounds of retries before failing this (?)
/good-first-issue /triage accepted
@evankanderson: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.
In response to this:
I think the assumption here is that it is an unrecoverable failure:
https://github.com/knative/serving/blob/89cbb09674d51c229a555ad920585fd34d36c260/pkg/reconciler/configuration/configuration.go#L76
cc @dprotaso
It looks like the request here would be to support a few rounds of retries before failing this (?)
/good-first-issue /triage accepted
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I would like to contribute to this. I will read through more of the code logic and discuss the solution before going further.
When you're ready, you can /assign to assign the issue to yourself to let people know you're working on it.
Hello, we are experiencing the same kind of issue on Google Cloud Run. Do you have a workaround until a fix is found? I would be glad to help but I am not sure where to start nor what to do :/ Thank you for your help 🙏🏻
@cdlliuy Hi! are you planning to work on this?
Hi @ShiningTian @evankanderson! Can I claim this issue if no one else is working on it?
The /assign command on its own line in a comment will assign the bug to the person making the comment. (You can also /assign <user> if you want.)
This bug looks open for you to tackle. 😉
/assign
Thanks @evankanderson!
@yuvaldolev gentle ping, any progress on this?
@yuvaldolev need any help to push this one forward?
/unassign @yuvaldolev due to inactivity
/assign
I can't figure out what I have to do, so I am unassigning myself
/unassign
/assign