Consistent and Correct Conditions
Not all of our conditions follow our condition conventions.
We also don't propagate conditions in a consistent (or even correct) way: https://github.com/knative/serving/issues/4937
I'm opening this to track some work around wrangling our conditions.
- Resources that reconcile child resources should have a condition for each child that reflects the child's top-level "happy state" (TLHS) condition.
- This makes it easy to propagate conditions from our children.
- A resource's own TLHS condition should depend on those child conditions via ConditionSets.
- This makes it easy to propagate conditions to our parent.
- We should always update a resource's observed generation to indicate that we've attempted to reconcile its spec.
- This allows our parent to know if our status is up to date.
- If the observed generation of a child resource does not match its current generation, we should mark it as NotReconciled, since its state does not yet reflect its spec.
- This ensures we don’t reflect stale child statuses.
- This might not make sense in some cases (e.g. a Revision reconciling a Deployment would flip back and forth between Ready and Unknown, given that the KPA will be constantly bumping the Deployment's Generation).
- If there is a non-terminal error during reconciliation, Ready should become Unknown.
- This ensures we aren’t prematurely Ready and that we aren’t swallowing errors.
- If the observed generation does match, and we did not encounter an error, we should propagate the child’s TLHS condition into our own.
- This is the happy path.
In general, everything should look something like this:

Here's a (slightly simplified) Service as an example. Service is pretty close to correct, but it's not unconditionally bumping ObservedGeneration or updating its status with reconciliation errors (bolded things aren't correct yet):

We'll also need to rework the Revision and KPA and SKS conditions to fit better into this world, since they're all over the place.
TODO
- [ ] Consistently update ObservedGeneration.
- [ ] Surface reconciliation errors in conditions.
- [ ] Modify conditions to be in terms of child resources.
- [ ] Consistently propagate child conditions.
- [ ] Attempt to refactor/code-gen this stuff so we don't regress.
- [ ] Update sample-controller with parent->child best practices.
/area API
Thanks for the writeup @jonjohnsonjr. That makes a lot of sense to me!
@JRBANCEL pointed out some more inconsistencies in our lifecycle Mark* methods:
$ rg -e 'func .* Mark[^\(]+' -o --no-filename | sort | uniq
func (cs *CertificateStatus) MarkNotReady
func (cs *CertificateStatus) MarkReady
func (cs *CertificateStatus) MarkResourceNotOwned
func (cs *CertificateStatus) MarkUnknown
func (cs *ConfigurationStatus) MarkLatestCreatedFailed
func (cs *ConfigurationStatus) MarkLatestReadyDeleted
func (cs *ConfigurationStatus) MarkResourceNotConvertible
func (cs *ConfigurationStatus) MarkRevisionCreationFailed
func (is *IngressStatus) MarkLoadBalancerPending
func (is *IngressStatus) MarkLoadBalancerReady
func (is *IngressStatus) MarkNetworkConfigured
func (is *IngressStatus) MarkResourceNotOwned
func (pas *PodAutoscalerStatus) MarkActivating
func (pas *PodAutoscalerStatus) MarkActive
func (pas *PodAutoscalerStatus) MarkInactive
func (pas *PodAutoscalerStatus) MarkResourceFailedCreation
func (pas *PodAutoscalerStatus) MarkResourceNotOwned
func (rs *RevisionStatus) MarkActivating
func (rs *RevisionStatus) MarkActive
func (rs *RevisionStatus) MarkContainerExiting
func (rs *RevisionStatus) MarkContainerHealthy
func (rs *RevisionStatus) MarkContainerMissing
func (rs *RevisionStatus) MarkDeploying
func (rs *RevisionStatus) MarkInactive
func (rs *RevisionStatus) MarkProgressDeadlineExceeded
func (rs *RevisionStatus) MarkResourceNotConvertible
func (rs *RevisionStatus) MarkResourceNotOwned
func (rs *RevisionStatus) MarkResourcesAvailable
func (rs *RevisionStatus) MarkResourcesUnavailable
func (rs *RouteStatus) MarkCertificateNotOwned
func (rs *RouteStatus) MarkCertificateNotReady
func (rs *RouteStatus) MarkCertificateProvisionFailed
func (rs *RouteStatus) MarkCertificateReady
func (rs *RouteStatus) MarkConfigurationFailed
func (rs *RouteStatus) MarkConfigurationNotReady
func (rs *RouteStatus) MarkIngressNotConfigured
func (rs *RouteStatus) MarkMissingTrafficTarget
func (rs *RouteStatus) MarkResourceNotConvertible
func (rs *RouteStatus) MarkRevisionFailed
func (rs *RouteStatus) MarkRevisionNotReady
func (rs *RouteStatus) MarkServiceNotOwned
func (rs *RouteStatus) MarkTrafficAssigned
func (rs *RouteStatus) MarkUnknownTrafficError
func (ss *ServiceStatus) MarkConfigurationNotOwned
func (ss *ServiceStatus) MarkConfigurationNotReconciled
func (ss *ServiceStatus) MarkResourceNotConvertible
func (ss *ServiceStatus) MarkRevisionNameTaken
func (ss *ServiceStatus) MarkRouteNotOwned
func (ss *ServiceStatus) MarkRouteNotReconciled
func (ss *ServiceStatus) MarkRouteNotYetReady
func (sss *ServerlessServiceStatus) MarkActivatorEndpointsPopulated
func (sss *ServerlessServiceStatus) MarkActivatorEndpointsRemoved
func (sss *ServerlessServiceStatus) MarkEndpointsNotOwned
func (sss *ServerlessServiceStatus) MarkEndpointsNotReady
func (sss *ServerlessServiceStatus) MarkEndpointsReady
The naming is inconsistent: sometimes NotReady means Unknown, other times NotReady means failed. It's not clear from the method name which state you're actually setting.
For simple methods that just set the state I'd propose following this pattern, since it's most prevalent:
MarkFooReady -> Ready=True
MarkFooNotReady -> Ready=Unknown
MarkFooFailed -> Ready=False
There are a lot of convenience methods to make setting certain statuses more legible, we should probably leave those alone for now.
> MarkFooReady -> Ready=True
> MarkFooNotReady -> Ready=Unknown
> MarkFooFailed -> Ready=False
We also need MarkFooNotOwned for the case where the Foo resource is not owned by the parent resource, and MarkFooNotReconciled for the case where generation != observedGeneration. We probably want to add these two methods to our standard.
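The proposed naming convention could look like this in practice. `FooStatus` and these method bodies are hypothetical (the real implementations would go through a ConditionSet's manager rather than setting fields directly); the point is that each method name maps to exactly one condition status:

```go
package main

import "fmt"

// FooStatus is a made-up status type illustrating the proposed Mark*
// naming convention: the method name alone tells you which condition
// status you end up with.
type FooStatus struct {
	Status string // "True" | "False" | "Unknown"
	Reason string
}

// MarkFooReady -> Ready=True.
func (fs *FooStatus) MarkFooReady() { fs.Status, fs.Reason = "True", "" }

// MarkFooNotReady -> Ready=Unknown (still progressing, not a failure).
func (fs *FooStatus) MarkFooNotReady(reason string) { fs.Status, fs.Reason = "Unknown", reason }

// MarkFooFailed -> Ready=False (terminal failure).
func (fs *FooStatus) MarkFooFailed(reason string) { fs.Status, fs.Reason = "False", reason }

// MarkFooNotOwned: Foo exists but isn't owned by us -- terminal, so False.
func (fs *FooStatus) MarkFooNotOwned() { fs.Status, fs.Reason = "False", "NotOwned" }

// MarkFooNotReconciled: generation != observedGeneration, so the child's
// status is stale -- we don't know yet, hence Unknown.
func (fs *FooStatus) MarkFooNotReconciled() { fs.Status, fs.Reason = "Unknown", "NotReconciled" }

func main() {
	var fs FooStatus
	fs.MarkFooNotReady("Deploying")
	fmt.Println(fs.Status, fs.Reason) // Unknown Deploying
}
```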
@dgerd opened an issue and a proposal (#5780) on how to deal with condition types and marking resource status. PR #5804 already implements that proposal for revision lifecycle and I plan to continue the refactoring for other resource status markers as well.
also this https://github.com/knative/serving/pull/5804#issuecomment-542805906 for cross referencing.
As part of #4937 I took a look at our current reconciler loops and put together a few state machines to help visualize the potential condition sets you can end up with. A few thoughts about them:
- We should not have Terminal states that exit without updating status. These are all opportunities to provide additional information/context back to the client and I can't think of a reason why we wouldn't want to. The reconciler knows something went wrong, but we are not sharing that information back.
- As an aside, it is very easy to return an error in the reconcilers and end up with more of these. This may be an opportunity to extend our controller framework to ensure that an error/success always gets propagated back.
- As mentioned in the issue referenced above, ObservedGeneration is inconsistent across resources. The only consistent guidance I have found from reading K8s docs and issues is that controllers should report the most recent generation seen in status when they post status. From our API specification we state: "The latest metadata.generation that the reconciler has observed. If observedGeneration is updated, conditions MUST be updated with current status in the same transaction." If we make all reconciler exits post status, this would allow us to just bump observedGeneration from the controller framework at the beginning of the reconcile loop. The current state of the world is:
  - KService - Incremented at the end
  - Configuration - Incremented after v1alpha1 conversion, but before everything else
  - Route - Incremented on traffic errors, or at the end
- We use "Unknown" for (at least) four things. We should stop using it for the fourth:
  1. Initialization
  2. The observedGeneration of a child resource does not match (i.e. the child is not yet reconciled)
  3. We did not get around to looking at this resource (i.e. we exited early because a previous dependency failed or is not yet ready)
  4. An unknown error was returned for the object and we are not sure how to proceed
- We are mostly consistent about this, but we should stop posting a Reason and Message for "True" conditions, and we should always post a Reason and Message for False/Unknown conditions. We have wording in the API specification about always including a message for failures, but we have no wording about reasons.
- We should be more consistent about wording in "Reason". For example, we do both "FooNotOwned" and "NotOwned" depending on the resource and condition. I would be in favor of removing the stutter, as the condition name already reflects the resource type.
I haven't put together a Revision reconciler state machine yet, but here are the Service, Route, and Configuration ones.
KService Reconciler

Route Reconciler

Configuration Reconciler

Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.
Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.
/lifecycle stale
Stale issues rot after 30 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.
Rotten issues close after an additional 30 days of inactivity.
If this issue is safe to close now please do so by adding the comment /close.
Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.
/lifecycle rotten
Rotten issues close after 30 days of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.
Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.
/close
@knative-housekeeping-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle rotten /reopen /lifecycle-frozen
@vagababov: Reopened this issue.
/cc @dprotaso
related doc: https://docs.google.com/document/d/1tDPP43xSMIFdnw0D8msAXY9P6fEG5x9CG3kRhjpJh-4/edit#heading=h.dxlthq7ppqhh
/cc @whaught Weston, is this covered by the stuff you've been doing?
I've dealt with some of this - ObservedGeneration is now automatically bumped (even on failure) and we reset to Unknown on new generations. There's probably more refactoring to do to ensure consistent naming, and we have some oddities around non-terminal conditions (ideally InitializeConditions would initialize everything, but the ConditionSet only contains terminal conditions, so we omit the rest).
Markus makes a good point about the way the 'Active' condition interacts between Revision and PA resources too.
This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale
/assign @dprotaso
Can you update the issue with some idea of the remaining work? It seems like there are 5 issues and 8 PRs referenced in this issue, along with a checklist with no boxes checked.
/triage accepted /remove-kind question /kind cleanup
/lifecycle frozen