crossplane-runtime icon indicating copy to clipboard operation
crossplane-runtime copied to clipboard

Status updates in `ExternalClient.Create` are omitted

Open muvaf opened this issue 4 years ago • 13 comments

What problem are you facing?

With https://github.com/crossplane/crossplane-runtime/pull/283, we issue a standard update in all cases and that causes the status to be lost since managed resources treat it as subresource. But there are cases where some information about the resource is available only in the response of the creation call to the API.

How could Crossplane help solve your problem?

We could possibly expand resource.Managed interface to have GetStatus() runtime.Object and SetStatus(runtime.Object) and then DeepCopy the status before the update operation and later call SetStatus on the returned object and issue a status update this time.

muvaf avatar Sep 02 '21 09:09 muvaf

@negz this issue causes Creating condition to never appear in status, which makes it feel like there isn't any progress.

muvaf avatar Sep 13 '21 15:09 muvaf

Whoops - good catch. I always look at events now so I didn't even notice. I think there's a simpler fix for status conditions specifically per #292. That won't cover the case in which a Create implementation itself is mutating status, but I don't think that is a pattern I've seen yet.

negz avatar Sep 15 '21 00:09 negz

I don't think that is a pattern I've seen yet.

We noted in https://github.com/crossplane/crossplane-runtime/pull/292#issuecomment-921777252 that this pattern does in fact exist, at least in some AWS and Azure controllers.

i-dont-think-they-really-exist

negz avatar Sep 17 '21 23:09 negz

We could possibly expand resource.Managed interface to have GetStatus() runtime.Object and SetStatus(runtime.Object) and then DeepCopy the status before the update operation and later call SetStatus on the returned object and issue a status update this time.

https://github.com/crossplane/crossplane-runtime/pull/280#discussion_r689872890

Hasan had a similar suggestion during review of a PR that lead to #283.

negz avatar Sep 17 '21 23:09 negz

Despite the guidance around the status field's "recreatability" being relaxed per https://github.com/kubernetes/enhancements/pull/2537 I think we should be warning folks to update status (or really any part of the managed resource) during the Create call as only as a last resort, given that we can't just call Create again on the next Reconcile if we fail to persist the result. It's always better to update the managed resource during Observe where possible.

That said, if we know there are edge cases where we have no alternative but to rely on persisting updates made during Create so it does seem like we need a solution here.

I don't love the idea of having to extract and save status so we can later persist it again, especially if it requires modifying the resource.Managed interface. 🤔 Some potential alternatives (that I don't particularly love either):

  1. We could consider Create updating status to be enough of an edge case that folks should plumb a Kubernetes client down into their ExternalClient to do so.
  2. We could call Status().Update() before we called UpdateCriticalAnnotations(). The risk being if the status update failed we'd fail to persist our critical annotation and get stuck waiting for human intervention. We could alleviate that by retrying the status update, and further alleviate it by adding a UpdatedStatus bool result to the ExternalObservation struct to allow us to avoid trying to persist status at all when there was no need to.

I think the alternative I like the best is still just trying to establish a "thou shalt not update anything (except annotations) during Create" rule. If I understand correctly we haven't yet found any AWS resources that make status updates in Create that they couldn't also make in Observe. We do know some Azure controllers rely on saving their operations, but it actually seems like they do that to workaround the fact that the external resources don't show up in the Azure API for some time after they're created, so we could probably just ignore the operation completely and use the new WithCreationGracePeriod option to cover that case.

negz avatar Sep 18 '21 00:09 negz

FYI, this causes reconciler to drop any status update made in Observe before the first successful Create call. In long-running creation calls, like AWS DBs, Azure DBs and async Terrajet calls, the error of the creation is returned in the GET call (either in a field of the returned object or as part of the call). So, we have to call status update in Observe to catch those creation errors.

I think the alternative I like the best is still just trying to establish a "thou shalt not update anything (except annotations) during Create" rule. If I understand correctly we haven't yet found any AWS resources that make status updates in Create that they couldn't also make in Observe.

That rule wouldn't apply in this case unfortunately, where creation error needs to be saved and Create needs to be called repeatedly until the problem is gone. But the solution doesn't have to be related to this issue specifically; we can issue a status update before external.Create call to make sure whatever is set in Observe is persisted.

muvaf avatar Oct 05 '21 22:10 muvaf

FYI, this causes reconciler to drop any status update made in Observe before the first successful Create call.

I'm working through this today, trying to determine how much of a blocker it could be for updating providers to the latest version of crossplane-runtime. First let me make sure I understand the situation. You're pointing out that in some cases a Create call will appear to succeed, but then later during Observe we'll find out that it didn't really. In this case we probably want to record this situation somehow to the managed resource status.

As far as I can tell, we still will update the status in most cases. For example:

  1. If during Observe we update the MR status then return ResourceExists: true (e.g. the resource was created but is in a bad state) we should fall through to the "resource is up-to-date" clause where we'll persist the status update. I think this is what the RDS example you linked does.
  2. If during Observe we return an error the reconciler should persist that error as a status condition (and event), bringing along any other status updates that were made before the error was returned. I think this is what the Azure Postgres example you linked does.

Is my understanding here correct, or am I missing something? Also, can you let me know whether Terrajet providers are still affected by this? I notice that https://github.com/crossplane-contrib/terrajet/pull/103 removed the block you referenced where we were previously recording a status condition during Observe.

negz avatar Dec 02 '21 22:12 negz

If during Observe we update the MR status then return ResourceExists: true

The case I'm referring to is where we have to return ResourceExists: false because creation actually failed (even though external.Create succeeded in earlier reconcile) and needs to be retried. But in the current implementation if we return ResourceExists: false at any time, the status is lost, which contained the error about the late-failed creation.

Also, can you let me know whether Terrajet providers are still affected by this? I notice that crossplane-contrib/terrajet#103 removed the block you referenced where we were previously recording a status condition during Observe.

That was a workaround and now we're doing a similar thing in callback mechanism by issuing a status update right after the operation completes, but that doesn't apply to native async resources since they don't have such callback mechanism from the APIs they consume. But for Terrajet, we should be good.

muvaf avatar Dec 03 '21 21:12 muvaf

The case I'm referring to is where we have to return ResourceExists: false because creation actually failed (even though external.Create succeeded in earlier reconcile) and needs to be retried.

Is this hypothetical, or do we do this in practice? As far as I could tell none of the examples you provided follow this pattern.

negz avatar Dec 03 '21 22:12 negz

@negz We do that in Terrajet in this line to trigger a new creation, here in Azure PostgreSQL.

In fact, I think it can be considered a bug if we don't do that because then it'd mean we just never retry once a long-running creation fails. We don't see that very often since cloud providers like AWS and GCP make their best to do all validations in the creation calls, albeit there are always cases in those, too, where validation passes but creation fails due to external factors like insufficient capacity.

muvaf avatar Dec 03 '21 23:12 muvaf

@negz Opened https://github.com/crossplane/crossplane-runtime/issues/307 to be more specific about the problem I mentioned above. We can keep this issue contained to status updates being omitted after ExternalClient.Create.

muvaf avatar Dec 04 '21 14:12 muvaf

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 13 '22 07:08 stale[bot]

Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as stale because it has had no activity in the last 90 days. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

github-actions[bot] avatar Sep 04 '24 01:09 github-actions[bot]