operator-controller icon indicating copy to clipboard operation
operator-controller copied to clipboard

Creating ClusterCatalog hangs for a while, without any signal for what's going on.

Open anik120 opened this issue 1 year ago • 4 comments

I created a ClusterCatalog with a private image from redhat.registry.io, but it hung for a while without any signal about what's going on..

NAME                         LASTUNPACKED   SERVING   AGE
redhat-community-operators   1s             True      14m

Notice the last unpacked (1s) vs the age (14m). When I inspected the logs, I saw that the unpacker ran into intermittent issues, before it was able to pull the image successfully. For the first 14m though, there was no signal on the CR for what's going on (you can see the unpacker struggling if you inspect the controller logs though).

We should restructure our controller to be a state machine, so that the unpacking operation can be transitioned through the different states, instead of the entire operation being an all-in blocking operation.

The different states for the transition machine could be something like:

Initiated->Pulling->Unpacking->Unpacked

anik120 avatar Oct 14 '24 15:10 anik120

I believe this is a result of synchronous image pulling. The fix for this should be to design and implement async image pulling. At that point, the reconciler could kick off a background goroutine to pull an image, and then immediately update status to say "Pulling" somehow.

Then later when the image pull is complete (or fails), we could signal the reconciler to tell it to reconcile again at which point it could find the result and continue unpacking/processing it.

joelanford avatar Oct 15 '24 15:10 joelanford

@joelanford you'd mentioned this call but I don't think I picked up your points that well in the call:

What was the reason you mentioned again for not doing a state machine again? What you're saying here

At that point, the reconciler could kick off a background goroutine to pull an image, and then immediately update status to say "Pulling" somehow. Then later when the image pull is complete (or fails), we could signal the reconciler to tell it to reconcile again at which point it could find the result and continue unpacking/processing it.

seems much easier to do if we just requeueAfter in each state, until we reach a "final" state (of success or error), with exponential backoff in case of errors.

This is a level-based design too.

we could signal the reconciler to tell it to reconcile again

sounds like you're suggesting an edge-based design.

anik120 avatar Oct 16 '24 13:10 anik120

Maybe I'm misinterpreting what you mean by a state machine design, but I've seen cases where controllers are implemented as state machines where they make assumptions that the status of an object is correct and then execute a state's logic based on that assumption. But if the status is incorrect, then the logic that is executed is also incorrect and/or incomplete, and that might mean we never successfully run other parts of the state machine that we think will always run.

The correct design of a controller is to read the actual and current state of the world and then do as much as possible to drive toward that desired state.

we could signal the reconciler to tell it to reconcile again

sounds like you're suggesting an edge-based design.

Not quite. Think of the async image puller a lot like the apiserver and our client that fetches from it, and the watches we register with it.

  1. We ask it: "do you have the data we expect?" If so, great we use it and continue on with that data.
  2. If it doesn't have the data we expect, we ask it to fetch that data in the background (kind of like how we used to create a unpack pod), and then update our status to say "don't have it yet, but it's being fetched"
  3. Just like when the apiserver sends an event via an informer about a change in an object in etcd, we could have something like an informer against the async image puller, and we'd register that as a source.Channel (or similar) on our controller such that we would just get a generic event for "this catalog needs to be reconciled"

So then anytime we reconcile, no matter why we're reconciling, we're looking at current state (what is in our APIserver cache, what is in our image cache) and then taking action based on what we see there.

joelanford avatar Oct 18 '24 14:10 joelanford

I think I see what you mean, let me try to find some time to prototype this. I'll update here once the prototype is ready.

anik120 avatar Oct 28 '24 16:10 anik120

Issues go stale after 90 days of inactivity. If there is no further activity, the issue will be closed in another 30 days.

github-actions[bot] avatar Jul 02 '25 01:07 github-actions[bot]

This issue has been closed due to inactivity.

github-actions[bot] avatar Aug 01 '25 01:08 github-actions[bot]