operator-controller
operator-controller copied to clipboard
Creating ClusterCatalog hangs for a while, without any signal for what's going on.
I created a ClusterCatalog with a private image from redhat.registry.io, but it hung for a while without any signal about what's going on..
NAME LASTUNPACKED SERVING AGE
redhat-community-operators 1s True 14m
Notice the last unpacked (1s) vs the age (14m). When I inspected the logs, I saw that the unpacker ran into intermittent issues, before it was able to pull the image successfully. For the first 14m though, there was no signal on the CR for what's going on (you can see the unpacker struggling if you inspect the controller logs though).
We should restructure our controller to be a state machine, so that the unpacking operation can be transitioned through the different states, instead of the entire operation being an all-in blocking operation.
The different states for the transition machine could be something like:
Initiated->Pulling->Unpacking->Unpacked
I believe this is a result of synchronous image pulling. The fix for this should be to design and implement async image pulling. At that point, the reconciler could kick off a background goroutine to pull an image, and then immediately update status to say "Pulling" somehow.
Then later when the image pull is complete (or fails), we could signal the reconciler to tell it to reconcile again at which point it could find the result and continue unpacking/processing it.
@joelanford you'd mentioned this call but I don't think I picked up your points that well in the call:
What was the reason you mentioned again for not doing a state machine again? What you're saying here
At that point, the reconciler could kick off a background goroutine to pull an image, and then immediately update status to say "Pulling" somehow. Then later when the image pull is complete (or fails), we could signal the reconciler to tell it to reconcile again at which point it could find the result and continue unpacking/processing it.
seems much easier to do if we just requeueAfter in each state, until we reach a "final" state (of success or error), with exponential backoff in case of errors.
This is a level-based design too.
we could signal the reconciler to tell it to reconcile again
sounds like you're suggesting an edge-based design.
Maybe I'm misinterpreting what you mean by a state machine design, but I've seen cases where controllers are implemented as state machines where they make assumptions that the status of an object is correct and then execute a state's logic based on that assumption. But if the status is incorrect, then the logic that is executed is also incorrect and/or incomplete, and that might mean we never successfully run other parts of the state machine that we think will always run.
The correct design of a controller is to read the actual and current state of the world and then do as much as possible to drive toward that desired state.
we could signal the reconciler to tell it to reconcile again
sounds like you're suggesting an edge-based design.
Not quite. Think of the async image puller a lot like the apiserver and our client that fetches from it, and the watches we register with it.
- We ask it: "do you have the data we expect?" If so, great we use it and continue on with that data.
- If it doesn't have the data we expect, we ask it to fetch that data in the background (kind of like how we used to create a unpack pod), and then update our status to say "don't have it yet, but it's being fetched"
- Just like when the apiserver sends an event via an informer about a change in an object in etcd, we could have something like an informer against the async image puller, and we'd register that as a
source.Channel(or similar) on our controller such that we would just get a generic event for "this catalog needs to be reconciled"
So then anytime we reconcile, no matter why we're reconciling, we're looking at current state (what is in our APIserver cache, what is in our image cache) and then taking action based on what we see there.
I think I see what you mean, let me try to find some time to prototype this. I'll update here once the prototype is ready.
Issues go stale after 90 days of inactivity. If there is no further activity, the issue will be closed in another 30 days.
This issue has been closed due to inactivity.