Should avoid status check in operator sync loop
Is your feature request related to a problem? Please describe.
Currently, there is some "status check" logic in the sync loop, such as:
- ensureNodePlugins: polls the NVIDIA plugin status for up to 60 seconds.
- CheckNodeClaimStatus: polls the NodeClaim status for up to 240 seconds.
- CheckResourceStatus: polls the deployment/statefulset/job status until a timeout is reached.
These status checks block the reconciliation, which violates best practices for Kubernetes operators. If all 5 concurrent reconcilers are blocked, the operator becomes unusable. The right pattern is to return immediately if the status has not reached the goal state and to check again on the next reconciliation loop.
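For illustration, a minimal sketch of that pattern with controller-runtime (the WorkspaceReconciler here is simplified and checkNodeClaimsReady is a hypothetical helper, not kaito's actual code):

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// WorkspaceReconciler is a simplified stand-in for the workspace controller.
type WorkspaceReconciler struct {
	client.Client
}

// Reconcile checks the current state once and returns instead of polling,
// so the worker goroutine is freed for other workspaces.
func (r *WorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	ready, err := r.checkNodeClaimsReady(ctx, req) // hypothetical helper
	if err != nil {
		return ctrl.Result{}, err
	}
	if !ready {
		// Not in the goal state yet: return immediately. A watch event on
		// NodeClaims (or this timed requeue as a safety net) brings us back.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	// ...continue with plugin checks, workload deployment, etc.
	return ctrl.Result{}, nil
}

// checkNodeClaimsReady performs a single, non-blocking status check.
func (r *WorkspaceReconciler) checkNodeClaimsReady(ctx context.Context, req ctrl.Request) (bool, error) {
	// Placeholder: the real controller would list the workspace's NodeClaims
	// and inspect their readiness conditions here.
	return false, nil
}
```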
FYI Best practices for building Kubernetes Operators
Describe the solution you'd like
The right pattern is to return immediately if the status has not reached the goal state and to check again on the next reconciliation loop.
Describe alternatives you've considered
Additional context
@cvvz I agree with your issue. It's a good idea to optimize the blocking parts of the reconcile one by one, rather than changing kaito a lot in a single pull request.
Yes, all the blocking parts should be refactored one by one.
Found another long-running operation: currently the Job is deleted with Foreground propagation. We should actually delete it in the Background and, in the following reconciliations, check whether the old one still exists; if not, create a new one.
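A hedged sketch of that flow with the controller-runtime client (ensureFreshJob and its calling convention are illustrative, not the existing kaito code):

```go
package controllers

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureFreshJob requests background deletion of the old Job and only creates
// the replacement once a later reconciliation observes that the old one is
// gone. It returns true when the new Job has been created.
func ensureFreshJob(ctx context.Context, c client.Client, key client.ObjectKey, desired *batchv1.Job) (bool, error) {
	old := &batchv1.Job{}
	err := c.Get(ctx, key, old)
	switch {
	case apierrors.IsNotFound(err):
		// The old Job is gone; safe to create the new one.
		return true, c.Create(ctx, desired)
	case err != nil:
		return false, err
	default:
		// The old Job still exists: ask for background deletion and return
		// immediately; the next reconciliation re-checks instead of waiting.
		if delErr := c.Delete(ctx, old,
			client.PropagationPolicy(metav1.DeletePropagationBackground)); delErr != nil && !apierrors.IsNotFound(delErr) {
			return false, delErr
		}
		return false, nil
	}
}
```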
Yes, it is not optimal for sure. We just need to be careful here. In every reconcile, we compute whether we need to add or remove nodes. If the NodeClaims are in a pending state, the calculation needs to be adjusted so that the decision stays "stable". We don't want to accidentally add or remove nodes, which may be OK from a reconciliation perspective but bad for UX.
This long-wait antipattern simplifies the target node computation.
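As a toy illustration of that "stable" calculation (not kaito's actual computation), NodeClaims that are still pending can be counted as capacity that is already being provisioned:

```go
package controllers

// nodeClaimsToCreate illustrates one way to keep the scale-out decision stable
// once reconciles no longer wait: NodeClaims that are still pending count as
// capacity already being provisioned, so repeated non-blocking reconciles do
// not request duplicate nodes.
func nodeClaimsToCreate(targetNodes, readyNodes, pendingNodeClaims int) int {
	if missing := targetNodes - readyNodes - pendingNodeClaims; missing > 0 {
		return missing
	}
	return 0
}
```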
For ensureNodePlugins and the resource status check, we can break the polling out of the sync loop.
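For example, a single non-blocking readiness check could replace the Deployment polling in CheckResourceStatus; a rough sketch (not the existing code):

```go
package controllers

import appsv1 "k8s.io/api/apps/v1"

// deploymentReady is a single, non-blocking readiness check that a reconcile
// could use instead of polling until timeout; if it returns false, the
// controller simply returns and relies on the Deployment watch to re-enqueue.
func deploymentReady(dep *appsv1.Deployment) bool {
	if dep.Spec.Replicas == nil {
		return false
	}
	return dep.Status.ObservedGeneration >= dep.Generation &&
		dep.Status.UpdatedReplicas == *dep.Spec.Replicas &&
		dep.Status.ReadyReplicas == *dep.Spec.Replicas
}
```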
Some comments from Deepwiki
Benefits of Using Sync Loops
There are several benefits to using sync loops in the workspace controller:
Resource Efficiency: By waiting for resources to be ready before proceeding, the controller avoids creating redundant resources. For example, waiting for pending NodeClaims prevents the creation of unnecessary nodes.
Dependency Management: Some resources depend on others being ready first. For example, the inference workload can only be deployed after the GPU nodes are ready and have the necessary plugins installed.
Error Handling: Sync loops allow the controller to handle transient errors by retrying operations until they succeed or a timeout is reached.
Status Reporting: During the sync loop, the controller can update the status of the Workspace CR to reflect the current state of the reconciliation process: workspace_controller.go:337-342
Graceful Timeout Handling: If resources don't become ready within a reasonable time, the controller can fail gracefully with a meaningful error message: workspace_controller.go:504-506
Alternative Approaches
Without sync loops, the controller would have to rely on one of these alternatives:
Kubernetes Watch Mechanisms: The controller could set up watches on resources and react to events. However, this would make the code more complex and potentially less reliable.
Requeue and Retry: The controller could simply requeue the reconciliation request and try again later. This is less efficient as it would involve more reconciliation cycles.
Asynchronous Processing: The controller could create resources asynchronously and not wait for them to be ready. This would make error handling and dependency management more difficult.
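On the "Graceful Timeout Handling" point: the timeout can be kept without blocking by comparing the elapsed time against a stored start timestamp on every reconcile. A sketch with illustrative names:

```go
package controllers

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// timedOut keeps the "graceful timeout" behavior without blocking: each
// reconcile compares the elapsed time against a stored start timestamp
// (e.g. the LastTransitionTime of a "provisioning" condition) and fails
// with a meaningful error once the window is exceeded.
func timedOut(started metav1.Time, timeout time.Duration) error {
	if time.Since(started.Time) > timeout {
		return fmt.Errorf("resources did not become ready within %s", timeout)
	}
	return nil
}
```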
Imagine we create 100 workspaces at the same time. All of them need to CreateNodeClaim and WaitForPendingNodeClaims. Since the reconciler concurrency is 5, if 5 workspaces are blocked in WaitForPendingNodeClaims, the operator can't handle the other 95 workspaces.
But suppose we change WaitForPendingNodeClaims to a new function named CheckForPendingNodeClaims, which just checks the NodeClaims' status and, if they are not in the desired state, returns nil immediately (without triggering a re-enqueue). The next NodeClaim event will still trigger reconciliation, and we will call CheckForPendingNodeClaims again, until the NodeClaims are in the desired state. In this way, almost all 100 workspaces can be handled at the same time.
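A rough sketch of that idea (the nodeClaimView type and listNodeClaims helper are placeholders for this illustration); the NodeClaim watch registered in the controller setup is what re-triggers reconciliation:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeClaimView is a placeholder for the NodeClaim status fields this sketch needs.
type nodeClaimView struct {
	Ready bool
}

type workspaceReconciler struct {
	client.Client
}

// CheckForPendingNodeClaims inspects the NodeClaims once and reports whether
// they are all in the desired state; it never sleeps or polls.
func (r *workspaceReconciler) CheckForPendingNodeClaims(ctx context.Context, req ctrl.Request) (bool, error) {
	claims, err := r.listNodeClaims(ctx, req) // hypothetical helper
	if err != nil {
		return false, err
	}
	for _, nc := range claims {
		if !nc.Ready {
			// Not in the desired state: the caller returns nil without
			// re-enqueueing; the next NodeClaim event triggers the next check.
			return false, nil
		}
	}
	return true, nil
}

func (r *workspaceReconciler) listNodeClaims(ctx context.Context, req ctrl.Request) ([]nodeClaimView, error) {
	// Placeholder: the real controller would list NodeClaims labeled with the
	// workspace name and map their conditions into this lightweight view.
	return nil, nil
}
```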