kaito icon indicating copy to clipboard operation
kaito copied to clipboard

Should avoid status check in operator sync loop

Open cvvz opened this issue 8 months ago • 7 comments

Is your feature request related to a problem? Please describe.

Currently, there are some "status check" logic in the sync loop. Such as :

  1. ensureNodePlugins : Polling to check Nvidia Plugin status for 60 secs.
  2. CheckNodeClaimStatus: Polling to check nodeClaim status for 240 secs
  3. CheckResourceStatus: Polling to check deployment/statefulset/job status until timeout.
  4. Found another long-running operation: Currently the job is deleted in Foreground. We actually should delete it in background and in the following reconciliations check whether the old one still exists if not create a new one.

The status check will block the reconciliation, which violates the best practices for Kubernetes operators. If all the 5 concurrent reconcilers are all blocked, the operator will be unuseable. The right pattern is returning immediately if the status does not get into goal state and check again on the next reconciliation loop.

FYI Best practices for building Kubernetes Operators

Describe the solution you'd like The right pattern is returning immediately if the status does not get into goal state and check again on the next reconciliation loop.

Describe alternatives you've considered

Additional context

cvvz avatar Apr 26 '25 03:04 cvvz

@cvvz I agree with your issue. It's a good idea to optimize reconcile block one by one, and don't modify kaito a lot in one pull request.

rambohe-ch avatar Apr 26 '25 06:04 rambohe-ch

Yes, all the block parts should be refactored one by one.

cvvz avatar Apr 26 '25 08:04 cvvz

Found another long-running operation: Currently the job is deleted in Foreground. We actually should delete it in background and in the following reconciliations check whether the old one still exists if not create a new one.

cvvz avatar Apr 26 '25 08:04 cvvz

Yes. It is not optimal for sure. We just need to be careful here. In every reconcile, we compute whether we need to add/remove nodes. If the nodeclaims are in pending state, the calculation needs to be adjusted and make the decision "stable". We don't want to accidentally add/remove nodes which maybe ok from reconciliation perspective, but bad for UX.

This long-wait antipattern simplifies the target node computation.

For ensureplugin and check resource state, we can break the polling in sync loop.

Fei-Guo avatar Apr 27 '25 18:04 Fei-Guo

Some comments from Deepwiki

Benefits of Using Sync Loops There are several benefits to using sync loops in the workspace controller:

Resource Efficiency: By waiting for resources to be ready before proceeding, the controller avoids creating redundant resources. For example, waiting for pending NodeClaims prevents the creation of unnecessary nodes.

Dependency Management: Some resources depend on others being ready first. For example, the inference workload can only be deployed after the GPU nodes are ready and have the necessary plugins installed.

Error Handling: Sync loops allow the controller to handle transient errors by retrying operations until they succeed or a timeout is reached.

Status Reporting: During the sync loop, the controller can update the status of the Workspace CR to reflect the current state of the reconciliation process: workspace_controller.go:337-342

Graceful Timeout Handling: If resources don't become ready within a reasonable time, the controller can fail gracefully with a meaningful error message: workspace_controller.go:504-506

Alternative Approaches Without sync loops, the controller would have to rely on one of these alternatives:

Kubernetes Watch Mechanisms: The controller could set up watches on resources and react to events. However, this would make the code more complex and potentially less reliable.

Requeue and Retry: The controller could simply requeue the reconciliation request and try again later. This is less efficient as it would involve more reconciliation cycles.

Asynchronous Processing: The controller could create resources asynchronously and not wait for them to be ready. This would make error handling and dependency management more difficult.

Fei-Guo avatar Apr 27 '25 18:04 Fei-Guo

Imagine if we created 100 workspaces at the same time. All of them need to CreateNodeClaim and WaitForPendingNodeClaims, since the concurrency of reconciler is 5, if 5 workspaces are blocking in WaitForPendingNodeClaims, that means the operator can't handle the other 95 workspaces. But assume if we change WaitForPendingNodeClaims to a new function named CheckForPendingNodeClaims, which just check the NodeClaims status and if not in desired state, just return nil immediately(will not trigger re-enqueue), and the next time NodeClaims event will still trigger reconciliation and we will CheckForPendingNodeClaims again...... until it is in desired state. In this way, almost all the 100 workspaces can be handled at the same time.

cvvz avatar Apr 28 '25 10:04 cvvz