kaito icon indicating copy to clipboard operation
kaito copied to clipboard

fix: move nodeclaim check to right place

Open cvvz opened this issue 8 months ago • 6 comments

Reason for Change:

The PR mainly did two things:

  1. move nodeclaim check to right place. After creating nodeclaim, we should check the status in following reconcile, rather than polling to check the status.
  2. Fix a bug: This code return reconcile.Result{Requeue: false}, err actually cant stop reconciliation, because when error is not nil, the reconcile.Result{Requeue: false} will be ignored. We should use reconcile.TerminalError Requirements
  • [ ] added unit tests and e2e tests (if applicable).

Issue Fixed:

Notes for Reviewers:

cvvz avatar Apr 27 '25 10:04 cvvz

Title

(Describe updated until commit https://github.com/kaito-project/kaito/commit/0cb969fdf8e529bd326cf17c513473b4c7ee6ef6)

Improve NodeClaim Handling and Error Management


Description

  • Moved nodeclaim check to post-creation reconcile

  • Replaced incorrect error handling with reconcile.TerminalError

  • Simplified nodeclaim creation logic by removing unnecessary retries

  • Updated tests to reflect changes


Changes walkthrough 📝

Relevant files
Bug fix
ragengine_controller.go
Update error handling and remove redundant checks               

pkg/ragengine/controllers/ragengine_controller.go

  • Removed erroneous nodeclaim check after creation
  • Updated error handling to use reconcile.TerminalError
  • +0/-4     
    workspace_controller.go
    Remove redundant nodeclaim check                                                 

    pkg/workspace/controllers/workspace_controller.go

    • Removed erroneous nodeclaim check after creation
    +0/-4     
    Enhancement
    nodeclaim.go
    Simplify nodeclaim creation and handle SKU unavailability

    pkg/utils/nodeclaim/nodeclaim.go

  • Simplified CreateNodeClaim function by removing retry logic
  • Added reconcile.TerminalError for SKU unavailability
  • Moved node class check before nodeclaim creation
  • +16/-27 
    Tests
    nodeclaim_test.go
    Update tests for nodeclaim changes                                             

    pkg/utils/nodeclaim/nodeclaim_test.go

  • Removed test cases for removed retry logic
  • Updated test cases to reflect new error handling
  • +2/-28   
    workspace_controller_test.go
    Update tests for error handling                                                   

    pkg/workspace/controllers/workspace_controller_test.go

  • Updated test cases to reflect new error handling
  • Added test case for SKU unavailability
  • +26/-11 

    Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
  • kaito-pr-agent[bot] avatar Apr 27 '25 10:04 kaito-pr-agent[bot]

    PR Reviewer Guide 🔍

    (Review updated until commit https://github.com/kaito-project/kaito/commit/0cb969fdf8e529bd326cf17c513473b4c7ee6ef6)

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Possible Issue

    The comment about stopping reconciliation due to machine/nodeClaim instance types unavailability was removed, but the logic to handle this case was not fully replaced. Ensure that the new logic correctly handles this scenario.

    return reconcile.Result{}, err
    
    Possible Issue

    The retry mechanism was removed from CreateNodeClaim. Ensure that the new implementation correctly handles transient errors and retries as necessary.

    // CreateNodeClaim creates a nodeClaim object.
    func CreateNodeClaim(ctx context.Context, nodeClaimObj *karpenterv1.NodeClaim, kubeClient client.Client) error {
    	klog.InfoS("CreateNodeClaim", "nodeClaim", klog.KObj(nodeClaimObj))
    	if featuregates.FeatureGates[consts.FeatureFlagEnsureNodeClass] {
    		err := CheckNodeClass(ctx, kubeClient)
    		if err != nil {
    			return err
    		}
    	}
    
    	return kubeClient.Create(ctx, nodeClaimObj, &client.CreateOptions{})
    }
    
    Possible Issue

    Similar to the ragengine_controller.go, the comment about stopping reconciliation due to machine/nodeClaim instance types unavailability was removed without replacing the logic. Ensure that the new logic correctly handles this scenario.

    return reconcile.Result{}, err
    

    kaito-pr-agent[bot] avatar Apr 27 '25 10:04 kaito-pr-agent[bot]

    PR Code Suggestions ✨

    Latest suggestions up to 0cb969f

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Add error handling for unavailable instance types

    Handle the case where the node claim creation fails due to unavailable instance
    types.

    pkg/utils/nodeclaim/nodeclaim.go [246-255]

     if featuregates.FeatureGates[consts.FeatureFlagEnsureNodeClass] {
         err := CheckNodeClass(ctx, kubeClient)
         if err != nil {
             return err
         }
     }
    -return kubeClient.Create(ctx, nodeClaimObj, &client.CreateOptions{})
    +err := kubeClient.Create(ctx, nodeClaimObj, &client.CreateOptions{})
    +if err != nil {
    +    _, conditionFound := lo.Find(nodeClaimObj.GetConditions(), func(condition status.Condition) bool {
    +        return condition.Type == karpenterv1.ConditionTypeLaunched &&
    +            condition.Status == metav1.ConditionFalse && strings.Contains(condition.Message, consts.ErrorInstanceTypesUnavailable)
    +    })
    +    if conditionFound {
    +        klog.Error(consts.ErrorInstanceTypesUnavailable, "reconcile will not continue")
    +        return reconcile.TerminalError(fmt.Errorf(consts.ErrorInstanceTypesUnavailable))
    +    }
    +    return err
    +}
    
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion addresses a critical issue by adding error handling for unavailable instance types, which can prevent unnecessary retries and improve the robustness of the code.

    Medium

    Previous suggestions

    Suggestions up to commit d60224c
    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Add error handling after node creation

    Handle the case where the node class check fails and ensure proper error handling.

    pkg/utils/nodeclaim/nodeclaim.go [246-256]

     if featuregates.FeatureGates[consts.FeatureFlagEnsureNodeClass] {
         err := CheckNodeClass(ctx, kubeClient)
         if err != nil {
             return err
         }
     }
    -return kubeClient.Create(ctx, nodeClaimObj, &client.CreateOptions{})
    +err := kubeClient.Create(ctx, nodeClaimObj, &client.CreateOptions{})
    +if err != nil {
    +    return err
    +}
    
    Suggestion importance[1-10]: 7

    __

    Why: The suggestion adds error handling after attempting to create a node claim, which is a good practice to ensure that any errors during creation are properly managed and returned.

    Medium
    Suggestions up to commit 894f8ec
    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Add error handling after node creation

    Handle the case where the node class check fails and ensure proper error handling.

    pkg/utils/nodeclaim/nodeclaim.go [246-256]

     if featuregates.FeatureGates[consts.FeatureFlagEnsureNodeClass] {
         err := CheckNodeClass(ctx, kubeClient)
         if err != nil {
             return err
         }
     }
    -return kubeClient.Create(ctx, nodeClaimObj, &client.CreateOptions{})
    +err := kubeClient.Create(ctx, nodeClaimObj, &client.CreateOptions{})
    +if err != nil {
    +    return err
    +}
    
    Suggestion importance[1-10]: 7

    __

    Why: The suggestion adds error handling after node creation, which is a good practice to ensure that any errors during creation are properly handled and returned.

    Medium
    Verify terminal error handling

    Ensure the test case correctly simulates the scenario where the SKU is unavailable
    and verify the terminal error is returned.

    pkg/workspace/controllers/workspace_controller_test.go [325-338]

    +callMocks: func(c *test.MockClient) {
    +    c.On("Get", mock.IsType(context.Background()), mock.Anything, mock.IsType(&azurev1alpha2.AKSNodeClass{}), mock.Anything).Return(nil)
    +    c.On("Create", mock.IsType(context.Background()), mock.IsType(&azurev1alpha2.AKSNodeClass{}), mock.Anything).Return(nil)
    +    c.On("Create", mock.IsType(context.Background()), mock.IsType(&karpenterv1.NodeClaim{}), mock.Anything).Return(nil)
    +    c.On("Get", mock.IsType(context.Background()), mock.Anything, mock.IsType(&karpenterv1.NodeClaim{}), mock.Anything).Return(nil)
    +    c.On("Get", mock.IsType(context.Background()), mock.Anything, mock.IsType(&v1beta1.Workspace{}), mock.Anything).Return(nil)
    +    c.StatusMock.On("Update", mock.IsType(context.Background()), mock.IsType(&v1beta1.Workspace{}), mock.Anything).Return(nil)
    +},
    +cloudProvider: consts.AzureCloudName,
    +nodeClaimConditions: []status.Condition{
    +    {
    +        Type:    karpenterv1.ConditionTypeLaunched,
    +        Status:  v1.ConditionFalse,
    +        Message: consts.ErrorInstanceTypesUnavailable,
    +    },
    +},
    +workspace:     *test.MockWorkspaceWithPreset,
    +expectedError: reconcile.TerminalError(fmt.Errorf(consts.ErrorInstanceTypesUnavailable)),
     
    -
    
    Suggestion importance[1-10]: 7

    __

    Why: The suggestion ensures that the test case correctly simulates the scenario where the SKU is unavailable and verifies that the terminal error is returned, which is important for testing the new behavior.

    Medium
    General
    Update test case for new behavior

    Update the test case to reflect the new behavior and ensure it tests the correct
    error scenario.

    pkg/utils/nodeclaim/nodeclaim_test.go [24-34]

    +callMocks: func(c *test.MockClient) {
    +    c.On("Get", mock.IsType(context.Background()), mock.Anything, mock.IsType(&azurev1alpha2.AKSNodeClass{}), mock.Anything).Return(nil)
    +    c.On("Create", mock.IsType(context.Background()), mock.IsType(&azurev1alpha2.AKSNodeClass{}), mock.Anything).Return(nil)
    +    c.On("Create", mock.IsType(context.Background()), mock.IsType(&karpenterv1.NodeClaim{}), mock.Anything).Return(errors.New("test error"))
    +    c.On("Get", mock.IsType(context.Background()), mock.Anything, mock.IsType(&karpenterv1.NodeClaim{}), mock.Anything).Return(nil)
    +    c.On("Get", mock.IsType(context.Background()), mock.Anything, mock.IsType(&v1beta1.Workspace{}), mock.Anything).Return(nil)
    +    c.StatusMock.On("Update", mock.IsType(context.Background()), mock.IsType(&v1beta1.Workspace{}), mock.Anything).Return(nil)
    +},
    +cloudProvider:       consts.AzureCloudName,
    +nodeClaimConditions: []status.Condition{},
    +workspace:           *test.MockWorkspaceWithPreset,
    +expectedError:       errors.New("test error"),
     
    -
    
    Suggestion importance[1-10]: 2

    __

    Why: The suggestion does not actually change the test case; it merely repeats the existing code. Therefore, it does not provide any new value or improvement.

    Low
    Suggestions up to commit 0dd28d5
    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Check for unavailable instance types post creation

    Handle the case where the node claim creation fails due to unavailable instance
    types.

    pkg/utils/nodeclaim/nodeclaim.go [246-255]

     if featuregates.FeatureGates[consts.FeatureFlagEnsureNodeClass] {
         err := CheckNodeClass(ctx, kubeClient)
         if err != nil {
             return err
         }
     }
    -return kubeClient.Create(ctx, nodeClaimObj, &client.CreateOptions{})
    +err := kubeClient.Create(ctx, nodeClaimObj, &client.CreateOptions{})
    +if err != nil {
    +    _, conditionFound := lo.Find(nodeClaimObj.GetConditions(), func(condition status.Condition) bool {
    +        return condition.Type == karpenterv1.ConditionTypeLaunched &&
    +            condition.Status == metav1.ConditionFalse && strings.Contains(condition.Message, consts.ErrorInstanceTypesUnavailable)
    +    })
    +    if conditionFound {
    +        klog.Error(consts.ErrorInstanceTypesUnavailable, "reconcile will not continue")
    +        return reconcile.TerminalError(fmt.Errorf(consts.ErrorInstanceTypesUnavailable))
    +    }
    +    return err
    +}
    
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion addresses a critical issue by handling the case where node claim creation fails due to unavailable instance types, which was previously ignored in the PR. This ensures that the reconcile process stops when the specific error occurs, preventing unnecessary retries and improving the robustness of the code.

    Medium
    Suggestions up to commit 14a680e
    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Check SKU availability post-creation

    Handle the case where the SKU is unavailable after creating the NodeClaim.

    pkg/utils/nodeclaim/nodeclaim.go [247-253]

     err := CheckNodeClass(ctx, kubeClient)
     if err != nil {
         return err
     }
     
     err = kubeClient.Create(ctx, nodeClaimObj.DeepCopy(), &client.CreateOptions{})
     if err != nil {
         return err
     }
     
    +updatedObj := &karpenterv1.NodeClaim{}
    +err = kubeClient.Get(ctx, client.ObjectKey{Name: nodeClaimObj.Name, Namespace: nodeClaimObj.Namespace}, updatedObj, &client.GetOptions{})
    +if err != nil {
    +    return err
    +}
    +
    +_, conditionFound := lo.Find(updatedObj.GetConditions(), func(condition status.Condition) bool {
    +    return condition.Type == karpenterv1.ConditionTypeLaunched &&
    +        condition.Status == metav1.ConditionFalse && strings.Contains(condition.Message, consts.ErrorInstanceTypesUnavailable)
    +})
    +if conditionFound {
    +    klog.Error(consts.ErrorInstanceTypesUnavailable, "reconcile will not continue")
    +    return reconcile.TerminalError(fmt.Errorf(consts.ErrorInstanceTypesUnavailable))
    +}
    +
    
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion correctly addresses a critical issue by checking SKU availability after creating the NodeClaim, which was previously handled in a retry loop. This ensures that the reconciliation process stops if the SKU is unavailable, preventing unnecessary retries and improving the robustness of the code.

    Medium

    kaito-pr-agent[bot] avatar Apr 27 '25 10:04 kaito-pr-agent[bot]

    PR Code Suggestions ✨

    No code suggestions found for the PR.

    kaito-pr-agent[bot] avatar Apr 27 '25 10:04 kaito-pr-agent[bot]

    Nice clean up. Fix UT please.

    Fei-Guo avatar Apr 27 '25 17:04 Fei-Guo

    There is another finding. I just noticed that we used CheckNodeClaimStatus in both WaitForPendingNodeClaims and CreateNodeClaim. IMO since we have CheckNodeClaimStatus in WaitForPendingNodeClaims, there is no need to CheckNodeClaimStatus in CreateNodeClaim. Should we deleted the CheckNodeClaimStatus in CreateNodeClaim? cc @zhuangqh @Fei-Guo

    cvvz avatar Apr 28 '25 10:04 cvvz

    There is another finding. I just noticed that we used CheckNodeClaimStatus in both WaitForPendingNodeClaims and CreateNodeClaim. IMO since we have CheckNodeClaimStatus in WaitForPendingNodeClaims, there is no need to CheckNodeClaimStatus in CreateNodeClaim. Should we deleted the CheckNodeClaimStatus in CreateNodeClaim? cc @zhuangqh @Fei-Guo

    In CreateNodeClaim, it's CheckNodeClass not CheckNodeClaimStatus ?

    zhuangqh avatar Apr 29 '25 01:04 zhuangqh

    It's CheckNodeClaimStatus

    cvvz avatar Apr 29 '25 02:04 cvvz

    Oh I see. you mean WorkspaceReconciler.CreateNodeClaim not this CreateNodeClaim.

    If I understand correctly, CheckNodeClaimStatus in CreateNodeClaim is the first time you create nodeclaim and pending waiting for it. WaitForPendingNodeClaims is something when this obj requeue or controller restarts.

    If you remove that code, do you want to let it fail immediately at https://github.com/kaito-project/kaito/blob/main/pkg/workspace/controllers/workspace_controller.go#L487 and requeue for pending waiting?

    zhuangqh avatar Apr 29 '25 02:04 zhuangqh

    WaitForPendingNodeClaims is mainly used to protect a case where controller is restarted for various reasons, When the reconciliation loop restarts, we want it to wait for the nodeclaim created by previous reconciler loop to be ready before continuing.

    Fei-Guo avatar Apr 29 '25 02:04 Fei-Guo

    No need to fail. Just return nil after creating nodeclaim. Since we watched nodeclaim, if it status changed another reconciliation will be triggered.

    cvvz avatar Apr 29 '25 02:04 cvvz

    Actually, check status after creating can only be performed one time, because if no nodeclaim need to be created it won't be performed, but check status in WaitForPendingNodeClaims can always be performed.

    cvvz avatar Apr 29 '25 02:04 cvvz

    No need to fail. Just return nil after creating nodeclaim. Since we watched nodeclaim, if it status changed another reconciliation will be triggered.

    Yes. This way works. I agree with you.

    zhuangqh avatar Apr 29 '25 04:04 zhuangqh

    Let us keep all sync loop for now.

    Fei-Guo avatar Apr 29 '25 05:04 Fei-Guo

    Yes, I'm not planning to make the changes in this PR. Just raised for discussion. We need to first enhance the e2e testing capabilities before making major changes or refactoring.

    cvvz avatar Apr 29 '25 06:04 cvvz

    Yes, I'm not planning to make the changes in this PR. Just raised for discussion. We need to first enhance the e2e testing capabilities before making major changes or refactoring.

    @cvvz It's a good idea to improve E2E tests at first before improving sync loops.

    rambohe-ch avatar Apr 29 '25 11:04 rambohe-ch

    @Fei-Guo Can you take another look?

    zhuangqh avatar May 08 '25 12:05 zhuangqh

    @Fei-Guo @zhuangqh This pull request has removed a nodeclaim status check, and make code more readable. maybe we can include it in the 0.4.6 release.

    rambohe-ch avatar May 08 '25 23:05 rambohe-ch