kaito icon indicating copy to clipboard operation
kaito copied to clipboard

fix: ensureNodePlugins does not work

Open cvvz opened this issue 8 months ago • 4 comments

Reason for Change:

resources.CheckNvidiaPlugin(ctx, nodeObj) checked nodeObj, which is a cached value. Retry is no use.

Another problem is that return nil in default condition will always be executed no matter found or not found. I think this is not what we want. This fix assume that we only return nil when found.

Requirements

  • [ ] added unit tests and e2e tests (if applicable).

Issue Fixed:

Notes for Reviewers:

cvvz avatar Apr 28 '25 09:04 cvvz

Title

(Describe updated until commit https://github.com/kaito-project/kaito/commit/219228de68d29f84d61007b7eeb10a5c5ec9c524)

Fix node plugin installation and add tests


Description

  • Ensure fresh node object retrieval in ensureNodePlugins

  • Correct logic for Nvidia plugin check and label update

  • Add unit tests for UpdateNodeWithLabel and ensureNodePlugins


Changes walkthrough 📝

Relevant files
Bug fix
ragengine_controller.go
Fix Nvidia plugin installation logic                                         

pkg/ragengine/controllers/ragengine_controller.go

  • Retrieve fresh node object before checking Nvidia plugin
  • Correct logic to return nil only when Nvidia plugin is found
  • +21/-13 
    workspace_controller.go
    Fix Nvidia plugin installation logic                                         

    pkg/workspace/controllers/workspace_controller.go

  • Retrieve fresh node object before checking Nvidia plugin
  • Correct logic to return nil only when Nvidia plugin is found
  • +21/-13 
    Tests
    ragengine_controller_test.go
    Add tests for node retrieval                                                         

    pkg/ragengine/controllers/ragengine_controller_test.go

  • Add test case for NotFound error during node retrieval
  • Insert node object into the map for testing
  • +29/-0   
    nodes_test.go
    Update tests for `UpdateNodeWithLabel`                                     

    pkg/utils/resources/nodes_test.go

  • Add test cases for UpdateNodeWithLabel including label presence check
  • Remove redundant test case
  • +17/-9   
    workspace_controller_test.go
    Add tests for node retrieval                                                         

    pkg/workspace/controllers/workspace_controller_test.go

  • Add test case for NotFound error during node retrieval
  • Insert node object into the map for testing
  • +31/-0   
    Enhancement
    nodes.go
    Enhance `UpdateNodeWithLabel` function                                     

    pkg/utils/resources/nodes.go

  • Update UpdateNodeWithLabel to accept *corev1.Node directly
  • Skip update if node already has Nvidia label
  • +8/-9     

    Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
  • kaito-pr-agent[bot] avatar Apr 28 '25 09:04 kaito-pr-agent[bot]

    PR Reviewer Guide 🔍

    (Review updated until commit https://github.com/kaito-project/kaito/commit/219228de68d29f84d61007b7eeb10a5c5ec9c524)

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Possible Issue

    The return nil statement inside the default case will now always return nil if the Nvidia plugin is found, which might not be the intended behavior. The original code returned nil only if the Nvidia plugin was not found.

    	return reconcile.Result{}, err
    }
    
    Possible Issue

    Similar to the issue in ragengine_controller.go, the return nil statement inside the default case will now always return nil if the Nvidia plugin is found, which might not be the intended behavior. The original code returned nil only if the Nvidia plugin was not found.

    	return nil
    }
    

    kaito-pr-agent[bot] avatar Apr 28 '25 09:04 kaito-pr-agent[bot]

    PR Code Suggestions ✨

    Latest suggestions up to 219228d

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    General
    Simplify label check

    Remove unnecessary nested if statement for checking Nvidia label value.

    pkg/utils/resources/nodes.go [53-55]

    -if nvidiaLabelVal, found := freshNode.Labels[LabelKeyNvidia]; found {
    -    if nvidiaLabelVal == LabelValueNvidia {
    -        return nil
    -    }
    +if freshNode.Labels[LabelKeyNvidia] == LabelValueNvidia {
    +    return nil
     }
    
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion removes unnecessary nesting, making the code cleaner and easier to read without changing the functionality.

    Medium
    Remove obsolete test

    Remove redundant test case as the function no longer retrieves the node within
    UpdateNodeWithLabel.

    pkg/utils/resources/nodes_test.go [25-30]

    -"Fail to update node because it cannot be retrieved": {
    -    callMocks: func(c *test.MockClient) {
    -        c.On("Get", mock.IsType(context.Background()), client.ObjectKey{Name: "mockNode"}, mock.IsType(&corev1.Node{}), mock.Anything).Return(errors.New("Cannot retrieve node"))
    -    },
    -    expectedError: errors.New("Cannot retrieve node"),
    -},
    +# Test case removed
    
    Suggestion importance[1-10]: 8

    __

    Why: The test case is indeed obsolete since the function no longer retrieves the node within UpdateNodeWithLabel, making this test irrelevant.

    Medium

    Previous suggestions

    Suggestions up to commit 6b4fe1f
    CategorySuggestion                                                                                                                                    Impact
    General
    Simplify label check

    Remove unnecessary label value check.

    pkg/utils/resources/nodes.go [53-55]

    -if nvidiaLabelVal, found := freshNode.Labels[LabelKeyNvidia]; found {
    -    if nvidiaLabelVal == LabelValueNvidia {
    -        return nil
    -    }
    +if _, found := freshNode.Labels[LabelKeyNvidia]; found {
    +    return nil
     }
    
    Suggestion importance[1-10]: 6

    __

    Why: The suggestion simplifies the label check by removing the unnecessary value comparison, which improves code readability and maintainability without changing the functionality.

    Low
    Suggestions up to commit 537b325
    CategorySuggestion                                                                                                                                    Impact
    General
    Simplify label check

    Remove unnecessary label value check.

    pkg/utils/resources/nodes.go [53-55]

    -if nvidiaLabelVal, found := freshNode.Labels[LabelKeyNvidia]; found {
    -    if nvidiaLabelVal == LabelValueNvidia {
    -        return nil
    -    }
    +if _, found := freshNode.Labels[LabelKeyNvidia]; found {
    +    return nil
     }
    
    Suggestion importance[1-10]: 6

    __

    Why: The suggestion simplifies the label check by removing the unnecessary value comparison. This improves the code's readability and maintainability without affecting its functionality.

    Low
    Suggestions up to commit 5f9a5d0
    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Fix Nvidia plugin installation logic

    Correct logic to install Nvidia plugin if not found.

    pkg/workspace/controllers/workspace_controller.go [511-513]

    -if found := resources.CheckNvidiaPlugin(ctx, freshNode); found {
    -    return nil
    +if !resources.CheckNvidiaPlugin(ctx, freshNode) {
    +    err = resources.UpdateNodeWithLabel(ctx, freshNode, resources.LabelKeyNvidia, resources.LabelValueNvidia, c.Client)
    +    if apierrors.IsNotFound(err) {
    +        klog.ErrorS(err, "nvidia plugin cannot be installed, node not found", "node", freshNode.Name)
    +        if updateErr := c.updateStatusConditionIfNotMatch(ctx, wObj, kaitov1beta1.ConditionTypeNodeClaimStatus, metav1.ConditionFalse,
    +            "checkNodeClaimStatusFailed", err.Error()); updateErr != nil {
    +            klog.ErrorS(updateErr, "failed to update workspace status", "workspace", klog.KObj(wObj))
    +            return updateErr
    +        }
    +        return err
    +    }
     }
    
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion corrects the logic to ensure the Nvidia plugin is installed only if not found, addressing a potential bug in the original code.

    Medium
    General
    Simplify Nvidia label check

    Remove unnecessary check for Nvidia label value.

    pkg/utils/resources/nodes.go [56-59]

    -if nvidiaLabelVal, found := freshNode.Labels[LabelKeyNvidia]; found {
    -    if nvidiaLabelVal == LabelValueNvidia {
    -        return nil
    -    }
    +if _, found := freshNode.Labels[LabelKeyNvidia]; found {
    +    return nil
     }
    
    Suggestion importance[1-10]: 5

    __

    Why: The suggestion simplifies the check by removing the unnecessary comparison, which improves readability but has minimal functional impact.

    Low
    Suggestions up to commit 703d9bd
    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Fix Nvidia plugin check logic

    Correct logic to update label if Nvidia plugin is not found.

    pkg/workspace/controllers/workspace_controller.go [511-513]

    -if found := resources.CheckNvidiaPlugin(ctx, freshNode); found {
    -    return nil
    +if !resources.CheckNvidiaPlugin(ctx, freshNode) {
    +    err = resources.UpdateNodeWithLabel(ctx, freshNode, resources.LabelKeyNvidia, resources.LabelValueNvidia, c.Client)
    +    if apierrors.IsNotFound(err) {
    +        klog.ErrorS(err, "nvidia plugin cannot be installed, node not found", "node", freshNode.Name)
    +        if updateErr := c.updateStatusConditionIfNotMatch(ctx, wObj, kaitov1beta1.ConditionTypeNodeClaimStatus, metav1.ConditionFalse,
    +            "checkNodeClaimStatusFailed", err.Error()); updateErr != nil {
    +            klog.ErrorS(updateErr, "failed to update workspace status", "workspace", klog.KObj(wObj))
    +            return updateErr
    +        }
    +        return err
    +    } else if err != nil {
    +        klog.Warning("UpdateNodeWithLabel failed", "node", freshNode.Name, "error", err)
    +    }
     }
    
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion corrects the logic to update the label if the Nvidia plugin is not found, addressing a potential bug in the original code.

    Medium
    General
    Simplify Nvidia label check

    Remove unnecessary check for Nvidia label value.

    pkg/utils/resources/nodes.go [56-59]

    -if nvidiaLabelVal, found := freshNode.Labels[LabelKeyNvidia]; found {
    -    if nvidiaLabelVal == LabelValueNvidia {
    -        return nil
    -    }
    +if _, found := freshNode.Labels[LabelKeyNvidia]; found {
    +    return nil
     }
    
    Suggestion importance[1-10]: 5

    __

    Why: The suggestion removes an unnecessary check for the Nvidia label value, which simplifies the code but does not address any critical issue.

    Low
    Improve error logging

    Log error details instead of generic message.

    pkg/workspace/controllers/workspace_controller.go [525]

    -klog.Warning("UpdateNodeWithLabel failed", "node", freshNode.Name, "error", err)
    +klog.WarningS("UpdateNodeWithLabel failed", "node", freshNode.Name, "error", err)
    
    Suggestion importance[1-10]: 3

    __

    Why: The suggestion improves error logging by using klog.WarningS instead of klog.Warning, which is a minor improvement in logging detail.

    Low

    kaito-pr-agent[bot] avatar Apr 28 '25 09:04 kaito-pr-agent[bot]

    If we add node informer(https://github.com/kaito-project/kaito/pull/1052), resources.GetNode will get node from cache rather than apiserver. But this fix should still work, because node informer will make sure the cache always sync with apiserver.

    cvvz avatar Apr 28 '25 10:04 cvvz

    @cvvz I agree with you, we need to get fresh node before checking nvidia plugin. btw: please fix go lint errors and unit tests errors.

    rambohe-ch avatar Apr 29 '25 12:04 rambohe-ch