
Duplicate hyperparameters waste compute and time

Open Antsypc opened this issue 5 months ago • 6 comments

What happened?

I am experiencing a recurring issue where the hyperparameter tuning process generates duplicate sets of parameters, leading to inefficient use of GPU resources.

For instance, with the experimental setup below:

spec:
  algorithm:
    algorithmName: bayesianoptimization
  maxTrialCount: 10
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    goal: 1
    metricStrategies:
      - name: Accuracy
        value: max
    objectiveMetricName: Accuracy
    type: maximize
  parallelTrialCount: 1
  parameters:
    - feasibleSpace:
        list:
          - "0.01"
          - "1"
          - "5"
          - "10"
          - "0.1"
      name: C
      parameterType: categorical
    - feasibleSpace:
        list:
          - linear
          - rbf
          - poly
          - sigmoid
      name: kernal
      parameterType: categorical
    - feasibleSpace:
        list:
          - "0.1"
          - "0.001"
          - "0.01"
      name: gamma
      parameterType: categorical

The suggestions provided by the algorithm are often redundant. The following output illustrates this problem, showing only the duplicated suggestions for clarity:

spec:
  algorithm:
    algorithmName: bayesianoptimization
  requests: 10
  resumePolicy: Never
status:
  suggestionCount: 10
  suggestions:
    - name: mnist-pytorch-rep-bo-1-v1-g2qml7vd
      parameterAssignments:
        - name: C
          value: "10"
        - name: kernal
          value: poly
        - name: gamma
          value: "0.1"
    - name: mnist-pytorch-rep-bo-1-v1-hjzn82gn
      parameterAssignments:
        - name: C
          value: "10"
        - name: kernal
          value: poly
        - name: gamma
          value: "0.1"

I have run this experiment repeatedly, and in extreme cases, all 10 suggestions are identical.

What did you expect to happen?

  • Filter out duplicated hyperparameters.
  • Early stop once the hyperparameter search space is exhausted.

Environment

  • Kubernetes version: v1.25.14
  • Katib controller version: kubeflow/katib-controller:v0.16.0

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍

Antsypc avatar Sep 12 '25 03:09 Antsypc

Proposal: Prevent Duplicate Hyperparameter Suggestions

Issue: #2571
Summary: Add duplicate detection to the suggestion controller to prevent wasted compute on identical hyperparameter trials.

Problem

Katib allows duplicate suggestions to create trials. Users report "in extreme cases, all 10 suggestions are identical" causing:

  • Wasted GPU/CPU on redundant experiments
  • Inefficient use of trial budgets
  • Poor user experience with small categorical search spaces

Example (60 categorical combinations, Bayesian Optimization):

# Both suggestions identical - waste of resources
- trial-1: {C: "10", kernel: poly, gamma: "0.1"}
- trial-2: {C: "10", kernel: poly, gamma: "0.1"}

Root Cause:

  • scikit-optimize uses random sampling for first n_initial_points=10 trials
  • Small categorical spaces → high duplicate probability
  • No validation layer prevents duplicate trials
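
The high duplicate rate under random initialization is just the birthday problem. A quick stdlib-only estimate for this 60-combination space (assuming the initial points are sampled uniformly with replacement, which matches the random-initialization behavior described above):

```python
from math import prod

space = 60   # 5 (C) x 4 (kernel) x 3 (gamma) categorical combinations
draws = 10   # scikit-optimize's default n_initial_points

# P(all 10 random draws are distinct), sampling uniformly with replacement
p_all_unique = prod((space - i) / space for i in range(draws))
p_duplicate = 1 - p_all_unique
print(f"P(at least one duplicate) = {p_duplicate:.2f}")  # ≈ 0.55
```

So even before the optimizer's exploitation phase, more than half of all runs are expected to contain at least one duplicate in the first 10 trials.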

Solution

Add duplicate filtering in SyncAssignments() before creating trials:

// pkg/controller.v1beta1/suggestion/suggestionclient/suggestionclient.go

// parametersEqual reports whether two assignment lists contain the same
// name/value pairs, regardless of order.
func parametersEqual(a, b []commonapiv1beta1.ParameterAssignment) bool {
	if len(a) != len(b) {
		return false
	}
	aMap := make(map[string]string, len(a))
	for _, p := range a {
		aMap[p.Name] = p.Value
	}
	for _, p := range b {
		if v, ok := aMap[p.Name]; !ok || v != p.Value {
			return false
		}
	}
	return true
}

// isDuplicate reports whether assignment matches any existing suggestion
// or any previously created trial.
func isDuplicate(
	assignment []commonapiv1beta1.ParameterAssignment,
	suggestions []suggestionsv1beta1.TrialAssignment,
	trials []trialsv1beta1.Trial) bool {

	for _, s := range suggestions {
		if parametersEqual(assignment, s.ParameterAssignments) {
			return true
		}
	}
	for _, t := range trials {
		if parametersEqual(assignment, t.Spec.ParameterAssignments) {
			return true
		}
	}
	return false
}

func (g *General) SyncAssignments(...) error {
	// ... get suggestions from algorithm ...

	uniqueAssignments := []suggestionsv1beta1.TrialAssignment{}
	duplicateCount := 0

	for _, suggestion := range responseSuggestion.ParameterAssignments {
		// Compare against prior suggestions and existing trials (ts), and
		// also against assignments already accepted from this batch, so
		// within-batch duplicates are filtered too.
		if isDuplicate(suggestion.Assignments, instance.Status.Suggestions, ts) ||
			isDuplicate(suggestion.Assignments, uniqueAssignments, nil) {
			duplicateCount++
			continue
		}
		uniqueAssignments = append(uniqueAssignments, createTrialAssignment(suggestion))
	}

	if duplicateCount > 0 {
		g.recorder.Eventf(instance, corev1.EventTypeWarning,
			"DuplicateSuggestionsFiltered",
			"Filtered %d duplicate(s), created %d unique suggestion(s)",
			duplicateCount, len(uniqueAssignments))
	}

	instance.Status.Suggestions = append(instance.Status.Suggestions, uniqueAssignments...)
	return nil
}

Why controller-side:

  • ✅ Works with all algorithms (no library changes)
  • ✅ Single validation point
  • ✅ Backward compatible
  • ✅ O(n) performance with hash maps
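
For reference, the hash-map lookup the last bullet alludes to can be sketched in Python (names here are illustrative, not Katib's actual API): an order-insensitive `frozenset` key makes each duplicate check O(1) after O(n) key construction.

```python
def assignment_key(assignments):
    """Order-insensitive, hashable key for a list of parameter assignments."""
    return frozenset((a["name"], a["value"]) for a in assignments)

def filter_duplicates(proposed, existing):
    """Keep only proposals not seen in `existing` or earlier in this batch."""
    seen = {assignment_key(a) for a in existing}
    unique = []
    for a in proposed:
        k = assignment_key(a)
        if k not in seen:
            seen.add(k)
            unique.append(a)
    return unique
```

Because the key ignores ordering, `[{C: "10"}, {kernel: poly}]` and `[{kernel: poly}, {C: "10"}]` hash identically and the second is filtered.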

Testing

Unit Tests:

func TestParametersEqual(t *testing.T) {
    // Test identical parameters → true
    // Test different order, same values → true  
    // Test different values → false
}

func TestIsDuplicate(t *testing.T) {
    // Test against existing suggestions
    // Test against existing trials
    // Test with no duplicates
}

Integration Test - Small categorical space (2×2=4 combinations):

parameters:
  - name: optimizer
    parameterType: categorical
    feasibleSpace: {list: ["adam", "sgd"]}
  - name: activation
    parameterType: categorical
    feasibleSpace: {list: ["relu", "tanh"]}
maxTrialCount: 10

Expected: 4 unique trials, duplicates filtered, warning event emitted.
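
The expected behavior can be simulated with a stdlib-only sketch (the random sampler here stands in for the algorithm service; the filter mirrors the proposed controller logic):

```python
import itertools
import random

# The 2x2 space from the integration test above
space = list(itertools.product(["adam", "sgd"], ["relu", "tanh"]))

random.seed(0)
proposed = [random.choice(space) for _ in range(10)]  # algorithm proposes 10

seen, unique = set(), []
for p in proposed:
    if p in seen:
        continue  # would increment duplicateCount and emit the warning event
    seen.add(p)
    unique.append(p)

print(len(unique))  # at most 4: the space has only 4 combinations
```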

Alternatives Considered

  • Modify algorithm services (skopt, Optuna): high maintenance; would require forking the libraries
  • Database deduplication: performance overhead, complex schema changes
  • User-configurable option: doesn't solve the problem by default

Files Modified:

  • pkg/controller.v1beta1/suggestion/suggestionclient/suggestionclient.go
  • pkg/controller.v1beta1/suggestion/suggestionclient/suggestionclient_test.go

NarayanaSabari avatar Nov 05 '25 15:11 NarayanaSabari

@andreyvelich Could you please review this proposal and let me know if you have any feedback? Once I get your input, I’ll start working on the issue.

NarayanaSabari avatar Nov 05 '25 15:11 NarayanaSabari

Thanks for creating this @Antsypc! @NarayanaSabari I would suggest that we first migrate off of the deprecated skopt algorithm service: https://github.com/kubeflow/katib/issues/2280

As we discussed, we would like to use Optuna's GPSampler: https://github.com/kubeflow/katib/issues/2280#issuecomment-1993658378

Then we need to see whether this sampler produces any duplicated hyperparameters.

Do you want to take over this task?

cc @contramundum53

andreyvelich avatar Nov 05 '25 16:11 andreyvelich

Thanks @andreyvelich for the feedback!

You're absolutely right - addressing the root cause makes more sense. I'll take over the migration from deprecated scikit-optimize to Optuna's GPSampler (#2280) first.

Plan:

  1. Complete the skopt → Optuna GPSampler migration
  2. Test if duplicate hyperparameters still occur with the new algorithm
  3. If duplicates persist, implement the controller-level filtering as an algorithm-agnostic safety layer

This approach will modernize Katib's Bayesian Optimization to use a maintained library while naturally testing whether duplicates are algorithm-specific or a broader issue.

I'll start working on #2280 and report back with findings. Thanks for the guidance!

NarayanaSabari avatar Nov 06 '25 03:11 NarayanaSabari

/assign

NarayanaSabari avatar Nov 06 '25 03:11 NarayanaSabari

If you decide to use GPSampler in Optuna and you don't want duplicate suggestions, I would suggest using deterministic_objective=True. That option fixes the noise variance to a very low level and should eliminate most of the duplicate suggestions.

If you have stricter requirements (for example, never producing duplicate suggestions, or stopping automatically when the search space is exhausted), you would still need some mechanism to prevent them.

One limitation is that deterministic_objective=True really assumes the objective function is deterministic; if it's not, the optimization may become unstable (especially with a continuous search space, where many points can be very close but not identical).
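
For the stricter requirement, a guard around any suggestion source can resample on duplicates and stop once a finite categorical space is exhausted. A stdlib-only sketch (the `ask` callable and `DuplicateGuard` name are illustrative, not any library's API):

```python
import itertools
import random

class DuplicateGuard:
    """Reject duplicate suggestions; raise StopIteration once a finite
    categorical space is exhausted. `ask` is any zero-argument function
    returning one candidate (here: a tuple of categorical values)."""

    def __init__(self, ask, space_size, max_retries=100):
        self.ask = ask
        self.space_size = space_size
        self.max_retries = max_retries
        self.seen = set()

    def suggest(self):
        if len(self.seen) >= self.space_size:
            raise StopIteration("search space exhausted")
        for _ in range(self.max_retries):
            candidate = self.ask()
            if candidate not in self.seen:
                self.seen.add(candidate)
                return candidate
        raise StopIteration("no unseen candidate within retry budget")

# 3 x 2 = 6 categorical combinations
space = list(itertools.product(["0.01", "0.1", "1"], ["linear", "rbf"]))
random.seed(0)
guard = DuplicateGuard(lambda: random.choice(space), space_size=len(space))

suggestions = []
try:
    while True:
        suggestions.append(guard.suggest())
except StopIteration:
    pass
# Each combination appears at most once before the guard stops.
```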

contramundum53 avatar Nov 12 '25 04:11 contramundum53