
Add entry in api docs for class method KatibClient.tune()

Open david-thrower opened this issue 8 months ago • 0 comments

Issue described

The getting-started documentation [1] shows the code block below, which calls the method KatibClient.tune(). The API docs page for KatibClient [2] does not list this method, although the source code does include it [3].

# Create an objective function.
def objective(parameters):
    # Import required packages.
    import time
    time.sleep(5)
    # Calculate objective function.
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    # Katib parses metrics in this format: <metric-name>=<metric-value>.
    print(f"result={result}")

import kubeflow.katib as katib

# Create hyperparameter search space.
parameters = {
    "a": katib.search.int(min=10, max=20),
    "b": katib.search.double(min=0.1, max=0.2)
}

# Create the Katib client.
katib_client = katib.KatibClient(namespace="kubeflow")

name = "tune-experiment"
# Create a Katib Experiment with 12 Trials and 2 CPUs per Trial.
# Note: tune() is not listed in the docs at master/sdk/python/v1beta1/docs/KatibClient.md.
katib_client.tune(
    name=name,
    objective=objective,
    parameters=parameters,
    objective_metric_name="result",
    max_trial_count=12,
    resources_per_trial={"cpu": "2"},
)

# Wait until the Katib Experiment is complete.
katib_client.wait_for_experiment_condition(name=name)

# Get the best hyperparameters.
print(katib_client.get_optimal_hyperparameters(name))
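
The metric format mentioned in the objective function (`<metric-name>=<metric-value>`) can be sanity-checked locally before anything is submitted to a cluster. The sketch below is only an illustration: `parse_metric_lines` and `run_objective_locally` are hypothetical helpers, not part of the Katib SDK, and the regular expression is an assumption about what a simple stdout parser might accept, not Katib's actual metrics collector. Passing the parameter values as strings is also an assumption, suggested by the `int(...)` and `float(...)` conversions in the example.

```python
import contextlib
import io
import re


def objective(parameters):
    # Same calculation as the objective in the example above (sleep omitted).
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    print(f"result={result}")


def parse_metric_lines(text):
    """Hypothetical helper: collect <metric-name>=<metric-value> pairs from text."""
    pattern = r"\s*([\w\-]+)=([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*"
    metrics = {}
    for line in text.splitlines():
        match = re.fullmatch(pattern, line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


def run_objective_locally(parameters):
    """Hypothetical helper: run the objective and parse whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        objective(parameters)
    return parse_metric_lines(buffer.getvalue())


# Assumption: hyperparameter values arrive as strings.
metrics = run_objective_locally({"a": "10", "b": "0.1"})
```

If `metrics` does not contain the name later passed as `objective_metric_name`, the objective is probably not printing the metric in the expected format.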

Suggested revision:

Under https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/docs/KatibClient.md#tune, add something to the basic effect of the following (mostly Gemini-generated):


# tune

Parameters

    name (str):
        Description: The desired name for the Katib Experiment resource to be created in Kubernetes. This name must be unique within the specified namespace.
    namespace (str, optional):
        Description: The Kubernetes namespace where the Katib Experiment and its associated Trial resources will be created.
        Required: No.
    objective (Callable):
        Description: Callable function to be optimized. This should be the training task, taking the hyperparameters as its argument, and it should print the string f"{NAME_OF_METRIC}={result}" so that Katib can parse the metric. In the example above, the metric name is passed separately through objective_metric_name.
        Required: Yes.
    parameters (Dict[str, Any]):
        Description: A dictionary defining the hyperparameters to be tuned. Each key is a hyperparameter name (e.g., learning_rate, num_layers) and each value describes its search space, created with helpers such as katib.search.int(min=..., max=...) or katib.search.double(min=..., max=...), as in the example above.
        Required: Yes.
    algorithm (V1beta1AlgorithmSpec):
        Description: Specifies the search algorithm Katib should use to explore the hyperparameter space (e.g., random, bayesianoptimization, hyperband, tpe). This object includes the algorithm name and optionally algorithm-specific settings. You create this using V1beta1AlgorithmSpec(...).
        Required: No (the example above calls tune() without it).
    trial_template (V1beta1TrialTemplate):
        Description: Defines how to run a single trial (one evaluation of a specific set of hyperparameters). This is crucial and typically contains:
            The configuration for the primary container running the training code.
            Placeholders for Katib to inject the hyperparameter values for the current trial (e.g., ${trialParameters.learningRate}).
            Information on how the trial reports metrics back to Katib (e.g., using stdout metric collector, file metric collector, or custom sidecars).
            The specification of the Kubernetes resource (e.g., Job, TFJob, PyTorchJob) to be created for each trial.
        Required: No (the example above calls tune() without it).
    max_trial_count (Optional[int], optional):
        Description: The maximum total number of Trials (hyperparameter evaluations) that this Experiment is allowed to run. Acts as a budget limit. If None, the experiment might run indefinitely or rely on other completion criteria.
        Required: No.
        Default: None
    parallel_trial_count (Optional[int], optional):
        Description: The maximum number of Trials that can run concurrently at any given time.
        Required: No.
        Default: None (Katib usually defaults this to 3 if not set).
    max_failed_trial_count (Optional[int], optional):
        Description: The maximum number of Trials that are allowed to fail before the entire Experiment is marked as Failed.
        Required: No.
        Default: None (Katib usually defaults this based on cluster config).
    resume_policy (Optional[str], optional):
        Description: Defines the behavior if the Katib controller restarts or the Experiment is interrupted. Common values are Never (default), FromVolume, LongRunning. Never means the experiment starts fresh if interrupted.
        Required: No.
        Default: V1beta1Experiment.DEFAULT_RESUME_POLICY (which is typically Never).
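
To make the roles of parameters, max_trial_count, and max_failed_trial_count more concrete, here is a minimal local sketch of a random search with a trial budget. It does not use Katib at all: the tuple-based `search_space`, `sample`, and `local_random_search` are hypothetical stand-ins for illustration, and the exact point at which Katib marks an Experiment as Failed is not modeled here.

```python
import random

# Hypothetical stand-in for the katib.search.int / katib.search.double
# entries in the example: (kind, min, max) per hyperparameter.
search_space = {
    "a": ("int", 10, 20),
    "b": ("double", 0.1, 0.2),
}


def sample(space, rng):
    """Draw one random assignment of hyperparameters from the space."""
    values = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            values[name] = rng.randint(low, high)  # inclusive bounds
        else:
            values[name] = rng.uniform(low, high)
    return values


def objective_value(parameters):
    # Same formula as the objective in the example, returned instead of printed.
    return 4 * int(parameters["a"]) - float(parameters["b"]) ** 2


def local_random_search(space, max_trial_count, max_failed_trial_count=0, seed=0):
    """Run up to max_trial_count trials; give up after too many failures."""
    rng = random.Random(seed)
    trials = []
    failed = 0
    for _ in range(max_trial_count):
        params = sample(space, rng)
        try:
            trials.append((objective_value(params), params))
        except Exception:
            failed += 1
            if failed > max_failed_trial_count:
                raise RuntimeError("too many failed trials")
    # Maximize the objective, as in the example's "result" metric.
    return max(trials, key=lambda trial: trial[0])


best_value, best_params = local_random_search(search_space, max_trial_count=12)
```

In this sketch trials run one after another; parallel_trial_count would correspond to how many of them are allowed to run at the same time.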

Returns

    V1beta1Experiment:
        Description: Returns a V1beta1Experiment object. This object is the Python representation of the Experiment custom resource that was created (or attempted to be created) in Kubernetes. It contains the full specification (spec) provided via the arguments, and potentially initial status information (status). You can use this object later with other KatibClient methods (like get_experiment) or standard Kubernetes client libraries to monitor the experiment's progress.

[1] https://www.kubeflow.org/docs/components/katib/getting-started/#example-using-random-search-algorithm

[2] https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/docs/KatibClient.md

[3] https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py

david-thrower • Apr 04 '25 22:04