katib icon indicating copy to clipboard operation
katib copied to clipboard

Failed parsing of V1beta1ParameterSpec in python 3.11.9 & kubeflow-katib: 0.18.0

Open david-thrower opened this issue 8 months ago • 11 comments

What happened?

Issue described:

Environment details:

  • Python 3.11.9 (Canonical managed Kubeflow Jupyter environment)
  • Kubeflow SDK versions
    • kubeflow: 0.0.1rc0
    • kubeflow-katib: 0.18.0
    • kubeflow-training: 1.9.0

TLDR:

  • I tried to include the param "activation": katib.search.categorical(["relu","gelu", "elu"]) in a parameters argument for KatibClient.tune().
  • this returned the error: HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validator.experiment.katib.kubeflow.org\" denied the request: feasibleSpace must be specified in spec.parameters[1]: {activation categorical { [] }}","code":400}

For reference, the relevant block of code:


def objective(parameters):
    # This is rather long, hence omitted.
    print f"val_binary_accuracy={result}"

# Import the SDK

import kubeflow.katib as katib

from kubeflow.katib import V1beta1FeasibleSpace
from kubeflow.katib import V1beta1ParameterSpec


# Workaround for failed parameter spec parsing ... katib.search.categorical(["relu","gelu", "elu"]),

foo = V1beta1ParameterSpec(
        parameter_type="categorical",
        feasible_space=V1beta1FeasibleSpace(distribution = ["relu","gelu", "elu"], list = ["relu","gelu", "elu"]),
    )
print(type(foo))
print(foo)

parameters = {
    "embedding_n": katib.search.int(min=10, max=18, step=1),

    # Workaround that worked
   "activation": foo,
    
    # Failing code that should have worked
    # "activation": katib.search.categorical(["relu","gelu", "elu"]),
    
    "predecessor_level_connection_affinity_factor_first": katib.search.double(min=0.1, max=50, step=0.1),
    "predecessor_level_connection_affinity_factor_main": katib.search.double(min=0.1, max=50, step=0.1),
    "max_consecutive_lateral_connections": katib.search.int(min=1, max=50, step=1),
    "p_lateral_connection":  katib.search.double(min=0.1, max=50, step=0.1),
    "num_lateral_connection_tries_per_unit": katib.search.int(min=1, max=50, step=1),
    "learning_rate": katib.search.double(min=10 ** -5, max=0.3, step=10 ** -5),
    "epochs": katib.search.int(min=1, max=25, step=1),
    "batch_size": katib.search.int(min=1, max=35, step=1),
    "dropout":  katib.search.double(min=0.05, max=0.95, step=0.05),
    "maximum_units_per_level": katib.search.int(min=5, max=10, step=1),
    "maximum_neurons_per_unit": katib.search.int(min=1, max=9, step=1),
    "temperature": katib.search.int(min=10**4, max=10**6, step=1000),
}

katib_client = katib.KatibClient(namespace=NAMESPACE)

algorithm_config = {
    "algorithm_name": "multivariate-tpe",
    "algorithm_settings": [
        {"name": "n_startup_trials", "value": "7"},
        {"name": "n_ei_candidates", "value": "24"},
        {"name": "random_state", "value": "42"}
    ]
}


katib_client.tune(
    name=JOB_NAME,
    objective=objective,
    parameters=parameters,
    objective_metric_name="val_binary_accuracy",
    algorithm_name = "multivariate-tpe",
    algorithm_settings= algorithm_config,
    max_trial_count=25,
    parallel_trial_count=2,
    max_failed_trial_count=15,
    resources_per_trial={"cpu": "7", "memory": "21Gi"}
)

I did a little troubleshooting:

  • I did some investigating by browsing the SDK code: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/search.py
  • I found that the function search.categorical() has a parameter named list.
  • I suspect that this could be the first potential problem, as this is a naming collision with the default Python keyword list in the global scope.
  • Digging deeper, however, I found that there is a problem in line 65 on https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/search.py. search.categorical() passes the argument for the parameter list to V1beta1FeasibleSpace without specifying which arg it pertains to.
  • When I manually reproduce what search.categorical() is doing internally as seen below and print this out, you see the issue becomes obvious ...
  • The property distribution is successfully set to the list which search.categorical() was configured with, but the property list is not being getting set, as you see below V1beta1ParameterSpec.feasible_space.list is set to None.

I do this:

foo = V1beta1ParameterSpec(
        parameter_type="categorical",
        feasible_space=V1beta1FeasibleSpace(["relu","gelu", "elu"]),
    )
print(type(foo))
print(foo)

This prints out:

<class 'kubeflow.katib.models.v1beta1_parameter_spec.V1beta1ParameterSpec'>
{'feasible_space': {'distribution': ['relu', 'gelu', 'elu'],
                    'list': None,  #   <------------------------------------------------------------------------------------<<<<   # I think this is the problem
                    'max': None,
                    'min': None,
                    'step': None},
 'name': None,
 'parameter_type': 'categorical'}

If I repeat this, but I explicitly reference the parameters*: distribution and list when instantiating the V1beta1FeasibleSpace which is used to instantiate V1beta1ParameterSpec, the properties ..distribution and ..list are both properly set:

my_list = ["relu","gelu", "elu"]
foo = V1beta1ParameterSpec(
        parameter_type="categorical",
        feasible_space=V1beta1FeasibleSpace(distribution = my_list, 
                                                                        list = my_list),
    )

print(type(foo))
print(foo)

Now this looks correct:

<class 'kubeflow.katib.models.v1beta1_parameter_spec.V1beta1ParameterSpec'>
{'feasible_space': {'distribution': ['relu', 'gelu', 'elu'],
                    'list': ['relu', 'gelu', 'elu'],
                    'max': None,
                    'min': None,
                    'step': None},
 'name': None,
 'parameter_type': 'categorical'}

When I replaced the call search.categorical(['relu', 'gelu', 'elu']]) with the correctly, manually created V1beta1ParameterSpec ('activations': foo), in the parameters and submit the job using CatibClient.tune(), the API accepted and created the job without the error.

Suggested revisions:

  1. In line 65 on https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/search.py: Reference the parameters distribution and list explicitly OR in V1beta1FeasibleSpace and V1beta1ParameterSpec, make sure that init is setting both self.distribution and self.list.
  2. Rename the parameter 'list', as this is a naming collision with a Python keyword referencing the function list(), which may be causing an empty list to be instantiated and assigned when V1beta1FeasibleSpace.list() sets self._list = list, which the empty list may be getting cast to a null type in searialization, explaining the None value.

What did you expect to happen?

  1. I would have expected the parameter "activation": katib.search.categorical(["relu","gelu", "elu"]) to have properly instantiated a V1beta1ParameterSpec object with both list and distribution set to ["relu","gelu", "elu"].
  2. I would have expected the submission as such to have been accepted and Katib to have run the training task.

Environment

Kubernetes version:

WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.14", GitCommit:"f678fbc2f5af87ec6552b7ac6ebbe26258b0d120", GitTreeState:"clean", BuildDate:"2024-05-14T10:53:32Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"linux/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"29", GitVersion:"v1.29.10", GitCommit:"f0c1ea863533246b6d3fda3e6addb7c13c8a6359", GitTreeState:"clean", BuildDate:"2024-10-23T15:29:12Z", GoVersion:"go1.22.8", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.29) exceeds the supported minor version skew of +/-1

Katib controller version:


# Managed instance. I don't have access to this. I am on the Azure deployment of Charmed Kubeflow 1.9



Katib Python SDK version:

Version: 0.18.0

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

david-thrower avatar Apr 10 '25 21:04 david-thrower

@juliusvonkohout FYI, confirmed bug.

david-thrower avatar Apr 10 '25 22:04 david-thrower

FYI only @StefanoFioravanzo

david-thrower avatar Apr 10 '25 22:04 david-thrower

Then lets tag the maintainers @Electronic-Waste @tenzen-y @andreyvelich

juliusvonkohout avatar Apr 14 '25 13:04 juliusvonkohout

Thank you @juliusvonkohout

david-thrower avatar Apr 14 '25 14:04 david-thrower

Thank you to report this @david-thrower! @mahdikhashan @helenxie-bit @Electronic-Waste please could you help us taking a look at this bug ?

andreyvelich avatar Apr 14 '25 22:04 andreyvelich

/area sdk

andreyvelich avatar Apr 14 '25 22:04 andreyvelich

One related comment is that a more global fix that may be worth considering is replacing https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/models/* with Pydantic models representing the same schemas. It could remove a lot of boilerplate code, like "getters and setters", while still enforcing the schema, correct typing, and proper serialization. We would probably just have to create a subclass of float that enforces the range of a Double in Go, for example, to ensure that doubles serialize identically, otherwise, it may be fairly straightforward. There could be other benefits too like integration with SQLalchemy ORM.

With the newest LLMs like LLAMA 4 Maveric which has a 10 million token context window and Gemini Pro, it may be practical to refactor it in bulk by sending entire modules through for AI refactoring, plus, for context sending the entire codebase and the docs for Pydantic through the LLM with it, ... so it may not be quite as much work as it sounds like ...

david-thrower avatar Apr 15 '25 04:04 david-thrower

Thank you to report this @david-thrower! @mahdikhashan @helenxie-bit @Electronic-Waste please could you help us taking a look at this bug ?

thanks for mentioning me here, i'll look into it by tomorrow.

mahdikhashan avatar Apr 15 '25 08:04 mahdikhashan

with Pydantic models representing the same schemas.

Yes, we should do that, similar to Kubeflow Trainer: https://github.com/kubeflow/trainer/blob/master/sdk/kubeflow/trainer/models/trainer_v1alpha1_trainer.py#L20

We should migrate the SDK to use the newest version of OpenAPI generator: https://github.com/kubeflow/trainer/blob/master/hack/python-sdk/gen-sdk.sh#L34

andreyvelich avatar Apr 15 '25 12:04 andreyvelich

/assign

mahdikhashan avatar Apr 16 '25 14:04 mahdikhashan

@mahdikhashan Thanks for taking this! Feel free to ask any questions if needed.

/remove-label lifecycle/needs-triage

Electronic-Waste avatar Apr 17 '25 02:04 Electronic-Waste

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jul 16 '25 05:07 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Aug 05 '25 05:08 github-actions[bot]