Failed parsing of V1beta1ParameterSpec in python 3.11.9 & kubeflow-katib: 0.18.0
What happened?
Issue described:
Environment details:
- Python 3.11.9 (Canonical managed Kubeflow Jupyter environment)
- Kubeflow SDK versions
- kubeflow: 0.0.1rc0
- kubeflow-katib: 0.18.0
- kubeflow-training: 1.9.0
TLDR:
- I tried to include the param
"activation": katib.search.categorical(["relu","gelu", "elu"])in a parameters argument for KatibClient.tune(). - this returned the error:
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validator.experiment.katib.kubeflow.org\" denied the request: feasibleSpace must be specified in spec.parameters[1]: {activation categorical { [] }}","code":400}
For reference, the relevant block of code:
def objective(parameters):
# This is rather long, hence omitted.
print f"val_binary_accuracy={result}"
# Import the SDK
import kubeflow.katib as katib
from kubeflow.katib import V1beta1FeasibleSpace
from kubeflow.katib import V1beta1ParameterSpec
# Workaround for failed parameter spec parsing ... katib.search.categorical(["relu","gelu", "elu"]),
foo = V1beta1ParameterSpec(
parameter_type="categorical",
feasible_space=V1beta1FeasibleSpace(distribution = ["relu","gelu", "elu"], list = ["relu","gelu", "elu"]),
)
print(type(foo))
print(foo)
parameters = {
"embedding_n": katib.search.int(min=10, max=18, step=1),
# Workaround that worked
"activation": foo,
# Failing code that should have worked
# "activation": katib.search.categorical(["relu","gelu", "elu"]),
"predecessor_level_connection_affinity_factor_first": katib.search.double(min=0.1, max=50, step=0.1),
"predecessor_level_connection_affinity_factor_main": katib.search.double(min=0.1, max=50, step=0.1),
"max_consecutive_lateral_connections": katib.search.int(min=1, max=50, step=1),
"p_lateral_connection": katib.search.double(min=0.1, max=50, step=0.1),
"num_lateral_connection_tries_per_unit": katib.search.int(min=1, max=50, step=1),
"learning_rate": katib.search.double(min=10 ** -5, max=0.3, step=10 ** -5),
"epochs": katib.search.int(min=1, max=25, step=1),
"batch_size": katib.search.int(min=1, max=35, step=1),
"dropout": katib.search.double(min=0.05, max=0.95, step=0.05),
"maximum_units_per_level": katib.search.int(min=5, max=10, step=1),
"maximum_neurons_per_unit": katib.search.int(min=1, max=9, step=1),
"temperature": katib.search.int(min=10**4, max=10**6, step=1000),
}
katib_client = katib.KatibClient(namespace=NAMESPACE)
algorithm_config = {
"algorithm_name": "multivariate-tpe",
"algorithm_settings": [
{"name": "n_startup_trials", "value": "7"},
{"name": "n_ei_candidates", "value": "24"},
{"name": "random_state", "value": "42"}
]
}
katib_client.tune(
name=JOB_NAME,
objective=objective,
parameters=parameters,
objective_metric_name="val_binary_accuracy",
algorithm_name = "multivariate-tpe",
algorithm_settings= algorithm_config,
max_trial_count=25,
parallel_trial_count=2,
max_failed_trial_count=15,
resources_per_trial={"cpu": "7", "memory": "21Gi"}
)
I did a little troubleshooting:
- I did some investigating by browsing the SDK code: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/search.py
- I found that the function search.categorical() has a parameter named
list. - I suspect that this could be the first potential problem, as this is a naming collision with the default Python keyword
listin the global scope. - Digging deeper, however, I found that there is a problem in line 65 on https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/search.py. search.categorical() passes the argument for the parameter
listto V1beta1FeasibleSpace without specifying which arg it pertains to. - When I manually reproduce what search.categorical() is doing internally as seen below and print this out, you see the issue becomes obvious ...
- The property
distributionis successfully set to the list which search.categorical() was configured with, but the propertylistis not being getting set, as you see belowV1beta1ParameterSpec.feasible_space.listis set toNone.
I do this:
foo = V1beta1ParameterSpec(
parameter_type="categorical",
feasible_space=V1beta1FeasibleSpace(["relu","gelu", "elu"]),
)
print(type(foo))
print(foo)
This prints out:
<class 'kubeflow.katib.models.v1beta1_parameter_spec.V1beta1ParameterSpec'>
{'feasible_space': {'distribution': ['relu', 'gelu', 'elu'],
'list': None, # <------------------------------------------------------------------------------------<<<< # I think this is the problem
'max': None,
'min': None,
'step': None},
'name': None,
'parameter_type': 'categorical'}
If I repeat this, but I explicitly reference the parameters*: distribution and list when instantiating the V1beta1FeasibleSpace which is used to instantiate V1beta1ParameterSpec, the properties ..distribution and ..list are both properly set:
my_list = ["relu","gelu", "elu"]
foo = V1beta1ParameterSpec(
parameter_type="categorical",
feasible_space=V1beta1FeasibleSpace(distribution = my_list,
list = my_list),
)
print(type(foo))
print(foo)
Now this looks correct:
<class 'kubeflow.katib.models.v1beta1_parameter_spec.V1beta1ParameterSpec'>
{'feasible_space': {'distribution': ['relu', 'gelu', 'elu'],
'list': ['relu', 'gelu', 'elu'],
'max': None,
'min': None,
'step': None},
'name': None,
'parameter_type': 'categorical'}
When I replaced the call search.categorical(['relu', 'gelu', 'elu']]) with the correctly, manually created V1beta1ParameterSpec ('activations': foo), in the parameters and submit the job using CatibClient.tune(), the API accepted and created the job without the error.
Suggested revisions:
- In line 65 on https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/search.py: Reference the parameters
distributionandlistexplicitly OR in V1beta1FeasibleSpace and V1beta1ParameterSpec, make sure that init is setting both self.distribution and self.list. - Rename the parameter 'list', as this is a naming collision with a Python keyword referencing the function list(), which may be causing an empty list to be instantiated and assigned when V1beta1FeasibleSpace.list() sets
self._list = list, which the empty list may be getting cast to a null type in searialization, explaining theNonevalue.
What did you expect to happen?
- I would have expected the parameter
"activation": katib.search.categorical(["relu","gelu", "elu"])to have properly instantiated a V1beta1ParameterSpec object with bothlistanddistributionset to["relu","gelu", "elu"]. - I would have expected the submission as such to have been accepted and Katib to have run the training task.
Environment
Kubernetes version:
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.14", GitCommit:"f678fbc2f5af87ec6552b7ac6ebbe26258b0d120", GitTreeState:"clean", BuildDate:"2024-05-14T10:53:32Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"linux/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"29", GitVersion:"v1.29.10", GitCommit:"f0c1ea863533246b6d3fda3e6addb7c13c8a6359", GitTreeState:"clean", BuildDate:"2024-10-23T15:29:12Z", GoVersion:"go1.22.8", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.29) exceeds the supported minor version skew of +/-1
Katib controller version:
# Managed instance. I don't have access to this. I am on the Azure deployment of Charmed Kubeflow 1.9
Katib Python SDK version:
Version: 0.18.0
Impacted by this bug?
Give it a 👍 We prioritize the issues with most 👍
@juliusvonkohout FYI, confirmed bug.
FYI only @StefanoFioravanzo
Then lets tag the maintainers @Electronic-Waste @tenzen-y @andreyvelich
Thank you @juliusvonkohout
Thank you to report this @david-thrower! @mahdikhashan @helenxie-bit @Electronic-Waste please could you help us taking a look at this bug ?
/area sdk
One related comment is that a more global fix that may be worth considering is replacing https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/models/* with Pydantic models representing the same schemas. It could remove a lot of boilerplate code, like "getters and setters", while still enforcing the schema, correct typing, and proper serialization. We would probably just have to create a subclass of float that enforces the range of a Double in Go, for example, to ensure that doubles serialize identically, otherwise, it may be fairly straightforward. There could be other benefits too like integration with SQLalchemy ORM.
With the newest LLMs like LLAMA 4 Maveric which has a 10 million token context window and Gemini Pro, it may be practical to refactor it in bulk by sending entire modules through for AI refactoring, plus, for context sending the entire codebase and the docs for Pydantic through the LLM with it, ... so it may not be quite as much work as it sounds like ...
Thank you to report this @david-thrower! @mahdikhashan @helenxie-bit @Electronic-Waste please could you help us taking a look at this bug ?
thanks for mentioning me here, i'll look into it by tomorrow.
with Pydantic models representing the same schemas.
Yes, we should do that, similar to Kubeflow Trainer: https://github.com/kubeflow/trainer/blob/master/sdk/kubeflow/trainer/models/trainer_v1alpha1_trainer.py#L20
We should migrate the SDK to use the newest version of OpenAPI generator: https://github.com/kubeflow/trainer/blob/master/hack/python-sdk/gen-sdk.sh#L34
/assign
@mahdikhashan Thanks for taking this! Feel free to ask any questions if needed.
/remove-label lifecycle/needs-triage
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.