Adding name field to non-custom predictor incorrectly triggers custom container logic
/kind bug
What steps did you take and what happened:
After adding a `name` field inside a standard predictor block (e.g., `sklearn`, `tensorflow`), the system allowed me to create the InferenceService, but the deployment failed with an error indicating that the container `kserve-container` cannot be found.
What did you expect to happen:
If the predictor is a non-custom one, its container name should always be set to the default, "kserve-container", by the internal defaulting logic, and should not be user-configurable. If a user explicitly sets a `name` field under a standard predictor block, an error message should stop them at creation time, instead of letting them proceed to a failure during revision deployment.
What's the InferenceService yaml:
To be more specific, here is a minimal example:
```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris-example"
spec:
  predictor:
    minReplicas: 1
    sklearn:
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      protocolVersion: v1
      name: "sklearn-iris-predictor"
      resources:
        limits:
          cpu: 100m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 512Mi
```
This leads to the following failure (shown by `kubectl get isvc $name -n $namespace -o yaml`):
```yaml
message: 'Revision "sklearn-iris-example-predictor-00001" failed with message:
  admission webhook "inferenceservice.kserve-webhook-server.pod-mutator" denied
  the request: Invalid configuration: cannot find container: kserve-container.'
reason: RevisionFailed
severity: Info
status: "False"
type: PredictorConfigurationReady
```
When the `name:` field under `sklearn` is removed, the deployment works as expected.
Root cause analysis:
Tracing through the code suggests that specifying `name:` under a standard predictor (which is supposed to be non-customized) results in the container name handling being unintentionally overridden. This likely causes both the standard and custom predictor logic paths to be triggered simultaneously.
Normally, a standard predictor won't and shouldn't specify a `name:` field. The default function will automatically assign it like this:
```go
k.Container.Name = constants.InferenceServiceContainerName
```
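For context, that assignment lives inside the framework spec's defaulting method. Here is a simplified sketch of what that method looks like, modeled on the `SKLearnSpec` defaulting pattern (not a verbatim copy of the KServe source):

```go
// Simplified sketch of a standard predictor's defaulting logic,
// modeled on SKLearnSpec (not a verbatim copy of the KServe source).
func (k *SKLearnSpec) Default(config *InferenceServicesConfig) {
	// The container name is unconditionally forced to the default,
	// so a user-supplied name is assumed to be absent here.
	k.Container.Name = constants.InferenceServiceContainerName
	setResourceRequirementDefaults(config, &k.Container.Resources)
}
```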
And if it is a custom predictor, which doesn't require a predictor type keyword (the spec inlines `PodSpec` via `json:",inline"`), the default function will do this:
```go
// Default sets defaults on the resource
func (c *CustomPredictor) Default(config *InferenceServicesConfig) {
	if len(c.Containers) == 0 {
		c.Containers = append(c.Containers, corev1.Container{})
	}
	if len(c.Containers) == 1 || len(c.Containers[0].Name) == 0 {
		c.Containers[0].Name = constants.InferenceServiceContainerName
	}
	setResourceRequirementDefaults(config, &c.Containers[0].Resources)
}
```
An example of a custom predictor is here. Custom predictors allow users to set their own container name, which defaults to kserve-container otherwise.
But when a `name:` field is specified in a standard predictor, it seems to activate not only the standard defaulting logic but also, accidentally, part of the custom predictor path. This overwrites the container name, which produces the error when kserve-container can't be found.
Also, because custom predictors do not require a predictor type keyword (like `sklearn:` or `triton:`), the mutual-exclusivity check is not triggered either, which allows both paths to co-exist unexpectedly.
(Maybe ensuring that the mutual-exclusivity logic correctly differentiates between custom and standard predictor paths would be a good starting point for solving this issue?)
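To make that idea concrete, here is a rough sketch of what such an exclusivity check could look like; the helper name `validateStandardCustomExclusivity` and its structure are illustrative assumptions on my part, not existing KServe code:

```go
// Rough sketch of a mutual-exclusivity check between the standard
// predictor path and the custom (inline PodSpec) path. The helper
// name and structure are assumptions, not existing KServe code.
func validateStandardCustomExclusivity(predictor *PredictorSpec) error {
	hasStandard := predictor.SKLearn != nil ||
		predictor.XGBoost != nil ||
		predictor.Tensorflow != nil ||
		predictor.PyTorch != nil ||
		predictor.Triton != nil ||
		predictor.ONNX != nil ||
		predictor.HuggingFace != nil ||
		predictor.PMML != nil ||
		predictor.LightGBM != nil ||
		predictor.Paddle != nil ||
		predictor.Model != nil

	// Custom predictors are expressed through the inline PodSpec's
	// Containers list; a standard predictor should never populate it.
	if hasStandard && len(predictor.Containers) > 0 {
		return fmt.Errorf("a standard predictor and custom containers are mutually exclusive")
	}
	return nil
}
```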
Environment:
- Istio Version: 1.23.2
- Knative Version: v1.15.2
- KServe Version: v0.15.0
- Kubeflow version: No
- Cloud Environment:[k8s_istio/istio_dex/gcp_basic_auth/gcp_iap/aws/aws_cognito/ibm]
- Minikube/Kind version: kind v0.28.0
- Kubernetes version (use `kubectl version`): v1.32.3
- OS (e.g. from `/etc/os-release`): Ubuntu 24.04
Additional Context
I'm trying to fix this issue and would really appreciate any design or implementation suggestions!
Since my goal for this issue is to prevent user misconfiguration at creation time, I've been exploring a potential solution that validates the `name` field for standard predictors earlier in the workflow, instead of letting it silently pass and fail later during deployment. This is my current idea:
In `kserve/pkg/apis/serving/v1beta1/inference_service_validation.go`, insert the following validation logic:
```go
if err := validatePredictor(isvc); err != nil {
	return allWarnings, err
}
```
and an extra function to perform this check:
```go
// validatePredictor rejects a user-specified container name on
// standard (non-custom) predictor implementations.
func validatePredictor(isvc *InferenceService) error {
	predictor := isvc.Spec.Predictor
	// Prevent the 'name' field in standard predictor types
	switch {
	case predictor.SKLearn != nil && predictor.SKLearn.Name != "":
		return fmt.Errorf("the 'name' field is not allowed in standard predictor 'sklearn'")
	case predictor.XGBoost != nil && predictor.XGBoost.Name != "":
		return fmt.Errorf("the 'name' field is not allowed in standard predictor 'xgboost'")
	case predictor.Tensorflow != nil && predictor.Tensorflow.Name != "":
		return fmt.Errorf("the 'name' field is not allowed in standard predictor 'tensorflow'")
	case predictor.PyTorch != nil && predictor.PyTorch.Name != "":
		return fmt.Errorf("the 'name' field is not allowed in standard predictor 'pytorch'")
	case predictor.Triton != nil && predictor.Triton.Name != "":
		return fmt.Errorf("the 'name' field is not allowed in standard predictor 'triton'")
	case predictor.ONNX != nil && predictor.ONNX.Name != "":
		return fmt.Errorf("the 'name' field is not allowed in standard predictor 'onnx'")
	case predictor.HuggingFace != nil && predictor.HuggingFace.Name != "":
		return fmt.Errorf("the 'name' field is not allowed in standard predictor 'huggingface'")
	case predictor.PMML != nil && predictor.PMML.Name != "":
		return fmt.Errorf("the 'name' field is not allowed in standard predictor 'pmml'")
	case predictor.LightGBM != nil && predictor.LightGBM.Name != "":
		return fmt.Errorf("the 'name' field is not allowed in standard predictor 'lightgbm'")
	case predictor.Paddle != nil && predictor.Paddle.Name != "":
		return fmt.Errorf("the 'name' field is not allowed in standard predictor 'paddle'")
	case predictor.Model != nil && predictor.Model.Name != "":
		return fmt.Errorf("the 'name' field is not allowed in standard predictor 'model'")
	case predictor.WorkerSpec != nil && predictor.WorkerSpec.Name != "":
		return fmt.Errorf("the 'name' field is not allowed in 'workerSpec'")
	}
	return nil
}
```
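A small unit test could lock in the new behavior. Here is a minimal sketch, assuming the test lives in package `v1beta1` alongside the validation code (the test name is mine):

```go
// Minimal sketch of a unit test for validatePredictor; assumes it
// lives in package v1beta1 alongside the validation code.
func TestValidatePredictorRejectsNameOnStandardPredictor(t *testing.T) {
	isvc := &InferenceService{}
	isvc.Spec.Predictor.SKLearn = &SKLearnSpec{}
	// Name is promoted from the embedded corev1.Container.
	isvc.Spec.Predictor.SKLearn.Name = "sklearn-iris-predictor"

	if err := validatePredictor(isvc); err == nil {
		t.Fatal("expected an error when 'name' is set on a standard predictor, got nil")
	}

	// Without a user-set name, the same spec should pass validation.
	isvc.Spec.Predictor.SKLearn.Name = ""
	if err := validatePredictor(isvc); err != nil {
		t.Fatalf("expected no error for an unnamed standard predictor, got: %v", err)
	}
}
```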
This approach should catch the misconfiguration early, during validation, and prevent the confusing errors that currently surface only at deployment time.
If anyone has alternative suggestions, concerns, or design input, I’d love to hear your thoughts. Any advice is more than welcome!
Well, I think it is caused by the `-predictor` suffix in the name, as it is automatically added. Did you try it?
@sivanantha321 fyi
Yes, I've tested both with and without the `-predictor` suffix, and the issue still persists. It looks like the root cause is setting a `name` under a standard predictor, which unintentionally activates the custom predictor path.
So I think we still need to add validation at creation time to prevent users from setting `name` under standard predictors.
Thank you! I appreciate your insight on this! (Sorry for the delay in replying, it's been a little busy lately 🙏🏼)