training-operator icon indicating copy to clipboard operation
training-operator copied to clipboard

Introduce the CustomResourceValidationExpressions feature (CEL validation)

Open tenzen-y opened this issue 2 years ago • 23 comments

/kind feature

Since k8s v1.25, the CustomResourceValidationExpressions feature to validate CRDs using Common Expression Language (CEL) without webhook servers is enabled by default.

Since this feature helps to find CRD validation errors for users, we need to introduce that feature once we stop supporting K8s v1.24.

For example, if the container name of replicaSpec is invalid, for now, we only output that error to controller logs; once we introduce CEL validation, we can return that error to end-users.

https://github.com/kubeflow/training-operator/blob/69813fbc37f33857e59044d14caa7e2765851d98/pkg/apis/kubeflow.org/v1/tensorflow_validation.go#L69-L74

tenzen-y avatar Dec 23 '22 05:12 tenzen-y

/assign

tenzen-y avatar May 20 '23 19:05 tenzen-y

Through my investigation, I found that we can not replace all validations with CEL validation due to exceeding cost budget.

For example, we can replace the below validation

https://github.com/kubeflow/training-operator/blob/4dd0d0909210edc7e1a31a5987cb32dc834d682b/pkg/apis/kubeflow.org/v1/pytorch_validation.go#L69-L72

with the below CEL validation:

// ReplicaSpec is a description of the replica
type ReplicaSpec struct {
...
	// Template is the object that describes the pod that
	// will be created for this replica. RestartPolicy in PodTemplateSpec
	// will be overide by RestartPolicy in ReplicaSpec
	// +kubebuilder:validation:XValidation:rule="has(self.spec.containers) && self.spec.containers.all(c, has(c.image) && size(c.image) > 0)",message=""
	Template v1.PodTemplateSpec `json:"template,omitempty"`
...

Forbidden: contributed to estimated rule cost total exceeding cost limit for entire OpenAPIv3 schema, spec.validation.openAPIV3Schema: Forbidden: x-kubernetes-validations estimated rule cost total for entire OpenAPIv3 schema exceeds budget by factor of more than 100x (try simplifying the rule, or adding maxItems, maxProperties, and maxLength where arrays, maps, and strings are declared)]

So, I think that we need to introduce the webhook validation instead of replacing the current validation with CEL validation.

@kubeflow/wg-training-leads WDYT?

tenzen-y avatar Aug 02 '23 09:08 tenzen-y

I remember, we had a discussion regarding webhooks. The arguments were deployment easiness(without web hooks) vs clean validation(with webhooks)

johnugeorge avatar Aug 03 '23 19:08 johnugeorge

The arguments were deployment easiness

@johnugeorge Are they concerned about certs for the webhook?

tenzen-y avatar Aug 03 '23 19:08 tenzen-y

Yes. That was the major raod block from what I remember

johnugeorge avatar Aug 04 '23 08:08 johnugeorge

@johnugeorge I think we can remove webhook installation barriers once we introduce cert generation logic similar to katib. WDYT?

Actually, I have experience implementing internal cert generation logic in 2 components (kubeflow/katib, kubernetes-sigs/kueue).

tenzen-y avatar Aug 04 '23 10:08 tenzen-y

Maybe, we generalize the katib cert generator, and then just import the cert-generator to the training-operator.

https://github.com/kubeflow/katib/blob/eac65bf8e31fe4d3a2c4dbb20be30d600c086157/cmd/katib-controller/v1beta1/main.go#L137-L145

https://github.com/kubeflow/katib/tree/eac65bf8e31fe4d3a2c4dbb20be30d600c086157/pkg/certgenerator/v1beta1

cc: @andreyvelich

tenzen-y avatar Aug 04 '23 10:08 tenzen-y

@tenzen-y Shall we move this to the next release?

johnugeorge avatar Aug 04 '23 10:08 johnugeorge

@tenzen-y Shall we move this to the next release?

Which does that mean training-operator v1.7 or v1.8?

tenzen-y avatar Aug 04 '23 10:08 tenzen-y

We are cutting first 1.7 RC in few days. I am afraid, we cannot complete testing within time if we want to aim this feature in 1.7. What are you thoughts on moving to 1.8?

johnugeorge avatar Aug 04 '23 11:08 johnugeorge

We are cutting first 1.7 RC in few days. I am afraid, we cannot complete testing within time if we want to aim this feature in 1.7. What are you thoughts on moving to 1.8?

It makes sense. We can work on this feature for v1.8. Actually, I don't have enough bandwidth for this feature.

tenzen-y avatar Aug 04 '23 11:08 tenzen-y

Also, we can create another issue for the webhook since this issue aims CEL validation.

tenzen-y avatar Aug 04 '23 11:08 tenzen-y

Thank you for creating this @tenzen-y. I agree that we should postpone the discussion around this. We need to identify pros and cons of using CEL vs Webhook Validation. As Johnu mentioned before, adding Webhook as a mandatory component for Training Operator will complicate end-user usage.

andreyvelich avatar Aug 04 '23 12:08 andreyvelich

Anyway, I will create an issue to discuss webhook since CEL validation isn't enough due to exceeding the cost budge.

https://github.com/kubeflow/training-operator/issues/1708#issuecomment-1661876525

tenzen-y avatar Aug 04 '23 15:08 tenzen-y

/remove-area 1.7.0

tenzen-y avatar Sep 14 '23 06:09 tenzen-y

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Dec 13 '23 10:12 github-actions[bot]

/remove-lifecycle stale

tenzen-y avatar Dec 15 '23 19:12 tenzen-y

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Mar 14 '24 20:03 github-actions[bot]

/remove-lifecycle stale

tenzen-y avatar Mar 14 '24 20:03 tenzen-y

Moved to 1.9 release

johnugeorge avatar Apr 15 '24 14:04 johnugeorge

Moved to 1.9 release

I agreed with this decision in the offline meeting with @johnugeorge.

tenzen-y avatar Apr 15 '24 18:04 tenzen-y

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jul 14 '24 20:07 github-actions[bot]

/remove-lifecycle stale

tenzen-y avatar Jul 14 '24 22:07 tenzen-y