common
common copied to clipboard
PR 135 makes replicaType updates hard
In this PR, https://github.com/kubeflow/common/pull/135/files, it changes rtype to ReplicaType.
However, it brings some challenges in operator upgrade.
- expectationPodsKey := expectation.GenExpectationPodsKey(jobKey, rtype)
+ expectationPodsKey := expectation.GenExpectationPodsKey(jobKey, apiv1.ReplicaType(rtype))
Using this line as an example, controller retrieve rtype from pod label. It's lower case at this moment.
https://github.com/kubeflow/tf-operator/blob/ff5aaf10bd8a665cbce72dd2cd1d2c0cd06f329e/pkg/apis/tensorflow/v1/types.go#L77
https://github.com/kubeflow/tf-operator/blob/ff5aaf10bd8a665cbce72dd2cd1d2c0cd06f329e/pkg/apis/pytorch/v1/pytorchjob_types.go#L62
apiv1.ReplicaType(rtype) is inaccurate and meaningless because we define Upper case role like Master in the past,
and this conversion give us a non-exist ReplicaType because it's lower case master.
we have some references like below.
https://github.com/kubeflow/tf-operator/blob/ff5aaf10bd8a665cbce72dd2cd1d2c0cd06f329e/pkg/controller.v1/pytorch/pytorch.go#L46
I feel like all the role should be lower case, we make sure we don't use ReplicaType for comparison. Instead, still use strings.toLower(string(rtype)) which means PR 135 still not helpful?
/cc @MartinForReal @gaocegege @kubeflow/wg-training-leads
This indeed feel like unnecessary and complicate the implementations. Kubeflow/common should aim to be easy for downstream operators to implement.
Maybe more helper functions would resolve comparison issue? I think we can add some type name convention by defining ReplicaType type and add more util functions.
Maybe more helper functions would resolve comparison issue? I think we can add some type name convention by defining ReplicaType type and add more util functions.
@MartinForReal
Yeah, I add something like https://github.com/kubeflow/tf-operator/pull/1388/files#diff-7600abe3663b532d64fec9dae52fcb3b972fb034c53a45fe0030fc61776d1612R56
Did you use kubeflow/common in operator other than kubeflow training operators?