[Kubeflow 1.10] Distributions and Kubeflow
This issue will be used to track the progress of distributions and coordinate with them throughout the 1.10 release.
While we hope all distros will manage to be ready when the KF 1.10 release is out, this is sometimes difficult to achieve. In this issue, we want to both keep track of the progress of distributions towards the KF 1.10 release and also know which of the distros will be working on KF 1.10 (testing during the distribution testing cycle) even if they can't meet the KF 1.10 deadline.
Tagging distribution owners identified from previous releases (Any new or missed distro owners, please comment on this issue)
| Distribution | Representative(s) | State |
|---|---|---|
| AWS | @surajkota | |
| Charmed Kubeflow | @DnPlas @mvlassis | |
| Google Cloud | @zijianjoy @chensun | |
| IBM IKS | @yhwang | |
| Microsoft | | |
| Nutanix | @johnugeorge @saileshd1402 @nagar-ajay | Will participate in 1.10 |
| Red Hat OpenShift AI | @rimolive | Will participate in 1.10 |
| Oracle Cloud Infrastructure | @julioo | |
| DeployKF | @thesuperzapper | Will participate in 1.10 |
| VMWare | @liuqi @xujinheng | |
| QBO | @alexeadem | Will participate in 1.10 |
Please let us know if you'll be participating in the 1.10 release by answering the following questions:
- Are you planning on having your distro ready in sync with the KF 1.10 release?
- Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)?
- If you cannot participate, when can the community expect your distro to be ready for release 1.10?
Please note the release timelines are being discussed in https://github.com/kubeflow/community/pull/761.
cc @tarilabs @juliusvonkohout @varodrig @diegolovison @tombuuz @dpoulopoulos @saileshd1402 @mvlassis @tarekabouzeid @hbelmiro @milosjava @jbottum
- Are you planning on having your distro ready in sync with the KF 1.10 release? yes
- Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)? yes
- If you cannot participate, when can the community expect your distro to be ready for release 1.10? n/a
Hi @rimolive, Could you please add @nagar-ajay and me as Nutanix distribution owners alongside @johnugeorge?
Answers to the distribution participation questions:
- Are you planning on having your distro ready in sync with the KF 1.10 release? Yes
- Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)? Yes
- If you cannot participate, when can the community expect your distro to be ready for release 1.10? N/A
As the current distribution owner for Red Hat OpenShift AI, I will add the answer to the questions:
- Are you planning on having your distro ready in sync with the KF 1.10 release? Yes
- Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)? Yes
- If you cannot participate, when can the community expect your distro to be ready for release 1.10? N/A
- deployKF plans to release a GA version that includes the 1.10 versions within a reasonable timeframe of the manifest release.
  - There may also be a deployKF RC version released before the final 1.10.0 is cut, depending on how stable everything is.
- As usual, I will also give feedback on the manifests RCs.
- See above
Regarding the Charmed Kubeflow distribution:
- Are you planning on having your distro ready in sync with the KF 1.10 release?
- Yes
- Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)?
- Yes
- If you cannot participate, when can the community expect your distro to be ready for release 1.10?
- N/A
Also, if it's possible, keep only myself and not @DnPlas as a point of contact, since I communicate everything with the team :)
/assign @rimolive
@rimolive and @jbottum to follow up on this. Let's follow up next week.
@rimolive and @jbottum to follow up on this. Let's sync up this week and feel free to add any comments here.
Update from Ricardo at the release meeting: progressing.
No updates so far from @rimolive
@rimolive any news on this?
@rimolive @jbottum I'm following up on the distributions - any news on this?
Calling all Distribution owners! We are planning to release rc.2 next Monday March 3rd, and we'll officially begin the distribution testing. One concern raised is that our schedule to release GA is March 31st, and the deadline for distribution testing is very tight.
We'd like to gather more feedback about this concern from the other distributions so we can plan a new release date if needed. I'd really appreciate any feedback so we can decide whether to keep the original schedule or delay the release date.
You can test on the 1.10 branch and https://github.com/kubeflow/manifests/milestone/1 is the milestone with current issues. https://github.com/kubeflow/pipelines/pull/11669 is also quite relevant.
> Calling all Distribution owners! We are planning to release rc.2 next Monday March 3rd, and we'll officially begin the distribution testing. One concern raised is that our schedule to release GA is March 31st, and the deadline for distribution testing is very tight.
>
> We'd like to gather more feedback about this concern from the other distributions so we can plan a new release date if needed. I'd really appreciate any feedback so we can decide whether to keep the original schedule or delay the release date.
I'm OK with your timing. As long as we don't run into issues, testing should be done within those timelines.
Calling all Distribution owners. With rc.2 released last week, we are good to go with distribution testing. We need your feedback on whether testing is running fine, and we need it as soon as possible. Distribution owners who have not yet confirmed participation in distribution testing, please let me know if you can run the tests.
cc @mvlassis @zijianjoy @chensun @yhwang @johnugeorge @saileshd1402 @nagar-ajay @julioo @thesuperzapper @liuqi @xujinheng @alexeadem
https://github.com/kubeflow/manifests/tree/v1.10-branch is the branch to test, because it will always be ahead of the RCs.
Hi @rimolive, thank you for reaching out and keeping us in the loop!
On our side (Charmed Kubeflow distribution), because of the delays in the RC release for some of the components (e.g. Notebooks, Katib), we are still currently wrapping up the updates across all distribution artifacts to align them with the latest RC versions. We expect this to be done by the end of this week, such that we can start testing the full bundle on Monday, March 17th, across all our use cases and product integrations to investigate whether regressions are present.
Currently, we are 1 week behind schedule, so delaying the release by an additional week would be beneficial and very much appreciated. This extra time would allow us to test the full deployment thoroughly and flag any issues to you, based on both integration testing and the extensive testing done by our Solution QA team.
Providing an update for Charmed Kubeflow:
The QA team tested the bundle and found no issues. This means that we're going to release a beta version of the distribution later today.
Hi, sorry for the delay. I've tested QBO with Kubeflow version v1.10.0-rc.2. Everything works as expected now, but I had to make a few changes:
DEX/JWT
Clearing site data or opening an incognito window was necessary to get past this error:
Jwks doesn't have key to match kid or alg from Jwt
RBAC
cat istio-ingressgateway-sds-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: istio-ingressgateway-sds
  namespace: istio-system
subjects:
- kind: ServiceAccount
  name: istio-ingressgateway-service-account
  namespace: istio-system
roleRef:
  kind: Role
  name: istio-ingressgateway-sds
  apiGroup: rbac.authorization.k8s.io
and
cat istio-ingressgateway-sds.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: istio-ingressgateway-sds
  namespace: istio-system
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "watch", "list"]
were necessary to access the Kubeflow Web UI
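For anyone who wants to confirm that the Role and RoleBinding above took effect, here is a minimal sketch using `kubectl auth can-i` with impersonation. The helper names `sa_subject` and `check_sds_access` are made up for illustration, and the check assumes your own user is allowed to impersonate service accounts (cluster admins usually are). This is not part of the manifests, only one way to verify the permissions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the user name Kubernetes uses for a service
# account, in the form system:serviceaccount:<namespace>:<name>.
sa_subject() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# Hypothetical helper: ask the API server whether the ingress gateway service
# account may perform each verb granted by the Role above on secrets.
# Requires a reachable cluster and impersonation rights for the caller.
check_sds_access() {
  local subject verb
  subject=$(sa_subject istio-system istio-ingressgateway-service-account)
  for verb in get watch list; do
    printf '%s secrets: ' "$verb"
    kubectl auth can-i "$verb" secrets --as="$subject" -n istio-system
  done
}

# Example usage (not run here):
#   check_sds_access
```

Each line should end in `yes` once the binding is applied; a `no` would suggest the Role or RoleBinding is missing or is in a different namespace.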
kustomize
The options --server-side --force-conflicts are necessary, or I'll get the following errors when running this command. I see you added them here as well:
https://github.com/kubeflow/manifests/blob/0016e6b8c24c4ee34342c76b7a738ade5e494682/README.md?plain=1#L159
Error from server (Invalid): error when creating "STDIN": CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: metadata.annotations: Too long: may not be more than 262144 bytes
Error from server (Invalid): error when creating "STDIN": CustomResourceDefinition.apiextensions.k8s.io "paddlejobs.kubeflow.org" is invalid: metadata.annotations: Too long: may not be more than 262144 bytes
Error from server (Invalid): error when creating "STDIN": CustomResourceDefinition.apiextensions.k8s.io "pytorchjobs.kubeflow.org" is invalid: metadata.annotations: Too long: may not be more than 262144 bytes
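Some background on why these errors show up: client-side `kubectl apply` stores the whole applied object in the `kubectl.kubernetes.io/last-applied-configuration` annotation, and the API server limits the total size of annotations to 262144 bytes, so very large CRDs cannot be applied that way. Server-side apply does not write that annotation. Below is a rough sketch of a pre-check. The helper name `needs_server_side_apply` and the file name `crd.yaml` are assumptions for illustration, and comparing the file size to the limit is only a heuristic, since the annotation holds a JSON rendering whose size differs from the YAML file.

```shell
#!/usr/bin/env bash
# Total annotation size limit quoted in the errors above (256 KiB).
ANNOTATION_LIMIT=262144

# Hypothetical helper: exit 0 when a manifest file is at least as large as the
# annotation limit, which suggests client-side apply is likely to fail.
# Heuristic only: the stored annotation is JSON, not the YAML on disk.
needs_server_side_apply() {
  local file="$1"
  local size
  # Arithmetic expansion strips any padding that wc adds on some platforms.
  size=$(( $(wc -c < "$file") ))
  [ "$size" -ge "$ANNOTATION_LIMIT" ]
}

# Example usage (assumes a rendered CRD saved as crd.yaml; not run here):
#   if needs_server_side_apply crd.yaml; then
#     kubectl apply --server-side --force-conflicts -f crd.yaml
#   else
#     kubectl apply -f crd.yaml
#   fi
```

In practice it is simpler to pass `--server-side --force-conflicts` for the whole build, as mentioned above; the check is only meant to show where the 262144 figure comes from.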
After those changes, Kubeflow is working as expected with the NVIDIA GPU Operator and the following components:
NAME CHART VERSION APP VERSION DESCRIPTION
nvidia/gpu-operator v24.9.2 v24.9.2 NVIDIA GPU Operator creates/configures/manages ...
NVIDIA-SMI 570.124.06
kubectl version
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.32.3
kubectl get pods --all-namespaces -o jsonpath="{..image}" | sed 's/ /\n/g' | sort | uniq
docker.io/istio/pilot:1.24.2
docker.io/istio/proxyv2:1.24.2
docker.io/kindest/kindnetd:v20220726-ed811e41
docker.io/kindest/local-path-provisioner:v0.0.22-kind.0
docker.io/kserve/kserve-controller:v0.14.1
docker.io/kserve/kserve-localmodel-controller:v0.14.1
docker.io/kserve/models-web-app:v0.14.0-rc.0
docker.io/kubeflow/training-operator:v1-5170a36
docker.io/kubeflowkatib/katib-controller:v0.18.0-rc.0
docker.io/kubeflowkatib/katib-db-manager:v0.18.0-rc.0
docker.io/kubeflowkatib/katib-ui:v0.18.0-rc.0
docker.io/kubeflownotebookswg/centraldashboard:v1.10.0-rc.1
docker.io/kubeflownotebookswg/jupyter-scipy:v1.10.0-rc.1
docker.io/kubeflownotebookswg/jupyter-web-app:v1.10.0-rc.1
docker.io/kubeflownotebookswg/kfam:v1.10.0-rc.1
docker.io/kubeflownotebookswg/notebook-controller:v1.10.0-rc.1
docker.io/kubeflownotebookswg/poddefaults-webhook:v1.10.0-rc.1
docker.io/kubeflownotebookswg/profile-controller:v1.10.0-rc.1
docker.io/kubeflownotebookswg/pvcviewer-controller:v1.10.0-rc.1
docker.io/kubeflownotebookswg/tensorboard-controller:v1.10.0-rc.1
docker.io/kubeflownotebookswg/tensorboards-web-app:v1.10.0-rc.1
docker.io/kubeflownotebookswg/volumes-web-app:v1.10.0-rc.1
docker.io/library/mysql:8.0.29
docker.io/library/python:3.9
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:e70bc675f97778da144157f125b3001124ba7a5903b85dab9e77776352fea1c7
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:7d76a6d42d139ed53aae3ca2dfd600b1c776eb85a17af64dd1b604176a4b132a
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:cc39d40985f7b37ba384a857d194a24ac5eae7e204aac4ed9bf4ebfd8d62e721
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:59c2e7ad52cea17bedfc2aca9b9e33060bb34f04d35fd71fe61147bcbdb881e4
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:0e47362d044f8eac84595ed0a9fdf22e5dd5a07cc7a5df74e93eb5ad17ad4827
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:d42e2f83c9018779465860fdc67ce6ada3eac8ba8c47c5c2127c0bb45f9b328a
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/workflow-controller:v3.4.17-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.14.0
ghcr.io/dexidp/dex:v2.41.1
ghcr.io/kubeflow/kfp-api-server:2.4.1
ghcr.io/kubeflow/kfp-cache-deployer:2.4.1
ghcr.io/kubeflow/kfp-cache-server:2.4.1
ghcr.io/kubeflow/kfp-frontend:2.4.1
ghcr.io/kubeflow/kfp-metadata-envoy:2.4.1
ghcr.io/kubeflow/kfp-metadata-writer:2.4.1
ghcr.io/kubeflow/kfp-persistence-agent:2.4.1
ghcr.io/kubeflow/kfp-scheduled-workflow-controller:2.4.1
ghcr.io/kubeflow/kfp-viewer-crd-controller:2.4.1
ghcr.io/kubeflow/kfp-visualization-server:2.4.1
ghcr.io/metacontroller/metacontroller:v4.11.22
kserve/kserve-controller:v0.14.1
kserve/kserve-localmodel-controller:v0.14.1
kserve/models-web-app:v0.14.0-rc.0
kubeflow/training-operator:v1-5170a36
kubeflownotebookswg/jupyter-scipy:v1.10.0-rc.1
mysql:8.0.29
nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.9.2
nvcr.io/nvidia/gpu-operator:v24.9.2
nvcr.io/nvidia/k8s-device-plugin:v0.17.0
nvcr.io/nvidia/k8s/container-toolkit:v1.17.4-ubuntu20.04
nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
python:3.9
quay.io/brancz/kube-rbac-proxy:v0.13.1
quay.io/brancz/kube-rbac-proxy:v0.18.0
quay.io/brancz/kube-rbac-proxy:v0.8.0
quay.io/jetstack/cert-manager-cainjector:v1.16.1
quay.io/jetstack/cert-manager-controller:v1.16.1
quay.io/jetstack/cert-manager-webhook:v1.16.1
quay.io/oauth2-proxy/oauth2-proxy:v7.7.1
registry.k8s.io/coredns/coredns:v1.11.3
registry.k8s.io/etcd:3.5.16-0
registry.k8s.io/kube-apiserver-amd64:v1.32.3
registry.k8s.io/kube-apiserver:v1.32.3
registry.k8s.io/kube-controller-manager-amd64:v1.32.3
registry.k8s.io/kube-controller-manager:v1.32.3
registry.k8s.io/kube-proxy-amd64:v1.32.3
registry.k8s.io/kube-proxy:v1.32.3
registry.k8s.io/kube-scheduler-amd64:v1.32.3
registry.k8s.io/kube-scheduler:v1.32.3
registry.k8s.io/nfd/node-feature-discovery:v0.16.6
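A note on the listing above: the `{..image}` jsonpath matches every field named `image`, including the ones under `status.containerStatuses`, which may be why some images appear both with and without the `docker.io/` prefix. Also, `sed 's/ /\n/g'` relies on GNU sed. Here is a small portable sketch for de-duplicating the list; the helper name `unique_images` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read whitespace-separated image names on stdin and
# print each distinct name once, sorted.
unique_images() {
  tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u
}

# Example usage against a live cluster (not run here). This jsonpath reads
# only .spec.containers, so init containers and status fields are skipped:
#   kubectl get pods --all-namespaces \
#     -o jsonpath='{.items[*].spec.containers[*].image}' | unique_images
```

Reading only the spec gives the image references as written in the manifests, which is usually what matters when comparing against the expected RC versions.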