Distributions' readiness for KF 1.5
Previous issue: https://github.com/kubeflow/manifests/issues/2038
Distribution testing phase - handbook
The goal of this issue is to track the progress of distributions alongside the 1.5 release and to coordinate our communications. The first goal is to surface any issues we bump into here, so that all distros can keep an eye on problems as they arise.
While we hope all distros will manage to be ready when the KF 1.5 release is out, this is sometimes impossible to achieve. In this issue we want to track the progress of distributions towards the KF 1.5 release, and also record which distros will keep working on KF 1.5 even if they can't meet the deadline.
Without further ado, here's the list of distros we have in mind:
| Distribution | Representatives | State |
| --- | --- | --- |
| Arrikto EKF | @kimwnasptd | :x: |
| Arrikto MiniKF | @kimwnasptd | :x: |
| Azure | :question: | :x: |
| AWS | @surajkota | :x: |
| Charmed Kubeflow | @DomFleischmann | :x: |
| Google Cloud | @zijianjoy | ✅ |
| IBM | @yhwang | ✅ |
| Nutanix | @johnugeorge | ✅ |
| Kubeflow with Argo CD | @DavidSpek | :x: |
| Openshift | @nakfour @LaVLaS | :x: |
So let's use this issue to share our status while testing the KF 1.5 release, and also to give users a heads-up about the progress of distros with KF 1.5.
We urge everyone to start their testing from the latest v1.5.0-rc.1 manifests tag. If anyone bumps into a problem, please open an issue and add a comment here as well, so that we can all be in sync.
Regarding Arrikto's plans for the KF 1.5 release, we are also targeting to have our products ready by the deadline. But even if we don't manage, we will still be testing over the following weeks and reporting bugs.
And also cc @kubeflow/release-team
Hello @kimwnasptd, which model-web-app does the central dashboard integrate with? There are KFServing and KServe ones. I am curious how to configure between these two web apps in the central dashboard.
Created a tracking issue for AWS distribution work - https://github.com/awslabs/kubeflow-manifests/issues/91
We are targeting to have generic/vanilla Kubeflow, i.e. as-is from this repository, working on EKS as part of the distribution testing phase. Other features and the release will follow.
Hello Kimonas, I would like to provide an update which requires changes to manifests as we are validating the Google Cloud distribution.
- Update KFP to v1.8.1-rc.0: https://github.com/kubeflow/pipelines/releases/tag/1.8.1-rc.0. This includes only fixes and no new features.
- I encountered issues when deploying a KFServing endpoint using the mnist sample. I resolved the issue by running the following commands:

```shell
kubectl patch mutatingwebhookconfiguration inferenceservice.serving.kubeflow.org --patch '{"webhooks":[{"name": "inferenceservice.kfserving-webhook-server.v1beta1.defaulter","objectSelector":{"matchExpressions":[{"key":"serving.kubeflow.org/inferenceservice", "operator": "Exists"}]}}]}'
kubectl patch validatingwebhookconfiguration inferenceservice.serving.kubeflow.org --patch '{"webhooks":[{"name": "inferenceservice.kfserving-webhook-server.v1beta1.validator","objectSelector":{"matchExpressions":[{"key":"serving.kubeflow.org/inferenceservice", "operator": "Exists"}]}}]}'
```
I think it is related to https://github.com/kserve/kserve/issues/568#issuecomment-665353635. My testing environment is GKE v1.20.12. The error message is:
```
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': '48360d9d-9621-43e8-a580-f40d74568b19', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '3e136267-4e52-4e29-9aa1-764e7dadc339', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'b7e6649d-ea7f-442f-b7b5-7ea82514ebd3', 'Date': 'Wed, 23 Feb 2022 22:54:10 GMT', 'Content-Length': '717'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"inferenceservice.kfserving-webhook-server.v1beta1.validator\": Post \"https://kfserving-webhook-server-service.kubeflow.svc:443/validate-serving-kubeflow-org-v1beta1-inferenceservice?timeout=30s\": x509: certificate signed by unknown authority","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"inferenceservice.kfserving-webhook-server.v1beta1.validator\": Post \"https://kfserving-webhook-server-service.kubeflow.svc:443/validate-serving-kubeflow-org-v1beta1-inferenceservice?timeout=30s\": x509: certificate signed by unknown authority"}]},"code":500}
```
- I encountered the following issue for saving data during the "Kubeflow - Serve Model using KFServing" step: https://github.com/kubeflow/pipelines/blob/master/samples/contrib/kubeflow-e2e-mnist/kubeflow-e2e-mnist.ipynb
```
Traceback (most recent call last):
  File "kfservingdeployer.py", line 437, in <module>
    main()
  File "kfservingdeployer.py", line 394, in main
    for condition in model_status["status"]["conditions"]:
KeyError: 'status'
```

The InferenceService object dumped by the script just before the failure:

```
{'apiVersion': 'serving.kubeflow.org/v1beta1', 'kind': 'InferenceService', 'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'}, 'creationTimestamp': '2022-02-23T23:16:40Z', 'finalizers': ['inferenceservice.finalizers'], 'generation': 2, 'managedFields': [{'apiVersion': 'serving.kubeflow.org/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'.': {}, 'f:sidecar.istio.io/inject': {}}}, 'f:spec': {'.': {}, 'f:predictor': {'.': {}, 'f:tensorflow': {'.': {}, 'f:storageUri': {}}}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2022-02-23T23:16:40Z'}, {'apiVersion': 'serving.kubeflow.org/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {}}, 'f:spec': {'f:predictor': {'f:tensorflow': {'f:name': {}, 'f:resources': {}}}}, 'f:status': {}}, 'manager': 'manager', 'operation': 'Update', 'time': '2022-02-23T23:16:40Z'}], 'name': 'mnist-e2e-v1beta1-validator', 'namespace': 'jamxl', 'resourceVersion': '110397', 'uid': '706c317d-dc10-4f6d-94c4-307e42a5d7be'}, 'spec': {'predictor': {'tensorflow': {'name': '', 'resources': {}, 'storageUri': 'pvc://end-to-end-pipeline-6wmv9-model-volume/'}}}}
```

```
time="2022-02-23T23:21:43.263Z" level=error msg="cannot save artifact /tmp/outputs/InferenceService_Status/data" argo=true error="stat /tmp/outputs/InferenceService_Status/data: no such file or directory"
Error: exit status 1
```
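The `KeyError: 'status'` happens because the object dumped above has no top-level `status` field, while the script indexes it directly. A minimal defensive sketch (a hypothetical helper, not the actual `kfservingdeployer.py` code) of how the condition check could tolerate a missing status:

```python
def is_ready(model_status: dict) -> bool:
    """Return True only when the InferenceService reports Ready=True.

    Uses .get() with defaults so an object without a 'status' field
    (e.g. one the controller has not reconciled yet) is treated as
    not-ready instead of raising KeyError: 'status'.
    """
    conditions = model_status.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )
```

With a guard like this, the deployer could keep polling until the status appears or a timeout expires, rather than crashing on the first poll.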
Do you know how to resolve the last issue? @kimwnasptd @andreyvelich
@kimwnasptd Checking in here from AWS. Attempting a vanilla installation into a fresh EKS cluster on Kubernetes 1.19, I installed the manifests using the single-line command. I cannot connect via port-forwarding, and it looks like the issue is down to the cache-deployer-deployment pod being stuck in an error state.
```
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
```
When changing the cache-deployer image back to 1.5.0 from 1.8.0, it works properly again. I saw you had run into the same issue; do you know how to resolve it? https://github.com/kubeflow/pipelines/issues/7093#issuecomment-1024466580

The kustomize override in question:

```yaml
images:
  - name: gcr.io/ml-pipeline/cache-deployer
    newTag: 1.8.0
```
After discussion and debugging, we found that issues 2 and 3 in https://github.com/kubeflow/manifests/issues/2146#issuecomment-1049328830 occur because I deployed KFServing and KServe together. My current suggestion is to deploy only one of them (KFServing), until we figure out how to migrate to KServe successfully and validate it with an updated mnist E2E script.
@zijianjoy @ryansteakley thank you very much for exposing your progress!
> Hello @kimwnasptd, which model-web-app does the central dashboard integrate with? There are KFServing and KServe ones. I am curious how to configure between these two web apps in the central dashboard.
I'll provide some instructions for this very soon, on how someone will be able to use the KServe app. I'll also make this the default app that will be used by the dashboard, but there are some rough edges right now. I'll create the issues accordingly and give a heads up here again.
> - Update KFP to v1.8.1-rc.0: https://github.com/kubeflow/pipelines/releases/tag/1.8.1-rc.0. This includes only fixes and no new features.
I'll also make a PR to update our manifests with this latest RC.
> - I encountered issues when deploying a KFServing endpoint using the mnist sample. I resolved the issue by running the following commands:
I haven't bumped into this while testing the manifests. It's also not clear to me yet why this error happened now, since we had the same KFServing 0.6.1 manifests from the KF 1.4 release. In any case, thank you James for providing instructions for handling it. I'll look more into it.
> - I encountered the following issue for saving data during the "Kubeflow - Serve Model using KFServing" step: https://github.com/kubeflow/pipelines/blob/master/samples/contrib/kubeflow-e2e-mnist/kubeflow-e2e-mnist.ipynb
I hadn't bumped into this either. It looks like the InferenceService fails to get a `status` field. I'll open a new issue to track this independently and ping you there as well to debug further.
@ryansteakley regarding https://github.com/kubeflow/manifests/issues/2146#issuecomment-1050393366, can you double-check that you are using v1.5.0-rc.1 of the manifests? That RC includes KFP 1.8.0, which AFAIK includes the fix for the cache-deployer: https://github.com/kubeflow/pipelines/pull/7273
@kimwnasptd Yes, I'm checking out the v1.5.0-rc.1 tag of the manifests to test vanilla Kubeflow on EKS 1.19, using kubectl 1.20 locally.
@kimwnasptd Google Cloud distribution is ready. 🚀
@kimwnasptd IBM IKS is ready for k8s 1.21. However, I am waiting for Knative 0.22.3 and am going to try it out on k8s 1.22.
@kimwnasptd The Kubeflow Azure distribution has remained at v1.2 for two years, while v1.5 is coming up soon. Do you know if there is any plan to release the Azure distribution with a more recent version? Who is the point of contact/representative?
I believe this is blocking Azure users from adopting Kubeflow. v1.2 is on old k8s and Istio versions, and v1.4 has no clear documentation for Azure; it is quite hard to make it run on k8s 1.20+, which is all AKS supports.
A heads up, we've cut the new RC of the manifests.
I've added a more detailed explanation in https://github.com/kubeflow/manifests/issues/2112#issuecomment-1059444792
> @kimwnasptd The Kubeflow Azure distribution has remained at v1.2 for two years, while v1.5 is coming up soon. Do you know if there is any plan to release the Azure distribution with a more recent version? Who is the point of contact/representative?
@pwzhong unfortunately I don't have any more insights on this. We've tried to reach out to the maintainers throughout the releases, but we didn't get any feedback.
@kimwnasptd Nutanix Karbon is ready for k8s 1.21. Tested with latest RC - v1.5.0-rc.2
Update from the AWS side. Status: GREEN
Given the timeframe of testing, we have tested 1.5.0-rc2 and will continue testing.
Manually verified that the current kubeflow/manifests master works with EKS 1.20, and also successfully ran the tests in https://github.com/kubeflow/manifests/tree/master/tests/e2e
Originally posted by @akartsky in https://github.com/awslabs/kubeflow-manifests/issues/91#issuecomment-1061328523
For IBM IKS, I re-ran all test cases using v1.5.0-rc.2. Everything is good on k8s 1.21. We are using KServe; KFServing is not verified.
From AWS, Status: GREEN
Successfully tested EKS 1.19, 1.20 & 1.21 and ran mnist-e2e test from this PR: https://github.com/kubeflow/manifests/pull/2164
AWS Release Tracker : https://github.com/awslabs/kubeflow-manifests/issues/91
From AWS, Status: RED
I just noticed an issue with the cache-deployer pod, even with the latest master and rc2.
I did not notice it before because this pod stays in a running state for a few seconds before going into the crash loop, and I was able to run the sample pipelines/notebook tests successfully.
@jbottum @kubeflow/release-team We would like to request an extension on the release so we can get help resolving https://github.com/kubeflow/manifests/issues/2165 on EKS. Please let me know your thoughts.
Apologies for the last-minute request. We were under the impression the issue was no longer present in the latest rc2.
@yhwang @johnugeorge @zijianjoy had you checked the cache-deployer-deployment pod in your Kubeflow deployment while testing?
It stays in a running state for a few seconds while it retries and then restarts.
@surajkota thanks for this report. If this is a reproducible bug, then I would consider it a P1, which could block the release. I am trying to understand the context, i.e. is this the caching for Pipelines: https://www.kubeflow.org/docs/components/pipelines/overview/caching/. @kimwnasptd I believe you were going to test today with RC2 + final fixes. Have you been able to reproduce the referenced issue?
@surajkota for IBM IKS, I don't see that issue and the caching function works properly. I do have a test case to verify caching and it works well.
> @yhwang @johnugeorge @zijianjoy had you checked the cache-deployer-deployment pod in your Kubeflow deployment while testing? It stays in a running state for a few seconds while it retries and then restarts.
I think we fixed the cert issue for minikube and IBM Cloud with this PR: https://github.com/kubeflow/pipelines/pull/7273
I'm not sure how EKS handles the v1 CertificateSigningRequest; maybe you can update the list of permitted subjects?
https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers
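For reference, the object involved is a v1 `CertificateSigningRequest`; a rough sketch of what the cache-deployer submits (field values are illustrative, not copied from the script):

```yaml
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: cache-server.kubeflow        # the name seen in the error message above
spec:
  request: <base64-encoded PKCS#10 CSR>
  signerName: kubernetes.io/kubelet-serving   # the signer under discussion
  usages:
    - digital signature
    - key encipherment
    - server auth
```

After `kubectl certificate approve`, the chosen signer is responsible for populating `status.certificate`; if the signer's implementation refuses the subject, the field simply stays empty, which matches the "signed certificate did not appear on the resource" error reported above.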
@surajkota It works for Nutanix K8s 1.21 as well.
@theadactyl per our discussion, here is the tracking issue for KF 1.5.
A status update, we are trying to get to the bottom of this alongside @akartsky and @surajkota.
Currently our main culprit is the K8s API Server on EKS, which can't create the certificate in `status.certificate` of the `CertificateSigningRequest` object for the KFP cache-deployer script, even though it's approved: https://github.com/kubeflow/manifests/issues/2165
We are looking into gathering more logs from the control plane to get a better overview. This seems to be specific to EKS. If we don't get to the bottom of it within the next 2 hours, I'll cut the final release, and we'll be more than happy to include any necessary fixes in a KF 1.5.1 patch release.
We've gotten to the bottom of the issue. This is a problem with any K8s cluster that does not support using `signerName: kubernetes.io/kubelet-serving` in CertificateSigningRequests, and EKS is such a case.
I want to further understand the following first:
- What is the best practice around such certificates?
- Is it a problem to give a certificate, aimed to be used by the kubelet, to the cache-deployer webhook?
- What is the long-term solution, and how quickly could it be implemented?
I'd like to have answers to the above before pushing the release button. For this I'll be delaying the release by just one more day, to take a look with a clearer mind and have a solid plan going forward.
We'll also add more technical details into https://github.com/kubeflow/manifests/issues/2165, which we'll at some point bring back to the KFP repo to discuss next steps.
cc @kubeflow/release-team
Thanks @kimwnasptd for the summary. Adding more context:
Both PRs (https://github.com/kubeflow/pipelines/pull/6668, https://github.com/kubeflow/pipelines/pull/7273) are related. The `cache-deployer-deployment` pod requests a cert for a CSR with signerName `kubernetes.io/kubelet-serving`.
EKS only issues certificates for CSRs with signerName `kubernetes.io/kubelet-serving` for actual kubelets, based on the information in the official K8s documentation:
> `kubernetes.io/kubelet-serving`: signs serving certificates that are honored as a valid kubelet serving certificate by the API server, but has no other guarantees. Never auto-approved by kube-controller-manager.
It is not supported because it is not recommended by Kubernetes upstream, and EKS believes allowing it is unsafe. Kubernetes recommends using a cert-manager controller instead, which is already being discussed here: https://github.com/kubeflow/pipelines/issues/4695. IMO this is the right long-term fix.
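If it helps the discussion, here is a minimal sketch of what the cert-manager route could look like (all names are hypothetical; the actual wiring would be decided in kubeflow/pipelines#4695). The webhook's serving certificate would come from a `Certificate` resource instead of a kubelet-signed CSR:

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: kfp-cache-selfsigned   # hypothetical Issuer name
  namespace: kubeflow
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cache-server-cert      # hypothetical Certificate name
  namespace: kubeflow
spec:
  secretName: webhook-server-tls        # Secret mounted by the cache-server pod
  dnsNames:
    - cache-server.kubeflow.svc
  issuerRef:
    name: kfp-cache-selfsigned
    kind: Issuer
```

cert-manager's CA injector can then stamp the CA bundle into the webhook configuration via the `cert-manager.io/inject-ca-from` annotation, removing the dependency on API-server signers entirely.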
But given the timeframe, I am not sure it is feasible to complete this. Since this Kubeflow release does not aim to support 1.22, an alternative for this release is to revert both PRs and use the CSR `v1beta1` API with signerName `kubernetes.io/legacy-unknown`. This would mean pipelines would only work on K8s 1.21 and below, since `kubernetes.io/legacy-unknown` is not supported in the stable `v1` CSR API, and hence it will not work on K8s 1.22 and above. https://github.com/kubeflow/pipelines/issues/4695 will need to be addressed for K8s 1.22 and above.
Another alternative is to release 1.5.1 with the right fix, i.e. using cert-manager, if other distributions do not see this as an issue.
Please let us know your thoughts.
Originally posted by @surajkota in https://github.com/kubeflow/manifests/issues/2165#issuecomment-1063597308