
Distributions readiness for KF 1.5

Open kimwnasptd opened this issue 3 years ago • 31 comments

Prev issue https://github.com/kubeflow/manifests/issues/2038

Distribution testing phase - handbook

The goal of this issue is to track the progress of distributions alongside the 1.5 release and to coordinate our communications. The first goal is to surface any issues we bump into here, so that all distros can keep an eye on problems that arise.

While we hope all distros will manage to be ready when the KF 1.5 release is out, this is sometimes impossible to achieve. In this issue we want to keep track of the progress of distributions towards the KF 1.5 release, and also note which distros will keep working on KF 1.5 even if they can't meet the release deadline.

Without further ado, here's the list of distros we have in mind:

| Distribution | Representatives | State |
| --- | --- | --- |
| Arrikto EKF | @kimwnasptd | :x: |
| Arrikto MiniKF | @kimwnasptd | :x: |
| Azure | :question: | :x: |
| AWS | @surajkota | :x: |
| Charmed Kubeflow | @DomFleischmann | :x: |
| Google Cloud | @zijianjoy | |
| IBM | @yhwang | |
| Nutanix | @johnugeorge | |
| Kubeflow with Argo CD | @DavidSpek | :x: |
| Openshift | @nakfour @LaVLaS | :x: |

So let's use this issue to expose our state while testing the KF 1.5 release, and also give users a heads up about the progress of distros with KF 1.5.

kimwnasptd avatar Feb 17 '22 19:02 kimwnasptd

We urge everyone to start their testing from the latest v1.5.0-rc.1 manifests tag. If anyone bumps into a problem, please open an issue and add a comment here as well so that we can all be in sync.

Regarding Arrikto's plans for the KF 1.5 release, we are aiming to have our products ready by the deadline as well. But even if we don't manage that, we will still be testing over the following weeks and reporting bugs.

kimwnasptd avatar Feb 17 '22 19:02 kimwnasptd

And also cc @kubeflow/release-team

kimwnasptd avatar Feb 17 '22 19:02 kimwnasptd

Hello @kimwnasptd, which model-web-app does the central dashboard integrate with? There are KFServing and KServe versions. I am curious how to configure which of these two web apps the central dashboard uses.

zijianjoy avatar Feb 18 '22 17:02 zijianjoy

Created a tracking issue for AWS distribution work - https://github.com/awslabs/kubeflow-manifests/issues/91

We are aiming to have generic/vanilla Kubeflow, i.e. the manifests as-is from this repository, working on EKS as part of the distribution testing phase. Other features and the release will follow.

surajkota avatar Feb 18 '22 23:02 surajkota

Hello Kimonas, I would like to provide an update, which requires changes to the manifests, as we are validating the Google Cloud distribution.

  1. Update KFP to v1.8.1-rc.0: https://github.com/kubeflow/pipelines/releases/tag/1.8.1-rc.0. This includes only fixes and no new features.
  2. I encountered issues when deploying a KFServing endpoint using the mnist sample. I resolved the issue by running the following commands:

```shell
kubectl patch mutatingwebhookconfiguration inferenceservice.serving.kubeflow.org --patch '{"webhooks":[{"name": "inferenceservice.kfserving-webhook-server.v1beta1.defaulter","objectSelector":{"matchExpressions":[{"key":"serving.kubeflow.org/inferenceservice", "operator": "Exists"}]}}]}'

kubectl patch ValidatingWebhookConfiguration inferenceservice.serving.kubeflow.org --patch '{"webhooks":[{"name": "inferenceservice.kfserving-webhook-server.v1beta1.validator","objectSelector":{"matchExpressions":[{"key":"serving.kubeflow.org/inferenceservice", "operator": "Exists"}]}}]}'
```
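For reference, these patches narrow each webhook with an objectSelector so it only fires on resources carrying the `serving.kubeflow.org/inferenceservice` label. As a sketch, the resulting webhook entry looks roughly like this (reconstructed from the patch payloads above, not copied from a live cluster):

```yaml
# Sketch of the relevant part of the patched ValidatingWebhookConfiguration.
# Only the objectSelector is added by the patch; other fields stay unchanged.
webhooks:
  - name: inferenceservice.kfserving-webhook-server.v1beta1.validator
    objectSelector:
      matchExpressions:
        - key: serving.kubeflow.org/inferenceservice
          operator: Exists
```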

I think it is related to https://github.com/kserve/kserve/issues/568#issuecomment-665353635. My testing environment is GKE v1.20.12. The error message is:

```
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': '48360d9d-9621-43e8-a580-f40d74568b19', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '3e136267-4e52-4e29-9aa1-764e7dadc339', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'b7e6649d-ea7f-442f-b7b5-7ea82514ebd3', 'Date': 'Wed, 23 Feb 2022 22:54:10 GMT', 'Content-Length': '717'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"inferenceservice.kfserving-webhook-server.v1beta1.validator\": Post \"https://kfserving-webhook-server-service.kubeflow.svc:443/validate-serving-kubeflow-org-v1beta1-inferenceservice?timeout=30s\": x509: certificate signed by unknown authority","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"inferenceservice.kfserving-webhook-server.v1beta1.validator\": Post \"https://kfserving-webhook-server-service.kubeflow.svc:443/validate-serving-kubeflow-org-v1beta1-inferenceservice?timeout=30s\": x509: certificate signed by unknown authority"}]},"code":500}
```
  3. I encountered the following issue saving data during the Kubeflow - Serve Model using KFServing step of https://github.com/kubeflow/pipelines/blob/master/samples/contrib/kubeflow-e2e-mnist/kubeflow-e2e-mnist.ipynb:
```
{'apiVersion': 'serving.kubeflow.org/v1beta1', 'kind': 'InferenceService', 'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'}, 'creationTimestamp': '2022-02-23T23:16:40Z', 'finalizers': ['inferenceservice.finalizers'], 'generation': 2, 'managedFields': [{'apiVersion': 'serving.kubeflow.org/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'.': {}, 'f:sidecar.istio.io/inject': {}}}, 'f:spec': {'.': {}, 'f:predictor': {'.': {}, 'f:tensorflow': {'.': {}, 'f:storageUri': {}}}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2022-02-23T23:16:40Z'}, {'apiVersion': 'serving.kubeflow.org/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {}}, 'f:spec': {'f:predictor': {'f:tensorflow': {'f:name': {}, 'f:resources': {}}}}, 'f:status': {}}, 'manager': 'manager', 'operation': 'Update', 'time': '2022-02-23T23:16:40Z'}], 'name': 'mnist-e2e-v1beta1-validator', 'namespace': 'jamxl', 'resourceVersion': '110397', 'uid': '706c317d-dc10-4f6d-94c4-307e42a5d7be'}, 'spec': {'predictor': {'tensorflow': {'name': '', 'resources': {}, 'storageUri': 'pvc://end-to-end-pipeline-6wmv9-model-volume/'}}}}
Traceback (most recent call last):
  File "kfservingdeployer.py", line 437, in <module>
    main()
  File "kfservingdeployer.py", line 394, in main
    for condition in model_status["status"]["conditions"]:
KeyError: 'status'
time="2022-02-23T23:21:43.263Z" level=error msg="cannot save artifact /tmp/outputs/InferenceService_Status/data" argo=true error="stat /tmp/outputs/InferenceService_Status/data: no such file or directory"
Error: exit status 1
```
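The KeyError suggests the script indexes model_status["status"] before the InferenceService controller has written a status block. A minimal defensive sketch of that check, assuming the client returns the resource as a plain dict (the function name and Ready-condition logic here are illustrative, not the actual kfservingdeployer.py code):

```python
# Illustrative sketch, not the actual kfservingdeployer.py code: guard the
# status lookup so a freshly created InferenceService (which has no "status"
# key yet, as in the traceback above) yields "not ready" instead of a KeyError.

def wait_ready(model_status: dict) -> bool:
    """Return True only if a Ready condition with status "True" is present."""
    conditions = model_status.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )

# A just-created InferenceService, before the controller updates it:
fresh = {"apiVersion": "serving.kubeflow.org/v1beta1", "kind": "InferenceService"}
print(wait_ready(fresh))  # False (the original code raised KeyError here)

# After the controller reports readiness:
ready = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
print(wait_ready(ready))  # True
```

In the real script, a deploy loop could retry until this returns True or a timeout expires, instead of crashing on the missing key.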

Do you know how to resolve the last issue? @kimwnasptd @andreyvelich

zijianjoy avatar Feb 23 '22 23:02 zijianjoy

@kimwnasptd Checking in here from AWS. Attempting a vanilla installation into a fresh EKS cluster on Kubernetes 1.19, I installed the manifests using the single-line command. I cannot connect via port-forwarding, and it looks like the issue is down to the cache-deployer-deployment pod being stuck in an error state.

```
echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
```

When changing the cache-deployer image back to 1.5.0 from 1.8.0, it works properly again. I saw you had run into the same issue; do you know how to resolve it? https://github.com/kubeflow/pipelines/issues/7093#issuecomment-1024466580

```yaml
images:
  - name: gcr.io/ml-pipeline/cache-deployer
    newTag: "1.8.0"
```

ryansteakley avatar Feb 25 '22 00:02 ryansteakley

After discussion and debugging, we found that issues 2 and 3 in https://github.com/kubeflow/manifests/issues/2146#issuecomment-1049328830 occur because I deployed KFServing and KServe together. My current suggestion is to deploy only one of them (KFServing), until we figure out how to migrate to KServe successfully and validate it using an updated mnist E2E script.

zijianjoy avatar Feb 25 '22 20:02 zijianjoy

@zijianjoy @ryansteakley thank you very much for exposing your progress!

> Hello @kimwnasptd , which model-web-app does central dashboard integrate with? There are KFserving and KServe. I am curious how to configure between these two web apps in the central dashboard.

I'll provide some instructions for this very soon, on how someone will be able to use the KServe app. I'll also make this the default app that will be used by the dashboard, but there are some rough edges right now. I'll create the issues accordingly and give a heads up here again.

> 1. Update KFP to v1.8.1-rc.0: https://github.com/kubeflow/pipelines/releases/tag/1.8.1-rc.0. This includes only fixes and no feature.

I'll also make a PR to update our manifests with this latest RC.

> 2. I encountered issues when deploying kfserving endpoint using mnist sample. The way I resolved this issue is by running the following command:

I haven't bumped into this while testing the manifests. It's also not clear to me yet why this error happened now, since we had the same KFServing 0.6.1 manifests from the KF 1.4 release. In any case, thank you James for providing instructions for handling it. I'll look more into it.

> 3. I encountered the following issue for saving data during the Kubeflow - Serve Model using KFServing step: https://github.com/kubeflow/pipelines/blob/master/samples/contrib/kubeflow-e2e-mnist/kubeflow-e2e-mnist.ipynb

I hadn't bumped into this either. It looks like the InferenceService never gets a status field. I'll open a new issue to track this independently and ping you there to debug further.

kimwnasptd avatar Feb 25 '22 23:02 kimwnasptd

@ryansteakley regarding https://github.com/kubeflow/manifests/issues/2146#issuecomment-1050393366 can you double check you are using the v1.5.0-rc.1 of the manifests?

That RC includes KFP 1.8.0, which in turn includes the fix for the cache-deployer AFAIK https://github.com/kubeflow/pipelines/pull/7273

kimwnasptd avatar Feb 25 '22 23:02 kimwnasptd

@kimwnasptd Yes, I'm checking out the v1.5.0-rc.1 tag of the manifests to test the vanilla kubeflow on EKS 1.19 using 1.20 kubectl locally.

ryansteakley avatar Feb 26 '22 05:02 ryansteakley

@kimwnasptd Google Cloud distribution is ready. 🚀

zijianjoy avatar Feb 26 '22 06:02 zijianjoy

@kimwnasptd IBM IKS is ready for k8s 1.21. However, I am waiting for Knative 0.22.3 and will then try it out on k8s 1.22.

yhwang avatar Feb 28 '22 21:02 yhwang

@kimwnasptd The Kubeflow Azure distribution has remained at v1.2 for two years, while v1.5 is coming up soon. Do you know if there is any plan to release the Azure distribution with a more recent version? Who is the point of contact/representative?

I believe this is blocking Azure users from using Kubeflow. v1.2 is on old k8s and Istio versions, v1.4 has no clear documentation for Azure, and it is quite hard to make it run on k8s 1.20+, which is all AKS supports.

pwzhong avatar Mar 01 '22 04:03 pwzhong

A heads up, we've cut the new RC of the manifests.

I've added a more detailed explanation in https://github.com/kubeflow/manifests/issues/2112#issuecomment-1059444792

kimwnasptd avatar Mar 04 '22 19:03 kimwnasptd

> @kimwnasptd Kubeflow Azure distribution has remained in v1.2 for two years, while v1.5 is coming up soon. Do you know if there is any plan to release Azure distribution with a more recent version? Who is the point of contact/representative?

@pwzhong unfortunately I don't have any more insights on this. We've tried to reach out to the maintainers throughout the releases, but we didn't get any feedback.

kimwnasptd avatar Mar 04 '22 19:03 kimwnasptd

@kimwnasptd Nutanix Karbon is ready for k8s 1.21. Tested with latest RC - v1.5.0-rc.2

johnugeorge avatar Mar 08 '22 10:03 johnugeorge

Update from the AWS side. Status: GREEN

Given the timeframe of testing, we have tested 1.5.0-rc2 and will continue testing.

We manually tested that the current kubeflow/manifests master works with EKS 1.20, and also successfully ran https://github.com/kubeflow/manifests/tree/master/tests/e2e

Originally posted by @akartsky in https://github.com/awslabs/kubeflow-manifests/issues/91#issuecomment-1061328523

surajkota avatar Mar 08 '22 18:03 surajkota

For IBM IKS, I re-ran all test cases using v1.5.0-rc.2. Everything is good on k8s 1.21. We are using KServe; KFServing is not verified.

yhwang avatar Mar 08 '22 18:03 yhwang

From AWS, Status: GREEN

Successfully tested EKS 1.19, 1.20 & 1.21 and ran mnist-e2e test from this PR: https://github.com/kubeflow/manifests/pull/2164

AWS Release Tracker : https://github.com/awslabs/kubeflow-manifests/issues/91

akartsky avatar Mar 09 '22 01:03 akartsky

From AWS, Status: RED

I just noticed an issue with the cache-deployer Pod, even with the latest master and rc2.

I did not notice it before because this pod stays in a running state for a few seconds before going into the crash loop, and I was able to run the sample pipelines/notebooks tests successfully.

akartsky avatar Mar 09 '22 03:03 akartsky

@jbottum, @kubeflow/release-team We would like to request an extension on the release so we can get help resolving https://github.com/kubeflow/manifests/issues/2165 on EKS. Please let me know your thoughts.

Apologies for the last-minute request. We were under the impression the issue was no longer present in the latest rc2.

surajkota avatar Mar 09 '22 03:03 surajkota

@yhwang @johnugeorge @zijianjoy have you checked the cache-deployer-deployment pod in your Kubeflow deployment while testing?

It stays in a running state for a few seconds while it retries, and then restarts.

surajkota avatar Mar 09 '22 06:03 surajkota

@surajkota thanks for this report. If this is a reproducible bug, then I would consider it a P1, which could block the release. I am trying to understand the context, i.e. is this the caching for Pipelines, https://www.kubeflow.org/docs/components/pipelines/overview/caching/? @kimwnasptd I believe you were going to test today with RC2 + final fixes. Have you been able to reproduce the referenced issue?

jbottum avatar Mar 09 '22 16:03 jbottum

@surajkota for IBM IKS, I don't see that issue and the caching function works properly. I do have a test case to verify caching and it works well.

yhwang avatar Mar 09 '22 18:03 yhwang

> @yhwang @johnugeorge @zijianjoy had you checked the cache-deployer-deployment pod in your Kubeflow deployment while testing?
>
> It stays in running state for a few seconds while it retries and then restarts

I think we fixed the cert issue for minikube and IBM Cloud with this PR https://github.com/kubeflow/pipelines/pull/7273

I'm not sure how EKS handles the v1 CertificateSigningRequest, maybe you can update the list of Permitted subjects? https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers
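For context, the request the cache-deployer makes looks roughly like this (a sketch reconstructed from the discussion above; the exact CSR created by KFP may differ):

```yaml
# Sketch of a v1 CertificateSigningRequest using the kubelet-serving signer.
# As noted below, EKS will not issue a certificate for this signer unless
# the requester is an actual kubelet.
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: cache-server.kubeflow
spec:
  signerName: kubernetes.io/kubelet-serving
  request: <base64-encoded PKCS#10 CSR>
  usages:
    - digital signature
    - key encipherment
    - server auth
```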

Tomcli avatar Mar 09 '22 18:03 Tomcli

@surajkota It works for Nutanix K8s 1.21 as well.

johnugeorge avatar Mar 09 '22 18:03 johnugeorge

@theadactyl per our discussion, here is the tracking issue for KF 1.5.

jbottum avatar Mar 09 '22 19:03 jbottum

A status update: we are trying to get to the bottom of this alongside @akartsky and @surajkota.

Currently our main culprit is the K8s API server on EKS, which doesn't populate the certificate in status.certificate on the CertificateSigningRequest object created by the KFP cache-deployer script, even though the CSR is approved: https://github.com/kubeflow/manifests/issues/2165

We are looking into gathering more logs from the control plane to get a better overview. This seems to be specific to EKS. If we don't get to the bottom of it within the next 2 hours, I'll cut the final release, and we'll be more than happy to include any fixes necessary in a KF 1.5.1 patch release.

kimwnasptd avatar Mar 09 '22 20:03 kimwnasptd

We've gotten to the bottom of the issue. This is a problem with any K8s cluster that does not support using signerName: kubernetes.io/kubelet-serving in CertificateSigningRequests, and EKS is such a case.

I want to further understand the following first:

  1. What is the best practice around such certificates?
  2. Is it a problem to give a certificate, aimed to be used by kubelet, to the cache-deployer webhook?
  3. What is the long term solution and how quickly could it be implemented?

I'd like to have answers to the above before pushing the release button. For this I'll delay the release by just one more day, to take a look with a clearer mind and form a solid plan going forward.

We'll also add more technical details into https://github.com/kubeflow/manifests/issues/2165, which we'll at some point bring back to the KFP repo to discuss next steps.

cc @kubeflow/release-team

kimwnasptd avatar Mar 09 '22 22:03 kimwnasptd

Thanks @kimwnasptd for the summary. Adding more context:

Both PRs (https://github.com/kubeflow/pipelines/pull/6668, https://github.com/kubeflow/pipelines/pull/7273) are related: the cache-deployer-deployment pod requests a cert for a CSR with signerName kubernetes.io/kubelet-serving.

EKS only issues certificates for CSRs with signerName kubernetes.io/kubelet-serving to actual kubelets, based on the official K8s documentation:

> `kubernetes.io/kubelet-serving`: signs serving certificates that are honored as a valid kubelet serving certificate by the API server, but has no other guarantees. Never auto-approved by kube-controller-manager.

It is not supported because it is not recommended in Kubernetes upstream, and EKS believes allowing it is unsafe. Kubernetes recommends using a cert-manager controller instead, which is already being discussed here: https://github.com/kubeflow/pipelines/issues/4695. IMO this is the right long-term fix.

But given the timeframe, I am not sure it is feasible to complete this. Since this Kubeflow release does not aim to support 1.22, an alternative for this release is to revert both PRs and use the CSR v1beta1 API with the signerName legacy-unknown. This would mean pipelines would only work on K8s 1.21 and below, since kubernetes.io/legacy-unknown is not supported in the stable v1 CSR API, and hence it would not work on K8s 1.22 and above. https://github.com/kubeflow/pipelines/issues/4695 will need to be addressed for K8s 1.22 and above.

Another alternative is to release 1.5.1 with the right fix, i.e. using cert-manager, if other distributions do not see this as an issue.
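As a sketch, the cert-manager-based approach could pair a self-signed Issuer with a Certificate for the webhook Service (all names here are hypothetical; the actual fix discussed in kubeflow/pipelines#4695 may look different):

```yaml
# Hypothetical cert-manager resources replacing the CSR-based flow.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: cache-selfsigned-issuer   # hypothetical name
  namespace: kubeflow
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cache-server-cert         # hypothetical name
  namespace: kubeflow
spec:
  secretName: webhook-server-tls  # mounted by the webhook deployment
  dnsNames:
    - cache-server.kubeflow.svc
  issuerRef:
    name: cache-selfsigned-issuer
```

This sidesteps the signerName restriction entirely, since the certificate never goes through the K8s CSR API.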

Please let us know your thoughts on this

Originally posted by @surajkota in https://github.com/kubeflow/manifests/issues/2165#issuecomment-1063597308

surajkota avatar Mar 10 '22 02:03 surajkota