Katib x Kale x Multi-User Kubeflow doesn't support userid injection
/kind bug
TL;DR: Istio is needed to inject the kubeflow-userid header, but Kale disables the Istio sidecar on Katib trial pods because Katib has issues with it in some edge cases (?).
What steps did you take and what happened:
- Kale creates Experiment/Trials/Jobs via template
- Jobs create pods
- Pods start off in Running
- Pods end up in Error
- No pipelines are created
- Experiment hangs
What did you expect to happen:
- Kale creates Trials/Jobs via template
- Jobs create pods
- Pods start off in Running
- Pods make requests to ml-pipeline with the right RBAC
- Envoy injects kubeflow-userid/prefix
- ml-service authenticates/authorizes request from pods
- Pods end up in Completed
- Experiment runs to completion
Anything else you would like to add:
- I use an envoy filter for kubeflow-userid header injection in multi-user Kubeflow (a sketch of what I mean is shown after this list)
- Kale injects `sidecar.istio.io/inject=false`
- The kubeflow-userid header is therefore never added, and none of my Job's pods can authenticate with the ml-pipeline service
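For reference, the kind of EnvoyFilter I mean is roughly the following. This is a minimal sketch, not my exact manifest; the filter name, namespace and the hard-coded user id are placeholders:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: add-kubeflow-userid        # placeholder name
  namespace: kubeflow-user         # the profile namespace
spec:
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_OUTBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
            subFilter:
              name: envoy.filters.http.router
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.lua
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
          inlineCode: |
            -- Stamp the profile owner onto every outbound request from this
            -- namespace's sidecars, so ml-pipeline can authorize the caller.
            function envoy_on_request(request_handle)
              request_handle:headers():replace("kubeflow-userid", "user@example.com")
            end
```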
Workarounds:
- Write my own Job admission controller
- Modify the ml-pipeline-service to parse the user from the JWT token passed through poddefaults
- Stand up a proxy in front of the ml-pipeline service (:8888) that adds the kubeflow-userid header
I'm not thrilled by any of these options. Can you suggest a better workaround?
Environment:
- Kubeflow version (`kfctl version`): kfctl v1.2.0-0-gbc038f9
- ~Minikube version (`minikube version`)~: kind v0.10.0 go1.15.7 linux/amd64
- Kubernetes version (`kubectl version`): v1.20.2
- OS (e.g. from `/etc/os-release`): Debian
Hi Prashanth,
We are having the same issue. Our hack was the following:
1. Edit the Pipelines profiles composite controller (https://github.com/kubeflow/manifests/tree/master/apps/pipeline/upstream/installs/multi-user/pipelines-profile-controller) so that it creates, for every profile, a ConfigMap that contains kubeflow-userid as a key, for example `"KUBEFLOW_USERID": user_id` (a sketch of the resulting ConfigMap is shown after these steps).
2. Edit the trial template in https://github.com/kubeflow-kale/kale/blob/master/backend/kale/rpc/katib.py to read the environment variables from the new ConfigMap:
```yaml
- name: {{.Trial}}
  image: {image}
  envFrom:
  - configMapRef:
      name: userid-configmap
  ...
```
3. Edit KF Pipelines at https://github.com/kubeflow/pipelines/blob/master/backend/api/python_http_client/kfp_server_api/rest.py to read the environment variable and add it to the headers before contacting the ml-pipeline service. Add the following lines to the request() function:
```python
# Note: this needs `import os` at the top of rest.py if it is not already there.
if 'kubeflow-userid' not in headers:
    if 'KUBEFLOW_USERID' in os.environ:
        headers['kubeflow-userid'] = os.environ['KUBEFLOW_USERID']
```
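For step 1, the ConfigMap the profile controller stamps out per namespace would look roughly like this (a sketch; the name matches the userid-configmap referenced in the trial template above, and the user id is the profile owner):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: userid-configmap        # referenced by envFrom in the trial template above
  namespace: kubeflow-user      # one ConfigMap per profile namespace
data:
  # Set by the pipelines-profile-controller sync hook to the profile owner
  KUBEFLOW_USERID: user@example.com
```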
With this we avoid the need for envoyfilter. If there are better solutions for this, I'd love to hear about them.
Interesting solution. So you fork and maintain your own libs and images?
What I was thinking was:
1. Install PodDefaults that mount a bearer token into every pod
2. Use the same PodDefaults to expose this bearer token as an env var
3. Get that env var into the `authorization: Bearer <token>` HTTP header of the request sent to the pipeline service by the kfp client (if you're using the Arrikto distribution of the Kubeflow Pipelines SDK, setting ML_PIPELINE_SA_TOKEN_PATH will do this for you without code changes; a sketch for the upstream SDK follows this list)
4. Run a proxy server in the ml-pipeline-service pod that listens on 8888, parses the bearer token, decodes the JWT and inserts the namespace as the `kubeflow-userid` header before proxying to the ml-service
The only "new" piece of this solution is 4, since 1,2,3 are required to authenticate from a Jupyter notebook to the ml-service anyway.
This is the PodDefault:
```yaml
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: access-ml-pipeline
  namespace: $NS
spec:
  desc: Allow access to Kubeflow Pipelines
  env:
  # This is the value of the environment variable put into pods that need
  # access to the ml-pipeline
  - name: ML_PIPELINE_SA_TOKEN_PATH
    value: /var/run/secrets/ml-pipeline/token
  # Pods bearing this label will get the env configuration above, as well as
  # the volumeMounts below.
  selector:
    matchLabels:
      access-ml-pipeline: "true"
  volumeMounts:
  - mountPath: /var/run/secrets/ml-pipeline
    name: volume-ml-pipeline-token
    readOnly: true
  volumes:
  - name: volume-ml-pipeline-token
    projected:
      sources:
      # This service account is managed by the kubelet transparently. Hence it's
      # left nameless.
      - serviceAccountToken:
          audience: ml-pipeline
          expirationSeconds: 99999
          # This is the path under the mountPath the token is copied into
          path: token
```
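Any pod that should get the token just needs the selector label. A throwaway test pod to verify the PodDefault fires could look like this (sketch; pod name and image are arbitrary):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: token-test                 # disposable pod just to check the webhook
  namespace: $NS
  labels:
    access-ml-pipeline: "true"     # matches the PodDefault selector above
spec:
  containers:
  - name: main
    image: python:3.8
    command: ["sleep", "infinity"]
```

If the PodDefaults webhook applied it, the pod should have ML_PIPELINE_SA_TOKEN_PATH in its environment and the token file at /var/run/secrets/ml-pipeline/token.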
E.g. the decoded bearer token should look something like this:
```json
{
  "aud": [
    "ml-pipeline"
  ],
  "exp": 1615044345,
  "iat": 1614944346,
  "iss": "https://kubernetes.default.svc.cluster.local",
  "kubernetes.io": {
    "namespace": "<ns>",
    "pod": {
      "name": "<pod name>"
    },
    "serviceaccount": {
      "name": "pipeline-runner"
    }
  },
  "nbf": 1614944346,
  "sub": "system:serviceaccount:<your ns>:pipeline-runner"
}
```
So you can parse that `sub` field out and get the user. Actually I believe even giving that entire `sub` string to the ml-pipeline-service will work, since it's just going to ask the Kubernetes API server whether it can access the pipeline (i.e. effectively run `kubectl auth can-i -n $NS --as system:serviceaccount:<your ns>:pipeline-runner ...`).
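To make the header derivation inside the step-4 proxy concrete, it could be as simple as this sketch. It does no signature verification, which I'm assuming is acceptable here since the ml-pipeline service only uses the value to ask the API server for authorization; the function name is mine:

```python
import base64
import json


def kubeflow_userid_from_token(token: str) -> str:
    """Return the value to put in the kubeflow-userid header.

    `token` is the projected service account token (a JWT). We only read
    its payload, without verifying the signature, and hand back the `sub`
    field, e.g. "system:serviceaccount:<ns>:pipeline-runner".
    """
    payload_b64 = token.split(".")[1]
    # JWTs use unpadded base64url; add the padding back before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["sub"]


if __name__ == "__main__":
    with open("/var/run/secrets/ml-pipeline/token") as f:
        print(kubeflow_userid_from_token(f.read().strip()))
```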
That said, I haven't got this working yet. There's some issue with my admission controller not adding the required poddefaults to the pods created by the Katib Job that I still need to figure out.
I believe this issue is proposing something similar, at a high level: https://github.com/kubeflow/pipelines/issues/5138
The appeal of my solution (should I get it to work) is that I don't need to maintain my own forks of the code. @dejangolubovic how much of a management overhead is this? How often do you rebase? Did you try just using envoy with Katib and seeing what happens?
@prashanthb-ai, very interesting. It is an elegant solution, I like that it does not require additional manual setup. Hope you get it working soon.
For our setup, yes, we maintain our own Kubeflow manifests repo. The idea is to isolate upstream components from our own patches, in order to perform quick updates when necessary. We will open-source the setup and present it publicly once we reach a level of automation we are satisfied with.
About the overhead, composite controller (step 1 from my solution) does not require any additional setup. Once it's configured to create a configmap, it will do it for every profile and we don't have to intervene.
Steps 2 and 3 are a bit different. We maintain an image with specific versions of Kale and kfp, which will require work once we update to newer versions. We plan to perform updates with every release of Kale and kfp. This will require merging our patch periodically, unless we find a better solution or it's implemented upstream.
About envoyfilter with Katib, I tried and it did not work. Sidecar injection is disabled on Katib pods, so envoyfilter is never created. Using envoyfilter does work with Kale though, but only when submitting pipelines, not Katib jobs. Kale submits pipelines directly from the notebook's pod, which has an istio sidecar and can utilize envoyfilter, unlike Katib pods.
Very curious to see your setup. I think it's great that you're planning to present it, open-source it, and keep it updated in line with Kale and kfp.
I'm also starting to get the sense that to use these components reliably one needs to fork and rebase in lock-step with individual component releases. There's too much fragmentation otherwise.
@dejangolubovic by disabled you mean the `sidecar.istio.io/inject=false` annotation, right? I'm wondering why it's set to false in the first place. It sounds like it is not necessary (https://github.com/kubeflow/katib/issues/1374#issuecomment-721102947) and that the original problem (https://github.com/kubeflow/katib/issues/955) has several potential solutions: https://github.com/istio/istio/issues/11659#issuecomment-462364211. However, it seems to be currently hardcoded by the Kale-Katib template. I think I'll open and cross-reference an issue in Kale.
@prashanthb-ai, yes, I meant only this "disable sidecars" annotation. Great, thank you, I wasn't aware of these solutions or that it might be possible to run Katib jobs without disabling Istio. I will definitely try our setup using only the envoyfilter and without disabling sidecars, and I'll let you know how it goes.
It's a good idea to open the issue in Kale; making the Katib job template configurable would make things easier. Let's see if there are proposals.
Hi everyone!
Although this is the Katib repository, I'd like to inform you that we (Arrikto) have exposed our design for KFP client authentication in https://github.com/kubeflow/pipelines/issues/5138 and submitted two PRs implementing it: https://github.com/kubeflow/pipelines/pull/5286, https://github.com/kubeflow/pipelines/pull/5287
Feel free to test and provide feedback :)
I would add that this is a pervasive issue with Kubeflow. I have a similar issue with the header, but not with Katib. In my case I am trying to move the envoy filter from the ingressgateway to a `SIDECAR_INBOUND` context. When you do this, everything works fine except for services that require the `kubeflow-userid` header for access; in my case JWA fails to load properly because the header is missing. I think Kubeflow by default needs to inject this header, or at least expose an option in the KF manifests to inject it into all requests after the user has been authenticated. @dejangolubovic does what I was thinking, but I'd rather not modify the Kubeflow source code; I was thinking more or less of injecting a script after authentication and adding the header from there. This is a legitimate issue because the current Kubeflow authentication flow severely limits cluster usability for many of its users, i.e. not everyone has a dedicated cluster for Kubeflow; we have other services running on the cluster and Kubeflow auth breaks those scenarios.
I have an open issue on this but it is not getting traction. https://github.com/kubeflow/manifests/issues/1834
@aaron-arellano as an intermediate hack, I ended up just disabling header based auth with a sidecar on the ml-pipeline: https://gist.github.com/prashanthb-ai/dd85348b6752f2ef3385947c32a50784
Of course I've only deferred the problem until someone (or I) has time to figure out a better solution.
Also note that the problem here is that we're locked out from user-id envoy injection because of the hardcoded annotation in Katib.
In your case, I'm wondering if it's possible to route all company traffic through an uber ingress that routes, say, `Host: kubeflow.yourcompany.org` to the `istio-system/istio-ingressgateway` service? On my cluster the ingressgateway is just a `NodePort` service, not an actual Ingress.
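E.g. a plain Kubernetes Ingress living next to the gateway could do that routing. A sketch, assuming your cluster has an ingress controller; the host is a placeholder:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kubeflow
  namespace: istio-system           # Ingress backends must live in the same namespace
spec:
  rules:
  - host: kubeflow.yourcompany.org
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: istio-ingressgateway
            port:
              number: 80
```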
@prashanthb-ai I tried your method, but it did not work.
@Bowen0729 what did you try? Disabling header auth by routing through a hack proxy, as indicated in my previous comment? If so, you need to ensure the proxy is injecting the expected header. The userid header needs to match your profile: https://www.kubeflow.org/docs/components/multi-tenancy/getting-started/#automatic-creation-of-profiles.
You can debug by trying curl commands from a pod to the ml service, e.g.:
```sh
curl ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/experiments -X POST \
  -d '{"name": "katib-6b6coz", "resource_references": [{"key": {"type": "NAMESPACE", "id": "kubeflow-user"}, "relationship": "OWNER"}]}' \
  -H "authorization: Bearer <insert your bearer token - this is probably the token mounted at /var/run/secrets/ml-pipeline/token, i.e. the file at $ML_PIPELINE_SA_TOKEN_PATH>"
```
and see how/if it passes.
@prashanthb-ai Thank you for your reply.
Actually I'm coming from another issue: https://github.com/kubeflow-kale/kale/issues/276#issuecomment-792287143
I have a similar error to that issue: I run a Katib job via Kale, but I get an error in the pod (image: katib-kfp-trial:a7f7bb79-d9bf99ac):
```
File "/usr/local/lib/python3.6/site-packages/kale/common/katibutils.py", line 152, in create_and_wait_kfp_run
  kwargs
......
File "/usr/local/lib/python3.6/site-packages/http/client.py", line 1264, in putheader
  if _is_illegal_header_value(values[i]):
TypeError: expected string or bytes-like object
```
When I exec create_and_wait_kfp_run() in my notebook, it runs fine, which confuses me a lot.
Hello, my problem is the same as yours. Have you solved it?
No. I turned off Kubeflow's multi-user mode entirely and then it worked. I think multi-user support is not fully wired up across Katib, Kale, and Pipelines, which causes a lot of baffling problems.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@Bowen0729 @longpi1 Did you solve the issue?
When I first tried to run a Katib job with Kale, the job errored with the following message:
```
Error creating: pods "XXX-e80p0-pklj65dv-" is forbidden: error looking up service account test-user/pipeline-runner: serviceaccount "pipeline-runner" not found
```
The pipeline-runner service account only exists in the main "kubeflow" namespace and not in the "test-user" namespace.
So I created a new service account "pipeline-runner" in the "test-user" namespace, which finally led to the creation of the trial pods.
However, now I get the exact same error with `TypeError: expected string or bytes-like object`. I have set up multi-user mode with PodDefault access and injection of KF_PIPELINES_SA_TOKEN_PATH. It seems that the pod is not allowed to authenticate.
Did anyone figure out how to solve it with Katib jobs? Normal pipeline runs launched from the notebook work fine. It does not work with Katib enabled.
I disabled multi-user mode.
Ok, I was actually able to solve the issue now and run AutoML in a multi-user setting:
- Create a PodDefault as described here, but add an additional environment variable, since Arrikto uses different env variable names (as described above by @prashanthb-ai):
```yaml
env:
- name: KF_PIPELINES_SA_TOKEN_PATH
  value: /var/run/secrets/kubeflow/pipelines/token
- name: ML_PIPELINE_SA_TOKEN_PATH
  value: /var/run/secrets/kubeflow/pipelines/token
```
- Then simply create a new ServiceAccount with the name "pipeline-runner" in the Kubeflow user namespace (a minimal manifest is sketched after these steps).
- Bind the "kubeflow-edit" ClusterRole to the newly created ServiceAccount (note: recreating the "pipeline-runner" role from the kubeflow namespace and binding to it did not work for me):
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app.kubernetes.io/component: ml-pipeline
    app.kubernetes.io/name: kubeflow-pipelines
    application-crd-id: kubeflow-pipelines
  name: pipeline-runner-binding
  namespace: kubeflow-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeflow-edit
subjects:
- kind: ServiceAccount
  name: pipeline-runner
  namespace: kubeflow-user
```
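For completeness, the ServiceAccount from the second step is just this (minimal sketch, same names as in the RoleBinding above):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pipeline-runner
  namespace: kubeflow-user
```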
I hope it helps anyone trying to run AutoML with Katib and Kale in multi-user setting.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.