Katib x Kale x Multi-User Kubeflow doesn't support userid injection

Open prashanthb-ai opened this issue 3 years ago • 18 comments

/kind bug

TL;DR: Istio is needed for kubeflow-userid header. Istio is disabled by Kale, because Katib has issues in some edge cases (?).

What steps did you take and what happened:

  • Kale creates Experiment/Trials/Jobs via template
  • Jobs create pods
  • Pods start off in Running
  • Pods end up in Error
  • No pipelines are created
  • Experiment hangs

What did you expect to happen:

  • Kale creates Trials/Jobs via template
  • Jobs create pods
  • Pods start off in Running
    • Pods make requests to ml-pipeline with the right RBAC
    • Envoy injects kubeflow-userid/prefix
    • ml-service authenticates/authorizes request from pods
  • Pods end up in Completed
  • Experiment runs to completion

Anything else you would like to add:

  • I use an envoy filter for kubeflow-userid header injection in multi-user Kubeflow
  • Kale injects sidecar.istio.io/inject=false
  • The kubeflow-userid header is never added, so none of my Job's pods can authenticate with the ml-pipeline-service

Workarounds:

  1. Write my own Job admission controller
  2. Modify the ml-pipeline-service to parse the user from the JWT token passed through poddefaults
  3. Stand a proxy in front (:8888) of the ml-service that adds the kubeflow-userid header

I'm not thrilled by any of these options. Can you suggest a better workaround?

Environment:

  • Kubeflow version (kfctl version): kfctl v1.2.0-0-gbc038f9
  • ~Minikube version (minikube version)~: kind v0.10.0 go1.15.7 linux/amd64
  • Kubernetes version: (use kubectl version): v1.20.2
  • OS (e.g. from /etc/os-release): Debian

prashanthb-ai avatar Mar 05 '21 03:03 prashanthb-ai

Hi Prashanth,

We are having the same issue. Our hack was the following:

  1. Edit Pipeline profiles composite controller https://github.com/kubeflow/manifests/tree/master/apps/pipeline/upstream/installs/multi-user/pipelines-profile-controller to create a configmap that contains kubeflow-userid as a key, for every profile. For example: "KUBEFLOW_USERID": user_id

  2. Edit template at https://github.com/kubeflow-kale/kale/blob/master/backend/kale/rpc/katib.py to read the environment variables from the new configmap:

- name: {{.Trial}}
  image: {image}
  envFrom:
    - configMapRef:
        name: userid-configmap
 ...
  3. Edit KF pipelines at https://github.com/kubeflow/pipelines/blob/master/backend/api/python_http_client/kfp_server_api/rest.py to read the environment variable and add it to the headers before contacting mlpipeline-service. Add the following lines to the request() function:
if 'kubeflow-userid' not in headers:
    if 'KUBEFLOW_USERID' in os.environ:
        headers['kubeflow-userid'] = os.environ['KUBEFLOW_USERID']

With this we avoid the need for envoyfilter. If there are better solutions for this, I'd love to hear about them.
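
The header logic from step 3 can also be written as a small standalone helper, which makes it easier to test outside of rest.py. This is only a sketch: the function name add_userid_header and the environ parameter are made up for illustration and are not part of kfp_server_api; only the header name and the KUBEFLOW_USERID variable come from the steps above.

```python
import os
from typing import Mapping, Optional

USERID_HEADER = "kubeflow-userid"
USERID_ENV_VAR = "KUBEFLOW_USERID"  # key of the per-profile ConfigMap from step 1


def add_userid_header(
    headers: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a copy of headers with kubeflow-userid filled in from the environment.

    Hypothetical helper: an existing kubeflow-userid header is never overwritten,
    and nothing is added when the environment variable is absent.
    """
    env = os.environ if environ is None else environ
    merged = dict(headers)
    if USERID_HEADER not in merged and USERID_ENV_VAR in env:
        merged[USERID_HEADER] = env[USERID_ENV_VAR]
    return merged
```

Passing the environment as an argument keeps the helper free of global state, so the three cases (variable set, header already present, variable missing) can be checked without touching os.environ.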

d-gol avatar Mar 05 '21 11:03 d-gol

Interesting solution. So you fork and maintain your own libs and images?

What I was thinking was:

  1. Install podDefaults that mount a bearer token into every pod
  2. Use the same podDefaults to expose this bearer token as an env var
  3. Get that env var into the authorization: Bearer <token> http header of the request sent to the pipeline service by the kfp client (if you're using the Arrikto distribution of the Kubeflow Pipelines SDK, setting ML_PIPELINE_SA_TOKEN_PATH will do this for you without code changes)
  4. Run a proxy server in the ml-pipeline-service pod that listens on 8888, parses the bearer token, decodes the JWT and inserts the ns as the kubeflow-userid header before proxying to the ml-service

The only "new" piece of this solution is 4, since 1,2,3 are required to authenticate from a Jupyter notebook to the ml-service anyway.

This is the poddefault

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: access-ml-pipeline
  namespace: $NS
spec:
  desc: Allow access to Kubeflow Pipelines
  env:
  # This is the value of the environment variable put into pods that need
  # access to the ml-pipeline
  - name: ML_PIPELINE_SA_TOKEN_PATH
    value: /var/run/secrets/ml-pipeline/token
  # Pods bearing this label will get the env configuration above, as well as
  # the volumeMounts below.
  selector:
    matchLabels:
      access-ml-pipeline: "true"
  volumeMounts:
  - mountPath: /var/run/secrets/ml-pipeline
    name: volume-ml-pipeline-token
    readOnly: true
  volumes:
  - name: volume-ml-pipeline-token
    projected:
      sources:
      # This service account is managed by the kubelet transparently. Hence it's
      # left nameless.
      - serviceAccountToken:
          audience: ml-pipeline
          expirationSeconds: 99999
          # This is the path under the mountPath the token is copied into
          path: token

E.g., the decoded bearer token should look something like this:

{
  "aud": [
    "ml-pipeline"
  ],
  "exp": 1615044345,
  "iat": 1614944346,
  "iss": "https://kubernetes.default.svc.cluster.local",
  "kubernetes.io": {
    "namespace": "<ns>",
    "pod": {
      "name": "<pod name>"
    },
    "serviceaccount": {
      "name": "pipeline-runner"
    }
  },
  "nbf": 1614944346,
  "sub": "system:serviceaccount:<your ns>:pipeline-runner"
}

So you can parse the namespace out of that sub field and get the user. Actually, I believe even passing the entire sub string to the ml-pipeline-service will work, since it is just going to ask the Kubernetes API server whether that subject can access the pipeline (i.e., run something like kubectl auth can-i <verb> <resource> -n $NS --as system:serviceaccount:<your ns>:pipeline-runner).
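
As a rough illustration of the parsing step, here is a minimal Python sketch that decodes the JWT payload and pulls the namespace out of the sub claim. The function names are hypothetical, and the sketch deliberately does NOT verify the token signature; a real proxy would have to validate the token (for example through the Kubernetes TokenReview API) before trusting any claim.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload_b64 = parts[1]
    # JWT segments are base64url-encoded with the padding stripped; restore it.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def namespace_from_claims(claims: dict) -> str:
    """Extract <ns> from a sub claim shaped like system:serviceaccount:<ns>:<name>."""
    fields = claims["sub"].split(":")
    if len(fields) != 4 or fields[:2] != ["system", "serviceaccount"]:
        raise ValueError("sub claim is not a service account subject")
    return fields[2]
```

The returned namespace is what the proxy in step 4 would place in the kubeflow-userid header, under the assumption that the profile name matches the namespace.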

That said, I haven't got this working yet. There's some issue with my admission controller not adding the required poddefaults to the pods created by the Katib Job that I still need to figure out.

I believe this issue is proposing something similar, at a high level: https://github.com/kubeflow/pipelines/issues/5138

The appeal of my solution (should I get it to work) is that I don't need to maintain my own forks of the code. @dejangolubovic how much of a management overhead is this? How often do you rebase? Did you try just using envoy with Katib and seeing what happens?

prashanthb-ai avatar Mar 05 '21 12:03 prashanthb-ai

@prashanthb-ai, very interesting. It is an elegant solution, I like that it does not require additional manual setup. Hope you get it working soon.

For our setup, yes, we maintain our own Kubeflow manifests repo. The idea is to isolate upstream components from our own patches, in order to perform quick updates when necessary. We will open the setup publicly and present it once we reach a level of automation we are satisfied with.

About the overhead, the composite controller (step 1 from my solution) does not require any additional setup. Once it's configured to create a configmap, it will do it for every profile and we don't have to intervene.

Steps 2 and 3 are a bit different. We maintain an image with specific versions of Kale and kfp, which will require work once we update to newer versions. We plan to perform updates with every release of Kale and kfp. This will require merging our patch periodically, unless we find a better solution or it's implemented upstream.

About envoyfilter with Katib, I tried and it did not work. Sidecar injection is disabled on Katib pods, so envoyfilter is never created. Using envoyfilter does work with Kale though, but only when submitting pipelines, not Katib jobs. Kale submits pipelines directly from the notebook's pod, which has an istio sidecar and can utilize envoyfilter, unlike Katib pods.

d-gol avatar Mar 05 '21 13:03 d-gol

Very curious to see your setup. I think it's great that you're planning to present it, open-source it, and keep it updated in line with Kale and kfp.

I'm also starting to get the sense that to use these components reliably one needs to fork and rebase in lock-step with individual component releases. There's too much fragmentation otherwise.

@dejangolubovic by disabled you mean the sidecar.istio.io/inject=false annotation right? I'm wondering why it's set to false in the first place. Sounds like it is not necessary: https://github.com/kubeflow/katib/issues/1374#issuecomment-721102947 and that the original problem: https://github.com/kubeflow/katib/issues/955 has several potential solutions: https://github.com/istio/istio/issues/11659#issuecomment-462364211. However, it seems to be currently hardcoded by the Kale-Katib template. I think I'll open and X-ref an issue in Kale.

prashanthb-ai avatar Mar 05 '21 15:03 prashanthb-ai

@prashanthb-ai, yes, I meant only this "disable sidecars" annotation. Great, thank you, I wasn't aware of these solutions and that it may be possible to run Katib jobs without disabling istio. I will definitely try our setup using only envoyfilter and without disabling sidecars. I will let you know how it goes.

It is a good idea to open the issue at Kale; making the Katib job template configurable would make things easier. Let's see if there are proposals.

d-gol avatar Mar 08 '21 13:03 d-gol

Hi everyone!

Although this is the Katib repository, I'd like to inform you that we (Arrikto) have exposed our design for KFP client authentication in https://github.com/kubeflow/pipelines/issues/5138 and submitted two PRs implementing it: https://github.com/kubeflow/pipelines/pull/5286, https://github.com/kubeflow/pipelines/pull/5287

Feel free to test and provide feedback :)

elikatsis avatar Mar 12 '21 16:03 elikatsis

I would add that this is a pervasive issue with Kubeflow. I have a similar issue with the header, but not on Katib. In my case I am trying to move the envoy filter from the ingressgateway to a SIDECAR_INBOUND context. When you do this, everything works fine except for services that require the kubeflow-userid for access. In my case JWA fails to load properly because the header is missing. I think Kubeflow by default needs to inject this header, or at least expose an option in the KF Manifests to inject this header into all requests after the user has been authenticated. @dejangolubovic's approach does what I had in mind, except that I would rather not modify the Kubeflow source code; I was thinking more or less of injecting a script after authentication and adding the header from there. This is a legitimate issue because the current Kubeflow authentication flow severely limits the cluster's usability for many of its users. I.e., not everyone has a dedicated cluster for Kubeflow; we have other services running on the cluster and Kubeflow auth breaks other scenarios.

I have an open issue on this but it is not getting traction. https://github.com/kubeflow/manifests/issues/1834

aaron-arellano avatar Apr 28 '21 00:04 aaron-arellano

@aaron-arellano as an intermediate hack, I ended up just disabling header based auth with a sidecar on the ml-pipeline: https://gist.github.com/prashanthb-ai/dd85348b6752f2ef3385947c32a50784

Of course, I've only deferred the problem until someone or I have time to figure out a better solution.

Also note that the problem here is that we're locked out from user-id envoy injection because of the hardcoded annotation in Katib.

In your case, I'm wondering if it's possible to route all company traffic through an uber-ingress that routes, say, Host: kubeflow.yourcompany.org to the istio-system/istio-ingressgateway service? On my cluster the ingressgateway is just a NodePort service, not an actual ingress.

prashanthb-ai avatar Apr 28 '21 02:04 prashanthb-ai

@prashanthb-ai I tried your method, but it did not work.

Bowen0729 avatar Jun 10 '21 11:06 Bowen0729

@Bowen0729 what did you try? disabling the header auth by routing through a hack proxy as indicated in my previous comment? If so, you need to ensure the proxy is injecting the expected header. The userid header needs to match your profile: https://www.kubeflow.org/docs/components/multi-tenancy/getting-started/#automatic-creation-of-profiles.

You can debug by trying curl commands from a pod -> ml service. eg:

curl ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/experiments -X POST -d '{"name": "katib-6b6coz", "resource_references": [{"key": {"type": "NAMESPACE", "id": "kubeflow-user"}, "relationship": "OWNER"}]}' -H "authorization: Bearer <insert your bearer token - this is probably the token mounted at /var/run/secrets/ml-pipeline/token and/or the value of $ML_PIPELINE_SA_TOKEN_PATH>"

And see how/if it passes.
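
For anyone who prefers to debug from a Python shell inside the pod, the same request can be sketched with the standard library. This is a rough equivalent of the curl command, not a tested client: the helper names are made up, the URL, body, and token path are the ones quoted in this thread, and no Content-Type header is set, which mirrors the curl call above.

```python
import json
import os
import urllib.request

DEFAULT_TOKEN_PATH = "/var/run/secrets/ml-pipeline/token"
EXPERIMENTS_URL = "http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/experiments"


def read_token(environ=None) -> str:
    """Read the projected service account token from ML_PIPELINE_SA_TOKEN_PATH."""
    env = os.environ if environ is None else environ
    path = env.get("ML_PIPELINE_SA_TOKEN_PATH", DEFAULT_TOKEN_PATH)
    with open(path, encoding="utf-8") as token_file:
        return token_file.read().strip()


def build_experiment_request(token: str, namespace: str, name: str,
                             url: str = EXPERIMENTS_URL) -> urllib.request.Request:
    """Build (but do not send) the POST that creates an experiment in a namespace."""
    body = {
        "name": name,
        "resource_references": [
            {"key": {"type": "NAMESPACE", "id": namespace}, "relationship": "OWNER"}
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )
```

To actually send it, pass the result of build_experiment_request(read_token(), "kubeflow-user", "katib-6b6coz") to urllib.request.urlopen and inspect the status code. A 401 or 403 raised as urllib.error.HTTPError points at the token or the RBAC setup rather than at Katib itself.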

prashanthb-ai avatar Jun 11 '21 03:06 prashanthb-ai

@prashanthb-ai Thank you for your reply

Actually, I came here from another issue: https://github.com/kubeflow-kale/kale/issues/276#issuecomment-792287143

I have an error similar to the one in that issue. I run a Katib job through Kale, but I get this error in the pod (image: katib-kfp-trial:a7f7bb79-d9bf99ac):

File "/usr/local/lib/python3.6/site-packages/kale/common/katibutils.py", line 152, in create_and_wait_kfp_run
  kwargs
......
File "/usr/local/lib/python3.6/site-packages/http/client.py", line 1264, in putheader
  if _is_illegal_header_value(values[i]):
TypeError: expected string or bytes-like object

When I execute create_and_wait_kfp_run() in my notebook it runs fine, which confuses me.
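
The TypeError in the traceback above is what Python's http.client raises from putheader when a header value is neither a string nor bytes, for instance None. One possible explanation, which is an assumption and not something confirmed in this thread, is that a credential-derived header never gets a value inside the trial pod while it does inside the notebook. A small, hypothetical pre-flight check like the one below can show which header is affected before the request is sent.

```python
from typing import Any, List, Mapping


def find_invalid_header_values(headers: Mapping[str, Any]) -> List[str]:
    """Return the names of headers whose values are not str, bytes, or int.

    http.client.putheader accepts these three types; anything else, such as
    None, reaches its header-validation regex and triggers a TypeError.
    """
    return sorted(
        name
        for name, value in headers.items()
        if not isinstance(value, (str, bytes, int))
    )
```

Running the check on the headers the client is about to send, once in the notebook and once in the trial pod, should reveal whether the two environments differ.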

Bowen0729 avatar Jun 11 '21 05:06 Bowen0729

Hello, my problem is the same as yours. Have you solved it?

longpi1 avatar Sep 27 '21 01:09 longpi1

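No. I turned off Kubeflow's multi-user mode entirely, and then it worked. It seems that multi-user support is not fully wired up across Katib, Kale, and Pipelines, which causes a lot of baffling problems.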

Bowen0729 avatar Sep 28 '21 01:09 Bowen0729

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 03 '22 21:01 stale[bot]

@Bowen0729 @longpi1 Did you solve the issue?

When I first tried to run a Katib job with Kale, the job errored with the following message:

Error creating: pods "XXX-e80p0-pklj65dv-" is forbidden: error looking up service account test-user/pipeline-runner: serviceaccount "pipeline-runner" not found

The pipeline-runner service account only exists in the main "kubeflow" namespace, not in the "test-user" namespace. So I created a new service account "pipeline-runner" in the "test-user" namespace, which finally led to the creation of the trial pods.

However, now I get the exact same error: TypeError: expected string or bytes-like object. I have set up multi-user mode with PodDefault access and injection of KF_PIPELINES_SA_TOKEN_PATH. It seems that the pods are not allowed to authenticate.

Did anyone figure out how to solve it with Katib jobs? Normal pipeline runs launched from the notebook work fine. It does not work with Katib enabled.

drawesomenic avatar Jan 04 '22 20:01 drawesomenic

I turned off multi-user mode.

Bowen0729 avatar Jan 05 '22 01:01 Bowen0729

OK, I was actually able to solve the issue now and run AutoML in a multi-user setting:

  1. Create a PodDefault as described here, but add an additional environment variable, since Arrikto uses a different variable name (as described above by @prashanthb-ai):
  env:
    - name: KF_PIPELINES_SA_TOKEN_PATH
      value: /var/run/secrets/kubeflow/pipelines/token
    - name: ML_PIPELINE_SA_TOKEN_PATH
      value: /var/run/secrets/kubeflow/pipelines/token
  2. Then simply create a new ServiceAccount named "pipeline-runner" in the Kubeflow user namespace.
  3. Bind the "kubeflow-edit" ClusterRole to the newly created ServiceAccount (note: creating a new "pipeline-runner" role like the one in the kubeflow namespace and binding it did not work for me):
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app.kubernetes.io/component: ml-pipeline
    app.kubernetes.io/name: kubeflow-pipelines
    application-crd-id: kubeflow-pipelines
  name: pipeline-runner-binding
  namespace: kubeflow-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeflow-edit
subjects:
- kind: ServiceAccount
  name: pipeline-runner
  namespace: kubeflow-user
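
If there are several user namespaces, steps 2 and 3 can be generated per namespace. The sketch below is only an illustration, with a made-up function name; it builds the ServiceAccount and the RoleBinding as plain dictionaries and serializes them as a JSON List. The resource names and the kubeflow-edit ClusterRole are the ones used in this comment, and the labels from the manifest above are left out for brevity.

```python
import json


def pipeline_runner_manifests(namespace: str,
                              service_account: str = "pipeline-runner",
                              cluster_role: str = "kubeflow-edit") -> list:
    """Build the ServiceAccount and RoleBinding for one user namespace."""
    account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": service_account, "namespace": namespace},
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{service_account}-binding", "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": cluster_role,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
    }
    return [account, binding]


def to_json_list(manifests: list) -> str:
    """Wrap manifests in a v1 List so they can be applied in one step."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2)
```

Writing the output of to_json_list(pipeline_runner_manifests("kubeflow-user")) to a file and applying it with kubectl apply -f should be equivalent to creating the two objects by hand.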

I hope it helps anyone trying to run AutoML with Katib and Kale in a multi-user setting.

drawesomenic avatar Jan 05 '22 19:01 drawesomenic

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '22 06:04 stale[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Sep 13 '23 10:09 github-actions[bot]