K8s runtime

Open mruoss opened this issue 1 year ago • 13 comments

Alright, @josevalim asked me if I wanted to contribute a K8s runtime, so here it comes. It took me ages to build a UI with a few forms on it!!! 🤣

A few notes

Connection and Authentication

When working with Kubernetes you usually never see connection and authentication details. Instead, you use your cloud provider's helper, which adds all the info to your user's Kubernetes configuration YAML file - on Unix systems that's ~/.kube/config. I therefore decided to rely on an ENV variable pointing to such a kubeconfig file instead of prompting for credentials. If users want an ad-hoc way of connecting to a cluster, I can add that in a later PR. But I'd rather let people use it first and gather some feedback.
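For illustration, a minimal sketch of the kind of kubeconfig file such an ENV variable could point to (cluster name, server URL and credentials are placeholders, not anything the runtime generates):

```yaml
# Minimal kubeconfig sketch; all names, the server URL and the token are placeholders.
apiVersion: v1
kind: Config
clusters:
  - name: my-cluster
    cluster:
      server: https://my-cluster.example.com
      certificate-authority-data: <base64-encoded-ca-cert>
users:
  - name: my-user
    user:
      token: <api-token>
contexts:
  - name: my-context
    context:
      cluster: my-cluster
      user: my-user
current-context: my-context
```

kubectl itself honors the standard KUBECONFIG environment variable for the same purpose, so most users already have such a file in place.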

Specs

Kubernetes Pod specs contain so many fields. I only allow configuring the ones I think are most important, especially for running on GPU nodes (node selector, tolerations, resources, persistent volume). For FLAME I support a BYO Pod manifest. This might be something to add later so that people can paste their YAML instead of filling out the form. We would then only override the image and add the required ENV vars.
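As a rough sketch of where those fields live in a Pod manifest (image, labels and values below are placeholders, not what the runtime actually generates):

```yaml
# Pod manifest sketch covering the form fields mentioned above; all values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: livebook-runtime
spec:
  nodeSelector:                      # pin the pod to GPU nodes
    accelerator: nvidia
  tolerations:                       # allow scheduling onto tainted GPU nodes
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: livebook-runtime
      image: my-livebook-image:latest
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:         # persistent volume via an existing claim
        claimName: my-livebook-data
```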

UI

Yeah... please somebody make that form more beautiful! 😜 No seriously, I think it's ok for now but I'm sure somebody could do a better job.

[Screenshot: the K8s runtime configuration form, 2024-08-19]

Ready for Review

I'm still gonna run some tests (in-cluster and error handling for missing permissions and other non-happy cases) and maybe commit some fixes but feel free to review what's there.

mruoss avatar Aug 19 '24 19:08 mruoss

Thank you @mruoss! The PR looks great. I think the biggest questions are exactly the ones you raised in the issue: authentication and specifications.

About authentication, the most likely scenario is that I am using my desktop and I want to connect to a k8s cluster my company runs. Would this also be done via ~/.kube/config?

About the specs, I think replicating it in the UI is just too much hassle. I can think of two options:

  1. Templates? Is it possible to specify templates within k8s? If so, maybe we could simply list all deployments and we use the template in that deployment? Then we can provide some documentation on how to create a default deployment.

  2. Another option is to just have a text area with the YAML. We can include a default YAML and just use that. Later we can even make it use the CodeMirror editor. I think this may provide the better experience. We can simplify all of the code, users can copy and paste, and they can make it as complex or as simple as they need it to be.

josevalim avatar Aug 19 '24 21:08 josevalim

About authentication, the most likely scenario is that I am using my desktop and I want to connect to a k8s cluster my company runs. Would this also be done via ~/.kube/config?

If the company cluster's API server is reachable, then it is technically possible, yes. But there are some caveats, as I mentioned earlier. Kubernetes supports several mechanisms of authentication. Some of these mechanisms (exec) require a shell for a user that is logged in to the cloud provider (e.g. GCP). These auth mechanisms won't work for, say, Livebook Desktop. They only work if you run Livebook from the CLI as the user that's logged in to the cloud provider.

There's a workaround for this use case, though. If the user has enough permissions, they can create a K8s Service Account (SA) or use an existing SA and make sure it has all the required permissions (Role and RoleBinding). They can then create an API token for the SA and add new cluster, user and context entries to ~/.kube/config or create a new, dedicated kubeconfig file. This is not very difficult, but you have to know what you're doing. I could create a mix task that does all of this. But then again, when Livebook runs as the Desktop version, there are no mix tasks to the rescue.
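A sketch of what that setup could look like as manifests (names, namespace and the exact RBAC rules are illustrative; the real permission set depends on what the runtime needs):

```yaml
# Illustrative Service Account setup; names, namespace and rules are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: livebook-runtime
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: livebook-runtime
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: livebook-runtime
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: livebook-runtime
subjects:
  - kind: ServiceAccount
    name: livebook-runtime
    namespace: default
---
# Long-lived API token for the SA; `kubectl create token livebook-runtime` works as a
# short-lived alternative on recent clusters.
apiVersion: v1
kind: Secret
metadata:
  name: livebook-runtime-token
  namespace: default
  annotations:
    kubernetes.io/service-account.name: livebook-runtime
type: kubernetes.io/service-account-token
```

The resulting token then goes into a user entry of the kubeconfig file, as in the sketch further up.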

To come back to your initial question: I think most, if not all, auth mechanisms are supported by Kubereq via the kubeconfig file. But creating UIs for all of them (API token, client cert auth, ...) would be a lot of work. That's the reason I decided to go with the file. It's just the "creating the file" part that's difficult.

I hope that makes things clear.

About the specs, I think replicating it in the UI is just too much hassle. I can think of two options:

Are you proposing to remove the form completely and only offer a BYO manifest version? Or leave the simple form and offer a more sophisticated alternative?

  1. Templates? Is it possible to specify templates within k8s? If so, maybe we could simply list all deployments and we use the template in that deployment? Then we can provide some documentation on how to create a default deployment.

It is a nice thought. "Deployment" is the wrong term here, because a Deployment actually schedules Pods to run. But I think there is a PodTemplate resource we could use. Users would then have to create a PodTemplate and apply it to the cluster before creating the runtime. I'll dig in a bit.
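For reference, PodTemplate is a plain core/v1 resource whose template field holds a regular Pod spec; a minimal sketch (name, label and image are placeholders):

```yaml
# Minimal PodTemplate sketch; name, label and image are placeholders.
apiVersion: v1
kind: PodTemplate
metadata:
  name: livebook-runtime
template:
  metadata:
    labels:
      app: livebook-runtime
  spec:
    containers:
      - name: livebook-runtime
        image: my-livebook-image:latest
```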

  2. Another option is to just have a text area with the YAML. We can include a default YAML and just use that. Later we can even make it use the CodeMirror editor. I think this may provide the better experience. We can simplify all of the code, users can copy and paste, and they can make it as complex or as simple as they need it to be.

Yup, that's the thought I had as well. Both options would require users to also care about PVCs. Mixing UI/form elements with BYO is also going to be too much of a hassle...

Let me give the options a bit more thought....

mruoss avatar Aug 20 '24 11:08 mruoss

To come back to your initial question: I think most, if not all, auth mechanisms are supported by Kubereq via the kubeconfig file. But creating UIs for all of them (API token, client cert auth, ...) would be a lot of work. That's the reason I decided to go with the file. It's just the "creating the file" part that's difficult.

To be clear, I actually like the file approach. The less we have to do, the better. :) I was just wondering how users would get that file in non-conventional cases.

Are you proposing to remove the form completely and only offer a BYO manifest version? Or leave the simple form and offer a more sophisticated alternative?

Correct. I would get rid of the form altogether. Almost everyone using k8s will be familiar with the YAML config. We can support either YAML, PodTemplates, or both. For the PVC, we could still manage it, and either dynamically modify the YAML before applying it OR just change the default YAML we show based on your PVC config.
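To illustrate the PVC part (claim name and size are made up): the managed piece would be a PersistentVolumeClaim, and the default pod YAML shown to the user would simply reference it:

```yaml
# Illustrative managed PVC; the pod YAML then references it by claimName.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: livebook-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
# ...and the fragment in the pod spec that points at it:
# volumes:
#   - name: data
#     persistentVolumeClaim:
#       claimName: livebook-data
```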

josevalim avatar Aug 20 '24 13:08 josevalim

I would get rid of the form altogether.

You've got to be kidding me! There is blood on this form! 😂 Just kidding. It will be much easier to maintain this way. I'll try to find a good way to support YAMLs.

mruoss avatar Aug 20 '24 17:08 mruoss

It will be much easier to maintain this way. I'll try to find a good way to support YAMLs.

Feel free to start with just the textarea and we can do the rest. :)

josevalim avatar Aug 20 '24 18:08 josevalim

Okay, it's just a plain monospace textarea now. Using the CodeMirror editor instead would of course be amazing! But I haven't looked into that yet. I can definitely also add a "load PodTemplate from cluster" functionality, although I'm not sure it really adds value for anybody.

mruoss avatar Aug 20 '24 20:08 mruoss

@mruoss Fantastic. I think I am happy with most high-level decisions, so we can start jumping into the code review. :)

josevalim avatar Aug 21 '24 08:08 josevalim

Okay. I have one last thing on my mind/list: I think I need to make the timeout (how long to wait for the Pod to get scheduled and start running) configurable. In some cases this can take minutes (e.g. downloading the Docker image alone can take a minute). Also, the default probably has to be higher than 15s.

mruoss avatar Aug 22 '24 21:08 mruoss

Perhaps we just make the timeout arbitrarily high and allow the user to press the stop button if they want to give up?

josevalim avatar Aug 22 '24 21:08 josevalim

Or that... and find a way to show the event stream (e.g. "downloading image...").

mruoss avatar Aug 22 '24 21:08 mruoss

I've pushed a few commits which make the whole feature more robust (checks for all required permissions, error messages in the UI, a Pod readiness check as it was flaky before, etc.).

I have also added some code that watches events on the runtime Pod during startup and sends them to the "runtime connect info". This way the user sees what they are used to from Kubernetes (scheduling problems, pending image pulls, ...). The commit is definitely something to review: 3600f51fb3db5c7abc6d4a37c1ace2dc525257cf

Now I am okay with the feature all in all, with the following exception:

I have set the Pod startup wait timeout to 24 hours now. However, I didn't really figure out a good way to abort the connection. What I have implemented now is that the runtime updates (send_update/3) the runtime LV component and assigns the pod_name. If that variable is non-nil in the runtime component, it renders an "abort connection" button which deletes the Pod. See 951bfd602592e1f5d6951212727e44f65e7e1fed.

I don't like the solution too much because of the messaging back to the LV component, and also because the user gets an error message saying that the Pod was deleted... So if there's a better way to abort the connection, let me know.

Successfully tested:

  • Local Livebook -> K8s Runtime -> K8s FLAME runner
  • Livebook running in K8s cluster -> K8s Runtime (dedicated pod) -> K8s FLAME runner

EDIT: updated the comment after pushing the commits below.

mruoss avatar Aug 24 '24 19:08 mruoss

Thanks for the review @jonatanklosko

  1. It would be great if we could add an integration test (against a local cluster) for the runtime itself, similar to test/livebook/runtime/fly_test.exs. This way we have an easy way to assert all the connection and communication, which is the most important. It would only run when TEST_KUBECONFIG or similar is set. I would appreciate if you could also add a comment with minimal steps necessary for a matching local k8s setup.

Oh yes, absolutely agree. For flame_k8s_backend I have implemented the setup_all/N callback to check for a local Kind cluster and create one if it's not there, etc. It is super handy for me because I just need to run mix test --include integration and the env is set up for me. Not sure, though, if you want to go that far here?

I can document K8s dev setup using markdown or by implementing a mix task - up to you.

  2. We should have a test for the UI. For this one we want to stub all the requests. Similar test for Fly here.

Sure, will do. I just wanted to get to an agreement on the concept / basic implementation first.

  3. All public functions in the codebase should have docs and specs. It may not feel necessary in all cases, but I think sticking to this rule makes the codebase more accessible.

Yeah, no problem; on my libs I even go further than this.

  4. We can definitely share some code with the Fly runtime, but I will do that separately!

👍


Other than that: I went through your comments, which make sense. I have commented where I see a need for discussion / explanation.

mruoss avatar Sep 06 '24 19:09 mruoss

For flame_k8s_backend I have implemented the setup_all/N callback to check for a local Kind cluster and create one if it's not there, etc.

Sounds great!

jonatanklosko avatar Sep 07 '24 17:09 jonatanklosko

@jonatanklosko I've added an integration test now. I've also added it to the CI because I think this really helps to catch regressions. However, it does slow down the pipeline quite a bit (40-50s to install Kind and create the cluster, plus >1min to run the test, as in setup_all I have to build the Docker image and load it into the cluster). WDYT?

mruoss avatar Sep 08 '24 18:09 mruoss

@mruoss honestly I think it's fine to skip it on CI. The tested code is rather encapsulated, so it should be evident when we want to run these tests. The Fly runtime tests are already opt-in, because they require a token. This is definitely a trade-off, but I think slowing the CI feedback by 1min+ is not worth it :)

jonatanklosko avatar Sep 09 '24 06:09 jonatanklosko

OK, I will remove it from CI again. We can always put it back later, possibly in a different pipeline that runs under different conditions.

mruoss avatar Sep 09 '24 11:09 mruoss

I think it's only the UI tests left now. I have to find a good way to mock the kubeconfig setup first. Mocking the req should then be straightforward...

mruoss avatar Sep 09 '24 20:09 mruoss

@mruoss honestly I think it's fine to skip it on CI.

@jonatanklosko It looks like kind is installed on those GH runners, which makes the test run automatically. Shall I add --exclude k8s to mix test in the CI job?
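For illustration, that would be a one-flag tweak to the CI test step; a sketch (the step name and workflow layout are hypothetical, the actual workflow file may differ):

```yaml
# Hypothetical excerpt of a GitHub Actions test step; the real workflow may differ.
- name: Run tests
  run: mix test --exclude k8s
```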

mruoss avatar Sep 09 '24 20:09 mruoss

@mruoss I made some small adjustments to the form. I grouped the image picker and the template under the same Pod section and moved storage to the bottom, since it's more of an addition to the template. Let me know if this makes sense :)

jonatanklosko avatar Sep 10 '24 10:09 jonatanklosko

@jonatanklosko @josevalim I've added integration and UI tests and feel pretty confident now regarding regressions. So I think this is ready for a final review. I guess the changelog entry will be created on the 1.14 branch?

Sorry for taking so long, but I didn't have that much time to work on it. That being said, I really, really enjoyed working on this and refreshing my LiveView knowledge a little. There's been a lot of progress since the last time I worked on a production LV project and oh man... the whole framework feels so nice! Even writing tests, mocking calls, etc. It all feels clear and easy and I had so much fun.

edit: sorry, found a little refactoring in the back of my head.

mruoss avatar Sep 13 '24 19:09 mruoss

@mruoss thank you! I write the changelog when releasing, so no need to worry about that. Great that you enjoyed working on this and the timing was great, so no worries :)

jonatanklosko avatar Sep 14 '24 13:09 jonatanklosko