Nomad Plugin / Driver

Open fkorotkov opened this issue 3 years ago • 5 comments

Tart itself is just a virtualization solution with some nifty features like ability to store images remotely in OCI registries.

But some companies need to create a cluster of Mac hosts, similar to what Anka does with its Controller. For this use case, one can implement a Nomad driver that provides isolation via Tart.

fkorotkov avatar Aug 26 '22 13:08 fkorotkov

oh, this would be wonderful. I am looking forward to this. Currently I am writing my own tool which sorta does what this would do, except my solution is bad and Nomad support would be very good.

naikrovek avatar Aug 31 '22 13:08 naikrovek

@naikrovek BTW what's your use case if I may ask? Are you planning to run long living VMs?

fkorotkov avatar Aug 31 '22 14:08 fkorotkov

no, just CI/CD stuff. I want a fresh VM each time and I want the VM destroyed when it is done doing its work. I have a shell script & Go program which accomplish this today and having something that can scale up and down as need comes & goes would be very nice. Maybe Nomad isn't the proper solution for this, I don't know. I'm pretty new to all of this.
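
A minimal sketch of that "fresh VM per job, destroyed afterwards" lifecycle, for illustration only (this is not the shell script or Go program mentioned above). It assumes the `tart clone` and `tart delete` subcommands; the helper name `ephemeral_vm`, the `ci-` name prefix, and the base image name are hypothetical. The command runner is injectable so the flow can be exercised without a Mac host.

```python
import subprocess
import uuid
from contextlib import contextmanager
from typing import Callable, Iterator, List, Optional


def _run(cmd: List[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


@contextmanager
def ephemeral_vm(
    base_image: str,
    run: Callable[[List[str]], None] = _run,
    name: Optional[str] = None,
) -> Iterator[str]:
    """Clone a throwaway VM from base_image and always delete it afterwards.

    Starting and stopping the VM is left to the caller, because
    `tart run` stays in the foreground while the VM is up.
    """
    vm = name or f"ci-{uuid.uuid4().hex[:8]}"  # hypothetical naming scheme
    run(["tart", "clone", base_image, vm])
    try:
        yield vm
    finally:
        # Runs even if the job inside the `with` block raised.
        run(["tart", "delete", vm])
```

Used as `with ephemeral_vm("my-base-image") as vm: ...`, the clone is removed whether the job succeeds or fails, which is the property a CI worker cares about most.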

naikrovek avatar Aug 31 '22 14:08 naikrovek

Got it. In our case, each M1 Mac Mini host is configured as a CI worker and simply uses Tart as an isolation mechanism. So we don't have a separate Linux worker that connects to a Tart VM on a remote host to perform CI tasks.

fkorotkov avatar Aug 31 '22 15:08 fkorotkov

If curious, here's the basics of what I'm doing. This is very new code and is still untested.

I don't know anything about Nomad, so I always think it will do what I need. Hard to find good explanations of what it can do outside of containers.

naikrovek avatar Sep 01 '22 15:09 naikrovek

We've looked into a Nomad plugin and it seems too low-level. We have a couple of concerns:

  1. It's not trivial to configure a cluster securely
  2. Operation of a cluster is also not trivial and requires some
  3. Nomad abstractions are limiting. For example, it won't be possible to build functionality like proxying SSH and VNC into VMs.

So we were thinking of creating a separate Tart Cluster piece that specifically targets the use case of clustering a bunch of Apple Silicon hardware to run Tart VMs on it...

And we'd like to hear about your requirements, specifically around usage! Are you planning to use it only via an API from your existing system? Would you also want a web UI where you can not only see the current state but also click to create VMs? If so, do you need SSO/login functionality with different permissions for users?

fkorotkov avatar Dec 07 '22 20:12 fkorotkov

I think all the things you've listed are nice to have. I'm particularly interested in GH Actions autoscaling. We are big fans of GH Actions, and we thought about building our own small cluster of Mac minis/Mac Studios for our iOS team's requirements. While having multiple virtual machines running simultaneously is a solution, I prefer scaling on demand from a particular VM image. The most crucial features are: determining available resources, cloning/creating VMs, starting and stopping them, and executing commands (like scheduling GH Actions). A machine-readable format with some basic, straightforward authentication and node connection would be enough for an MVP.
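
To make the "determining available resources" and "machine-readable format" parts of that MVP list concrete, here is a rough sketch, not a proposal for the actual design. The `Host` type, its `max_vms` and `running_vms` fields, and the JSON shape are all assumptions made up for this example.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    """Hypothetical view of one Mac host in the cluster."""
    name: str
    max_vms: int          # how many VMs this host is allowed to run
    running_vms: int = 0  # how many are running right now

    @property
    def free_slots(self) -> int:
        return max(self.max_vms - self.running_vms, 0)


def pick_host(hosts: List[Host]) -> Optional[Host]:
    """Return the host with the most free slots, or None if all are full."""
    candidates = [h for h in hosts if h.free_slots > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.free_slots)


def cluster_state(hosts: List[Host]) -> str:
    """Serialize the cluster state as JSON for API consumers."""
    return json.dumps(
        [{"name": h.name, "free_slots": h.free_slots} for h in hosts],
        sort_keys=True,
    )
```

A scheduler for CI jobs could call `pick_host` before cloning a VM and queue the job when it returns `None`.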

dniHze avatar Dec 11 '22 23:12 dniHze

@dniHze we did pretty much the same for Cirrus Runners and we'll try to incorporate this knowledge into Orchard (that's what we started calling this project during our internal discussions).

Working on Cirrus Runners uncovered a few caveats about the GitHub API for auto-scaling runners. We even had to implement some health checking, so Cirrus Runners actively calls the GitHub API to see if an agent has become idle or hung.
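
As a rough illustration of what such a health check might decide (this is not how Cirrus Runners implements it), the sketch below classifies one runner record. It assumes a record with `status` and `busy` fields, as returned by GitHub's self-hosted runners REST endpoints; the `busy_since` timestamp, the `max_job_seconds` threshold, and the rule that an offline or overlong-busy runner counts as hung are hypothetical choices for this example.

```python
from typing import Optional


def classify_runner(
    runner: dict,
    busy_since: Optional[float],
    now: float,
    max_job_seconds: float = 6 * 3600,  # hypothetical cut-off
) -> str:
    """Classify a runner as 'idle', 'busy', or 'hung'.

    runner:     dict with 'status' ('online'/'offline') and 'busy' (bool)
    busy_since: locally tracked time the runner was first seen busy
    now:        current time, in the same units as busy_since
    """
    if runner.get("status") != "online":
        # Assumption: a runner that dropped offline is treated as hung.
        return "hung"
    if not runner.get("busy", False):
        return "idle"
    if busy_since is not None and now - busy_since > max_job_seconds:
        # Busy for longer than any job should take.
        return "hung"
    return "busy"
```

An autoscaler could recycle the VM behind a runner reported as `hung` and scale down VMs whose runners stay `idle`.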

Hopefully we'll have something you can try in early February. In the meantime feel free to try out Cirrus Runners. 😉

fkorotkov avatar Dec 20 '22 19:12 fkorotkov

That sounds great, will be happy to give it a go if you need some actual dogfooding later.

In regard to your suggestions, I have a few concerns about Cirrus Runners. They are not really related to the ticket, but I think they might be useful:

  1. Lack of on-demand scaling and sizing. BuildJet would be a perfect example of something we would love to see implemented for Cirrus Runners. They give you up to 64 concurrent runners, but they charge you per execution time. It might not be as cost-effective, but it gives you more freedom to parallelise your checks. For most teams, feedback time is the most important part of a workflow.
  2. Lack of pre-setup simulators and platforms. Downloading and setting up simulators on fresh images will increase setup time. GH images are almost perfect, but they are also heavy.

Hope that helps!

dniHze avatar Dec 21 '22 16:12 dniHze

@dniHze thank you for the feedback!

For the 1st point, we have Cirrus CI's macos_instances, where we have per-second billing with unlimited parallelism. With Cirrus Runners we are experimenting with a different pricing model, since it's more familiar to mobile teams and makes it easier to project your spending for management.

I would also like to hear more about the 2nd point: what are you missing from the standard image? We deliberately tried to start lean and slowly grow the images. The GitHub-provided images seemed too overloaded to us, and it seemed unreasonable to try to replicate them.

fkorotkov avatar Dec 28 '22 19:12 fkorotkov

We are not proceeding with the Nomad option since we found some blockers and, in general, the setup was too complex and not future-proof for things like USB device passthrough. Closing this issue in favor of https://github.com/cirruslabs/tart/issues/372

fkorotkov avatar Dec 28 '22 20:12 fkorotkov