Compute Drivers
This issue was previously named "engine drivers", but "compute drivers" is proving more clear.
Problem
The Dagger CLI has a builtin "compute driver": a software interface which allows it to install, run and manage the Dagger Engine as a container. This compute driver cannot be customized or swapped out: it always uses the docker CLI. Users who need to customize how or where the engine container is run simply cannot use it. Instead they must manage the engine container themselves, which is cumbersome and will only get worse as the engine becomes more ephemeral.
See also:
- https://github.com/dagger/dagger/issues/6486
- https://github.com/dagger/dagger/issues/5484
Solution
Allow users of the Dagger CLI to swap out the default compute driver, by either selecting a builtin alternative, or loading their own.
Use cases
This is a non-exhaustive list of known use cases for compute drivers:
Alternative container runtimes
The builtin compute driver invokes the docker CLI. A common request is to run Dagger without a hard dependency on docker. Alternatives that have been requested include:
- Podman
- Nerdctl
- Kubernetes
- Firecracker
Remote runner machines
An increasingly popular production architecture is to run the Dagger Engine on remote machines, and configure the CLI to connect to them remotely. Currently this requires an experimental environment variable: `_EXPERIMENTAL_DAGGER_RUNNER_HOST`. Compute drivers could be used to stabilize this feature: each driver would be responsible for discovering the remote compute, connecting to it, and establishing a secure channel to it.
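For reference, today's experimental escape hatch looks roughly like this (the address and the pipeline entrypoint below are placeholders):

```sh
# Point the CLI at an already-running remote engine instead of letting it
# auto-provision one through docker. The scheme depends on how the engine
# is exposed (e.g. tcp://, unix://, docker-container://<name>).
export _EXPERIMENTAL_DAGGER_RUNNER_HOST=tcp://my-runner.internal:8080

# Subsequent CLI/SDK invocations now run against the remote engine, e.g.:
dagger run go run ./ci/main.go
```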
Developing the Dagger Engine or CLI
When developing the Dagger engine or CLI, running the dev version involves a wrapper script to make sure the Dagger CLI runs the correct version of the engine. This wrapper could be replaced by a special "dev mode" compute driver.
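Purely as an illustration of how thin such a driver could be, a hypothetical dev-mode driver might do little more than the sketch below (the image tag, container name, build step, and the convention of printing a connection string are all assumptions, not an existing interface):

```sh
#!/bin/sh
# Hypothetical "dev mode" compute driver: build the engine from the local
# checkout, run it as a container, and hand the CLI a connection string.
set -eu

ENGINE_IMAGE=localhost/dagger-engine:dev   # assumed tag for a local build
CONTAINER_NAME=dagger-engine-dev

docker build -t "$ENGINE_IMAGE" .                        # assumed build step
docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true
docker run -d --privileged --name "$CONTAINER_NAME" "$ENGINE_IMAGE"

# The CLI would pick this up and connect to the freshly built engine.
echo "docker-container://$CONTAINER_NAME"
```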
Design considerations
Standardize on containerd?
As far as I know, the consensus is to standardize on containerd as the compute interface for running the engine container. In this design, a compute driver would be responsible for giving the CLI access to a containerd daemon which it can use to run engine containers.
Open questions:
- Is the above still the consensus?
- How much work is required for the engine internals to make this possible?
- Do we wish to ship a stopgap version of this feature before the standardization on containerd is complete? And if so, what does that look like?
cc @jedevc @sipsma
FYI @gerhard
Thoughts on the problem
I definitely see the problem.
As soon as the CLI stops managing the Engine, the burden falls on the "ops person" to ensure that the CLI/SDK run against the correct Engine version. This brings further complications, e.g.:
- how do we know what Engine version the CLI needs?
- how do we prevent incompatible versions running together?
Ultimately, this results in poor DX, which I know is super important to us.
Thoughts on the solution
Allowing users to select from various built-in Engine provisioning mechanisms is essential for good DX. docker seems like an important built-in provisioning mechanism - not sure about the default one, but let's put a pin in this for now.
A built-in k8s provisioner is something that I would be excited to contribute.
As for a built-in fly provisioner, this could be a good start: https://github.com/thechangelog/changelog.com/pull/471/files#diff-1fd071d093832bcb6c4a3a4c9d5b3740a0a71a68a42a8a52b422e43e7c84f12c
Thoughts on the design / implementation
I think of this as client-side Dagger pipelines since there is no Engine to run these provisioners in. I can also imagine a world where we have an Engine built into the CLI by default, but I remember us looking at this in the past, and it not being straightforward enough. FTR:
- https://github.com/dagger/dagger/issues/3058
- https://github.com/dagger/dagger/pull/3187
The reason why I strongly believe that we should approach the design as a DAG problem is because:
- resilient setups will require a fall-back mechanism
  - e.g. try `A` first, but fall back to `B` when `A` fails (there is no doubt in my mind that it will at some point)
- some teams will want scatter-gather
  - e.g. provision `A` & `B`, first Engine runtime to complete my pipeline wins, I don't care which one
- we will be able to test against multiple implementations at the same time
- we are already using Dagger in Dagger - (see this private discussion https://github.com/orgs/dagger/discussions/2463#discussioncomment-6121115) - I see a lot of value in being able to do both
- we should be able to test against multiple versions at the same time
- one CLI to rule them all maybe 💍
I have already implemented the first scenario in GitHub Actions YAML, and the DX didn't feel right: https://github.com/thechangelog/changelog.com/blob/6bf40eed0a0cc6400f69b30bf4fe45dc7e419431/.github/workflows/ship_it.yml . This outcome was worth it though: the resulting provisioning DAG is the one that I think I would really enjoy implementing with Dagger one day 💪
In terms of design / implementation, what are your thoughts on a Zenith-based extension @shykes?
:wave: another use case for this discussion: engine autostart once the machine reboots, but before connecting to a VPN, which breaks networking:
https://discord.com/channels/707636530424053791/1138365590290038794
🙌 having a custom engine without the hassle of a custom wrapper would be great.
- one CLI to rule them all maybe 💍
☝️ I would love to see this. This would also solve the issue of the engine container being overridden if you have two different versions of dagger when you're executing a pipeline.
@gerhard I think it's a cool idea to have dagger pipelines to provision the dagger engine, but I think we should avoid making that a core requirement. I think an engine driver should be something you can implement with just a shell script. Then of course, it should also be possible to create an engine driver that under the hood runs dagger pipelines on a bootstrap engine.
I updated the issue with known use cases.
👋 @jedevc
FYI @gerhard @sipsma @jedevc I gave this issue a fresh coat of paint, so we can use it to advance the discussion started in #6486 .
- Renamed "engine driver" to the more clear "compute driver".
- Removed my conversation-starter proposed design, which failed to start the conversation :)
@jedevc I'm picturing this compute driver as squarely standardized on containerd. You mentioned a possible stopgap, that would allow us to ship something to users before the full engine refactoring is complete. Could you share your latest thoughts on this?
Hey, sorry this has sat for so long - I tried to get some stuff kickstarted in #6288, to provide the foundations for this work (I'll try and kick that today, and get it merged soon).
As far as I know, the consensus is to standardize on containerd as the compute interface for running the engine container.
Ok, so sort of. To quote from https://github.com/containerd/containerd/blob/main/SCOPE.md:
containerd is scoped to a single host and makes assumptions based on that fact. It can be used to build things like a node agent that launches containers but does not have any concepts of a distributed system.
There is (by design) no remote containerd API - so we can't just "expose a containerd socket" and speak the containerd API, we need a layer on top of this.
I see a couple of potential ways forward:
- Build our own "shim" layer - create a super simple API that has one core method: "start me a dagger engine with version <X>".
- Use the CRI https://github.com/containerd/containerd/blob/main/docs/cri/architecture.md (how k8s interacts with c8d)
- There's a promise of stability here, since it's used by k8s - but it's really not designed for outside consumers :cry:
- It looks like it can pretty much do everything we want https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/crictl.md
- We'd probably need to do some security machinations here, so we don't just allow clients to mount the entire host system into containers, and cause absolute havoc.
I don't think running the CRI is that much more work tbf, so I'd be tempted to push in that direction.
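To make "running the engine via the CRI" a bit more concrete, this is roughly what the equivalent crictl session looks like (the pod/container config files are placeholders, and a real driver would speak the gRPC API directly rather than shelling out to crictl):

```sh
# Pull the engine image through the CRI runtime.
crictl pull registry.dagger.io/engine:v0.9.8

# pod.json / container.json are hypothetical CRI configs describing the
# sandbox and the (privileged) engine container.
POD_ID=$(crictl runp pod.json)
CTR_ID=$(crictl create "$POD_ID" container.json pod.json)
crictl start "$CTR_ID"
```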
[!NOTE] We'd be using the CRI API from the dagger cli to spawn dagger engines, but still using the containerd API from within buildkit. Potentially, it could be fun to have buildkit implement a "CRI worker" one day, but this maayyyy be tricky, since CRI can be remote (and pretty much all of buildkit assumes a single host - though we need to break this assumption to do clustered buildkit one day anyways).
Before going too far down the CRI idea, we'd need to check that this definitely works, and that it definitely can be used remotely :thinking:
So assuming something like the CRI, what would this look like? An illustrative example:
```
$ dagger-compute-driver --containerd-version=1.7.13 connect
-> send data to stdin to send it to CRI
<- recv data from stdout to receive it from CRI
<- recv data from stderr to get logs from the compute driver
```
Now, we have a gRPC CRI API! We can use this to start/stop containers, pull images, etc.
But, suppose we start a dagger engine container - how do we connect to it?
Potentially, for cases where we know we can create a direct access (like containerd-in-docker), we could do this:
```
$ dagger-compute-driver --containerd-version=1.7.13 access <container-name>
unix:///var/run/dagger-XXXXXXXX.sock
```
The compute driver for docker could start the containerd service, and map all the unix sockets to the host, so we would know how to access it (note, this only works for docker engines running locally!)
For a potential service that had a dagger-compute-driver, this might be implemented over the internet:
```
$ dagger-compute-driver --containerd-version=1.7.13 access <container-name>
tcp://my-dagger-service.com:1234
```
The compute driver for this service would start the containerd service directly on the host, and then the CLI (through the CRI) can ask that the ports would be exposed on that endpoint.
Then, suppose we had a much more complicated case, where we didn't have an access implementation:
```
$ dagger-compute-driver --containerd-version=1.7.13 access <container-name>
(exit status code 1)
```
For this case, we could create a tunnel through the CRI, using either PortForward or Attach for these https://github.com/kubernetes/cri-api/blob/c75ef5b473bbe2d0a4fc92f82235efd665ea8e9f/pkg/apis/runtime/v1/api.proto#L95-L98. These would only be a backup option if access wasn't implemented, since it has pretty substantial perf implications.
Do we wish to ship a stopgap version of this feature before the standardization on containerd is complete?
Agh - so, we kind of could: it would be very simple to add a little command line adapter to #6288.
Here's an illustrative example of what this could look like pre-engine rearchitecture:
```
$ dagger-compute-driver --dagger-version=0.9.8 connect
-> send data to stdin to send it to dagger
<- recv data from stdout to receive it from dagger
<- recv data from stderr to get logs from the compute driver
```
Instead of having a `--containerd-version`, we have a direct `--dagger-version`. Instead of spinning up a containerd API, the driver is responsible for spinning up a dagger instance directly.
We could do this pretty easily - whether we should or not depends on a couple of non-technical constraints:
- These drivers would be incompatible with future engine/cli releases. This is a bit annoying: if anyone built drivers, we'd need them to go and update them later.
- However, we could have a cross-over period, and allow the CLI to support both new containerd drivers, and the "legacy" direct dagger connections.
- Should we rush this so that people can build drivers? Is this a priority for right now? (in comparison to modules, the containerd work, etc)
- If this unblocks some fun opportunities to engage with community/partners, maybe it's worth it ASAP!
- I think the technical side of this minimal implementation is small, but comms+docs are likely to be the bigger part (as well as having somewhere where we can show off all the integrations).
Renamed "engine driver" to the more clear "compute driver".
Bikeshedding the name of this is probably not the most important part at this stage, but - to me, "compute driver" conjures up images of other drivers - like network drivers, storage drivers. None of these really make sense in terms of the engine.
I think engine driver still makes the most sense to me, but would also be happy with:
- backend driver
- external driver
One other thought about this - one of the big open questions around drivers for me is how we should handle lifetimes.
Who manages an engine's lifetime?
- Does the CLI? This doesn't work super nicely with multiple CLIs, potentially across lots of different clients. Also, not super great if the client crashes and can't clean up after itself.
- This is today's behavior, and it's not great - the garbage collection behavior is really weird, since 1. it doesn't run at deterministic intervals, and 2. it doesn't allow multiple engines of different versions running at the same time.
- Does the compute driver? This seems like a good place to put the logic, but also this is invoked by the CLI, so has those same disadvantages.
- Does the engine? The engine could just exit after a while of running, but this is a pretty hefty change, and makes much more sense when we have the containerd-related refactorings.
- Does the backend for the compute driver (like docker, podman, kubernetes) manage this? This feels nice, since it's related to the compute driver, but would require doing some fancy orchestration with the engine.
Or we could simplify this and just kill engines as soon as they're not running jobs, but this does mean that then we actually need to work out how we can avoid the caching service bind mount synchronization at every startup.
Just had a scratch idea, could we do all the CRI magic without needing a huge engine refactor? The thing we didn't like before was the idea of doing lots of extra nesting - but in reality, I wonder if this is actually as non-performant/weird as we think - it's just namespaces. If we're okay with the nesting, we could do this really easily:
- We run containerd (with CRI plugin) using docker
- We run dagger using the CRI
- The dagger engine runs `runc` containers
The only extra step is the CRI layer. We could just... nest one more time. This would avoid needing to refactor everything, and do all the weird containerd mounting (with rootfs propagation), and refactor all of our networking.
I might try taking some measurements here if this seems interesting - it would potentially avoid us needing to spend months on a huge low-level refactor which would likely add lots of nasty bugs.
Who manages an engine's lifetime?
My assumption has been the CLI. But important that we discuss this vigorously, and get consensus. It affects everything else in the design.
- Does the CLI? This doesn't work super nicely with multiple CLIs, potentially across lots of different clients. Also, not super great if the client crashes and can't clean up after itself.
I think having the CLI manage it is the best option. Managing the engine's lifecycle is delicate to get right, with lots of edge cases to worry about (how to handle crashed clients being one of them). IMO that should only be implemented once - which means in the CLI. Pushing it to drivers would be too much work per driver.
- This is today's behavior, and it's not great - the garbage collection behavior is really weird, since 1. it doesn't run at deterministic intervals,
I don't have a good answer to garbage collecting, but that seems solvable (the process inside the container could terminate after a set period; the inert container itself could be garbage-collected later)
and 2. it doesn't allow multiple engines of different versions running at the same time.
I didn't understand this part. Couldn't we add support to the CLI for running different versions of the engine?
- Does the compute driver? This seems like a good place to put the logic, but also this is invoked by the CLI, so has those same disadvantages.
All the downsides of CLI + extra redundant work per driver.
- Does the engine? The engine could just exit after a while of running, but this is a pretty hefty change, and makes much more sense when we have the containerd-related refactorings.
The engine can't clean up its own container... can't manage multiple versions of itself... can't decide whether to reuse a pre-existing container vs. start a new one.
- Does the backend for the compute driver (like docker, podman, kubernetes) manage this? This feels nice, since it's related to the compute driver, but would require doing some fancy orchestration with the engine.
Compute driver should be responsible for provisioning the compute (up to containerd/CRI(?)). But IMO not responsible for what runs on top.
Or we could simplify this and just kill engines as soon as they're not running jobs, but this does mean that then we actually need to work out how we can avoid the caching service bind mount syncronization at every startup.
I don't have the answer to this, but the CLI seems like the best place to try and implement a solution.
Potentially it's this simple:
- CLIs create dagger engines, with a timeout parameter
- After the engine has exceeded a timeout without a new connection, it shuts itself down (this stops its own container)
However, we probably want to allow each engine driver to set the "timeout". e.g. for local docker instances, there's no harm in having this be quite long, but for a cloud-compute backend, where each instance could be costing money, we'd probably want to have this set lower (minutes instead of hours).
We could do this through another command to the driver, something like `dagger-compute-driver config`, which would output a JSON blob to allow customizing some of the connection behavior in the CLI.
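To sketch what that could look like (the command name, flags, and JSON fields below are invented for illustration):

```
$ dagger-compute-driver config
{
  "engineIdleTimeout": "30m",
  "supportsAccess": true,
  "defaultEngineVersion": "from-cli"
}
```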
@jedevc the CRI + nesting approach seems promising! Take my opinion with a grain of salt since I'm not aware of all the complications of either the nesting or the containerd backend change. But I'm curious to find out more! wdyt @sipsma ? Worth investigating this while you're on vacation?
I am not convinced about adding more layers to our stack.
Today, most of us run a Dagger Engine container in Docker which runs in a VM. Some run Docker on Linux directly, and skip the VM layer. Either way, we end up with:
```
Docker (container runtime)
|- Dagger Engine (container)
```
If we standardise on containerd as the default runtime, then we would have:
```
Docker (external container runtime)
|- containerd (internal container runtime)
   |- Dagger Engine (container)
```
This makes me wonder about:
- Who is responsible for managing `containerd`? CVEs, new features, upgrades, etc.
- How do we make resource allocation fair when there are multiple Engines running? The noisy neighbours problem will only get worse - we now have multiple Engines each with its own set of noisy neighbours. If we set quotas per Engine, how are we going to handle multiple Engines needing more quota than `containerd` has available?
- What impact does adding `containerd` have on services in Dagger?
Instead of following on the containerd path, I am tempted to redirect this conversation towards what it would take to implement the following DX:
```
dagger provision \
  [--runtime=docker|podman|kubernetes|fly|depot...] \
  [--runtime-version=0.9.10]

# default runtime: docker
# default version: CLI version
```
This DX implies support for:
- `DAGGER_RUNTIME`
- `DAGGER_RUNTIME_VERSION`
I would also introduce `DAGGER_HOST` (an iteration on `_EXPERIMENTAL_DAGGER_RUNNER_HOST`) and only support `unix://` and `tcp://`.
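Putting the proposed knobs together, usage would look roughly like this (everything here is part of the proposal above, nothing exists today):

```sh
# Proposed, not implemented: pick the runtime + version explicitly...
export DAGGER_RUNTIME=podman
export DAGGER_RUNTIME_VERSION=0.9.10
dagger provision

# ...or skip provisioning and point at an existing engine.
export DAGGER_HOST=tcp://dagger.internal:8080
```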
Are we OK to continue in this direction, or should we sync on the containerd approach @jedevc? If you feel strongly about it, I am more than happy to load synchronously - 1:1 - on the containerd & CRI context which I may be missing.
The stopgap that I am thinking of is implementing compute drivers via modules. We still need an Engine to run these, but we already have that today, so this wouldn't require any changes in the Engine itself. What are your thoughts @shykes on a compute driver module?
@gerhard the idea is that almost always, there is already a containerd in the stack. It's the most ubiquitous and reliable container runtime out there, IMO. With CRI we potentially keep the option of also supporting other runtimes (I think? @jedevc ?). So I don't think it's accurate that it would add layers to the stack.
I think there's two pieces of work here that are both important to clarify:
- The compute drivers work - this allows the CLI to have pluggable components that allow running dagger anywhere.
- The stateless engine work (the containerd stuff discussed above) - this allows us to collapse versioning into the CLI
While scoping this out, I wondered - could we just ignore the stateless engine and have all this logic in the compute driver? I think that's what you're suggesting @gerhard.
I think this works for super simple setups, like docker or podman - but starts to work less well when trying to deploy dagger at scale: e.g. how should I deploy a helm chart in my kubernetes cluster that manages dagger? If I have to pick a version of dagger, then the CLI can be out of sync - which is the whole thing we want to avoid.
So here are the ways I can see of "collapsing" versioning:
- Compute driver is always in charge - it connects directly to a backend resource (like kubernetes), and creates a dagger instance (like in a pod) manually.
- This is similar to buildx. I really do not love this model, it means that 1. every user of dagger in this cluster needs access to the kubernetes cluster, you can't just expose a service to connect to, and 2. there's no centralization, so the ops team has no real ability to manage/upgrade/observe the dagger service.
- Based on this, I think most people just opt for managing native buildkit themselves, and connecting using the buildx `remote` driver, instead of the `kubernetes` driver.
- Compute driver is in charge, but connects to a backend helper (like a kubernetes operator).
- This improves on the above, by allowing for some centralization and management, and doesn't require giving away any/as many permissions.
- But now, we need to go write a kubernetes operator (or similar). That's another component to maintain, and also only works for kubernetes. What about running in nomad, or docker swarm, or even ECS/etc? Then we'd require having to write an explicit "operator" tool for each of those to support it.
- Do the containerd trick
- Wonderfully simple deployment - now, all you do is deploy containerd-in-a-container, and we spin up dagger on demand for that.
- It's like the operator idea, but now, we only write for 1 backend - containerd/CRI.
The last option seems like the most reasonable to me - but maybe there's some other options I'm missing.
Who is responsible for managing containerd? CVEs, new features, upgrades, etc.
Depends - I suspect we should ship a "dagger runner helper", and bundle containerd ourselves. This would be a separate component, we version it separately, and track containerd releases as closely as possible.
For advanced use cases, users can always run it themselves.
How do we make resource allocation fair when there are multiple Engines running?
I'm not sure this affects much. The engine itself should actually have relatively low overhead, the main expense is in the build containers that are actually being run: and in this model, we're still running roughly the same number.
Eventually, we should aim to specify limits not for the engine, but for the build containers that are spun up on demand. This requires new config options, new apis, etc, but IMO, this would be the way forward.
What impact does adding `containerd` have on services in Dagger?
Depends on whether we're okay with that extra layer of nesting. If we have it, it should not really have any effect; if we don't, we might need to do a bit of hackery :tada:
Obviously, services wouldn't be able to talk from one engine to another, but I think that's fine and expected.
The big one for me about this new architecture idea is that it does have rather profound impacts on how we might add rootless mode one day. Containerd rootless is a thing, but then if we're nesting more containers, that might be painful.
Potentially this isn't really an issue - we could just require that rootless in the future use "direct connect" (connect directly to the dagger version you need, instead of going through containerd).
I had a fun idea just now.
What if we could kind of get the best of all the worlds? Here's the idea:
- We implement `tcp`/`unix`/etc as "low-level connection protocols" - these take an address.
- We implement `containerd` or `cri` as "higher-level connection protocols" - these take an image name (similar to `docker-image` today), as well as a way to connect to the CRI
  - One day, we could implement an optimization for this, where dagger will itself use the containerd socket from there, to avoid that extra level of nesting (and do the fun networking dance/mount propagation that requires).
- Then we allow implementing a ton of "specific protocols" - but each of these is either allowed to: 1. connect directly to a specific dagger version, 2. return a connection string to another driver, which can chain.
- So kubernetes could be implemented in one of two ways:
- It could spin up a specified dagger version as a pod, and connect to that.
- It could spin up a containerd instance in a pod, and then return the CRI connection for the CRI driver to connect to.
I quite like this idea, since it means the CRI connection bits would just exist as "another" way of connecting.
The CLI gets its way - for almost every case (except direct connect with low-level tcp/unix connections to a dagger instance), it can spin up its own container version, either through the CRI, or through some higher level API.
In this world:
- For local runs, we still spin up a container for each version of dagger
- For runs on kubernetes/nomad/swarm/etc, we recommend deploying our "runner helper" (which the infra team can manage), and users connect to it using CRI.
- For runs on partner platforms, partners could do whatever they like behind the scenes - they just get a request saying "give me this dagger version", and they go and do it (or they can chain into the CRI driver after spinning up containerd).
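To illustrate the chaining described above, a hypothetical kubernetes driver could either hand back a direct connection, or delegate to the CRI driver (the commands and URL schemes below are invented):

```
# Option 1: the kubernetes driver provisions an engine pod itself and
# returns a direct connection string for a specific dagger version.
$ dagger-compute-driver-kubernetes access my-engine
tcp://10.0.12.34:8080

# Option 2: it only provisions containerd-in-a-pod and chains into the
# generic CRI driver, which can then spin up whatever version the CLI wants.
$ dagger-compute-driver-kubernetes access my-engine
cri://containerd.dagger-system.svc.cluster.local:16789
```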
After a long absence, I am trying to load all this context. The thing which I am missing right now is a whiteboard session. I am trying to re-create that here:
```mermaid
flowchart LR
    CLI
    subgraph "Container Runtime"
        subgraph dagger-runner-helper
            subgraph containerd
                EA[Engine A]
                EB[Engine B]
            end
        end
    end
    CLI -.-> |provisions| dagger-runner-helper
    CLI --> |runs against| EA
```
If the above diagram is correct, where do the following components fit?
- Compute Driver
- Kubernetes / Podman / Docker
I have a lot of follow-up questions, but my focus is to resume this, not to be comprehensive. I suspect that some sync time with @jedevc would help us make quicker progress. Reaching out via DM to organise something.
CLI controls which version of the Engine gets provisioned.
A Compute Driver is a dagger CLI plugin: binaries in `.dagger/drivers` (e.g. a `docker` binary, `kubernetes`, etc.), versioned separately from the CLI, speaking STDIN/STDOUT as a net.Conn. Could be a simple shell script.
```
_EXPERIMENTAL_DAGGER_DRIVER_URL=(scheme://)(driver-specific-implementation)
dagger [--experimental-driver-url] (run|...)
```
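As a sketch of how low the bar could be, a docker-backed driver under this scheme might be little more than the script below (the argument convention, socket path, and the use of socat to play the net.Conn role over STDIN/STDOUT are all assumptions about a protocol that is not pinned down yet):

```sh
#!/bin/sh
# Hypothetical ~/.dagger/drivers/docker: provision an engine with docker,
# then bridge the CLI's STDIN/STDOUT to the engine's listening socket.
set -eu

ENGINE_VERSION="$1"                 # assumed: passed by the CLI, e.g. 0.9.10
NAME="dagger-engine-$ENGINE_VERSION"

docker inspect "$NAME" >/dev/null 2>&1 || \
  docker run -d --privileged --name "$NAME" \
    "registry.dagger.io/engine:v$ENGINE_VERSION"

# Assumed socket path inside the engine container; socat stands in for
# the net.Conn handed back over stdio.
exec docker exec -i "$NAME" \
  socat STDIO UNIX-CONNECT:/var/run/buildkit/buildkitd.sock
```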
Synced with @gerhard, just wanted to write up some of my notes:
- Agreed that the purpose should be to provide an agnostic way of provisioning dagger engines - not just tied to the docker CLI. Additionally, this should provide part of the solution for collapsing versions - where the version of the CLI dictates the version of the engine.
- We discussed what the DX should look like - should we have an explicit `dagger provision` command, or should this be implicit? How should a driver be configured? We ended up on the code snippet above, where a user can set an environment variable (naming hard ofc), pass a flag, etc with a URL. The URL scheme is the name of the driver to look up in `~/.dagger/drivers`, while the rest of the URL can be user-defined (quite similar to buildkit's connhelpers today).
- We discussed the complexity of interacting with pre-provisioned environments - where we would connect directly using `tcp://`/`unix://` or even to our own CI environment. My end proposal for this would be to:
  - Discourage users connecting to pre-provisioned dagger instances at all.
  - When users connect to a pre-provisioned instance without a driver, we should enforce the constraint that the CLI + engine versions match with a hard error.
  - The biggest blocker of this today is pre-provisioned kubernetes environments - to tackle this, we should provide a kubernetes driver (which we could potentially adopt in our own CI)
    - A first iteration of this will be similar to the buildx kubernetes driver, where end-users/ci jobs need some level of access to create pods in a cluster
    - A future iteration of this would be closer to a full-on dagger operator in kubernetes - which would also provide a canonical way to run dagger in this setup.
- We discussed that we liked the idea of it being very simple to create a driver - something as simple as a shell script. This does limit the interface a bit, but makes the barrier for entry very low.
- We discussed the issue of lifetimes and implicit vs explicit provisioning. If a `dagger call`/etc causes an engine to be provisioned, when should it be shut down?
  - I'd like to allow drivers to make this call themselves - some may decide to kill the engine immediately after the client exits, some might want to configure the engine to terminate after some amount of time after no builds have been run.
  - Additionally, I'd like to avoid introducing an explicit de-provision into drivers. For one, these aren't reliable (due to client crashes, forgetfulness, etc) as we've seen with the `docker-image` garbage collection today, and secondly, they require some sort of shared state/synchronization - which not every driver would be able to do.
  - We also wondered whether it should be possible to manage engines created through the driver - e.g. operations like list/edit/delete. Potentially drivers should be able to do this, but maybe we could omit this for a first-pass.
In general, I think we're on a similar page:
- Reduce scope, ship something experimentally and get feedback, iterate quickly
- Treat this as mostly orthogonal to any containerd/engine-in-engine work - if drivers are successful, this does make it a little less urgent, but it could be used as a fallback for when manual pre-provisioning has been done.
- Make the interface as simple as possible, but open it up to be extended over time with additional features.
If we do this, I have a funny little idea - I'd like to (personally) make a `cri://` driver. All it would do is allow provisioning dagger engines through the CRI interface as we discussed above.
Now, this would potentially be another (a bit more limited) way to deploy dagger in kubernetes:
- Deploy a containerd container in kubernetes, and expose it as a service
- Connect the dagger CLI to the containerd CRI with the driver
- Now, we can spin up arbitrary dagger engines in kubernetes, but without needing the end-user to have a ton of permissions.
A small thought that came up as a little implementation detail in a discussion with @sipsma and @shykes: how does this interact with clustered dagger / dagger module versions / etc.?
One thought I had is that for this, we could create a new session API that the client exposes to "spin up a new engine with version X" - when a dagger instance wants a new node (maybe for more compute power, or maybe with a different version, or a different architecture), it could use this session API to create the node.
For pre-provisioned instances, we obviously couldn't do that - but there are some options there:
- Don't allow these features for pre-provisioned instances (do they even make sense?)
- Deprecate pre-provisioned instances entirely, drivers only!
- Fallback to spinning up engine-in-engine containers (but obviously doesn't give us more compute power / different architectures / etc)
Resurrecting this thread in the context of #9516 .
I worry that we have not made progress towards consensus on these issues. cc @gerhard @jedevc @sipsma