Allow easy healthcheck for the dagger engine

Open jedevc opened this issue 1 year ago • 8 comments

Spun out from this discord discussion: https://discord.com/channels/707636530424053791/1253636253358755840/1253636253358755840

It's currently quite difficult to write a healthcheck for the dagger engine itself - how do you know if it's running, and ready to accept connections?

Previously, you used to be able to use the buildctl command to connect manually (but that was removed), or even use a dummy dagger query command (but that was also removed).

We should make it easy to run a command in the engine container to see if the engine is running and healthy. Ideally, this would mean shipping dagger inside the engine container, and maybe even providing a simple dagger ping or dagger health command to check communication (we could use this in our own CI as well).

jedevc avatar Jun 24 '24 11:06 jedevc

> Previously, you used to be able to use the buildctl command to connect manually (but that was removed), or even use a dummy dagger query command (but that was also removed).

Did we remove the buildctl CLI from the image? At the moment we are still using buildctl in our helm chart: https://github.com/dagger/dagger/blob/8aac1a62a204529ded0d9aebafdcdcb5df2397be/helm/dagger/templates/engine-daemonset.yaml#L71-L73

matipan avatar Jun 24 '24 12:06 matipan

Oops. That's not right :scream:

https://github.com/dagger/dagger/blob/8aac1a62a204529ded0d9aebafdcdcb5df2397be/ci/build/builder.go#L257-L259

buildctl needs to exist, but it's a symlink to dial-stdio - this was changed ages ago, in https://github.com/dagger/dagger/pull/6100#issuecomment-1809271407.

jedevc avatar Jun 24 '24 12:06 jedevc

Good to know!

What does dial-stdio currently do for us in the context of a health check?

matipan avatar Jun 24 '24 13:06 matipan

Uh, it hangs until stdin is closed :thinking:

jedevc avatar Jun 24 '24 13:06 jedevc

I think the solution is:

  1. Bundle the CLI and the engine together into a single image (#6887)
  2. Add a simple ping() String! function in the dagger api

Then the healthcheck can simply be dagger query <<< '{ping}' or, in the future, dagger core ping

shykes avatar Jun 24 '24 15:06 shykes

I'm planning on adding a whole engine API under query in our graphql API (to support cache query and control as part of https://github.com/dagger/dagger/pull/7646), at which point we can move version and/or add ping under there. So then this would be dagger core engine version/dagger core engine ping.

> Bundle the CLI and the engine together into a single image

Still agree with doing this generally speaking, but I don't think it would be a pre-req for implementing this functionality. As far as I can see, it's orthogonal.

sipsma avatar Jun 24 '24 17:06 sipsma

> I'm planning on adding a whole engine API under query in our graphql API (to support cache query and control as part of #7646), at which point we can move version and/or add ping under there. So then this would be dagger core engine version/dagger core engine ping.

Isn't the entire API the engine API already? What's the rule for what goes under engine and what doesn't?

> > Bundle the CLI and the engine together into a single image
>
> Still agree with doing this generally speaking, but I don't think it would be a pre-req for implementing this functionality. As far as I can see, it's orthogonal.

Isn't the issue that the dagger binary isn't available in the image, and therefore there is no reliable way to query the API for a healthcheck? i.e. you need dagger installed to run dagger query. Or am I missing something?

shykes avatar Jun 24 '24 21:06 shykes

> Isn't the entire API the engine API already? What's the rule for what goes under engine and what doesn't?

We call it the core API; the engine API would be for all the global state of the engine as a whole, so things like its version, its cache configuration, its current disk usage, manual pruning, etc.

Open to bikeshedding on the name as always but that's what I was imagining regardless of the name.

> Isn't the issue that the dagger binary isn't available in the image, therefore there is no reliable way to query the API for a healthcheck? ie. you need dagger installed to run dagger query. Or am I missing something?

AFAIK it's fine to require the CLI to run a health check on the engine container. You need something to call to run a health check, so it may as well be the CLI in the general case, since it handles all the different drivers for connecting to the engine.

  • For the particular case of the engine being connected to directly over tcp/unix-sock, you could just implement this with curl, though; you'd just be submitting a gql query.
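
For illustration only, here is a rough Python sketch of what "just submitting a gql query" might look like for the TCP case. Everything specific in it is an assumption rather than something confirmed in this thread: the endpoint URL is whatever your engine exposes, the `{ version }` query is a guess based on the `dagger core version` command mentioned here, and a real engine may require authentication or a session that this sketch does not handle. The names `build_request`, `response_is_healthy`, and `check_engine` are hypothetical helpers.

```python
import http.client
import json
import urllib.error
import urllib.request

# Assumed query; the thread proposes `ping` but later settles on `version`.
HEALTH_QUERY = "{ version }"


def build_request(endpoint: str, query: str = HEALTH_QUERY) -> urllib.request.Request:
    """Build a standard GraphQL-over-HTTP POST request for the given endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def response_is_healthy(raw: bytes) -> bool:
    """Treat a response as healthy if it is JSON with data and no errors."""
    try:
        payload = json.loads(raw)
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    return "errors" not in payload and bool(payload.get("data"))


def check_engine(endpoint: str, timeout: float = 5.0) -> bool:
    """Send the health query; any network or protocol failure counts as unhealthy."""
    try:
        with urllib.request.urlopen(build_request(endpoint), timeout=timeout) as resp:
            return resp.status == 200 and response_is_healthy(resp.read())
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False
```

A probe would call something like `check_engine("http://<engine-host>:<port>/<path>")`, with the host, port, and path filled in for your setup. The unix-socket case would need a different transport and is not covered here.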

sipsma avatar Jun 25 '24 22:06 sipsma

Think we can close this out, since we've settled on using dagger core version as a health check, and this is now what we use for our own helm charts as well.
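
As a minimal sketch of how a probe could wrap that command, the Python below runs it with a timeout and maps the exit code to healthy/unhealthy. It assumes the dagger CLI is on PATH and can already reach the engine; the helper names `command_succeeds` and `engine_is_healthy` are made up for this example, and the 10 second timeout is an arbitrary choice. The timeout matters because, as noted earlier in the thread, some commands can hang rather than fail.

```python
import subprocess


def command_succeeds(argv, timeout=10.0):
    """Return True only if the command exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,   # never wait on input
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the binary is missing or not executable.
        return False
    return result.returncode == 0


def engine_is_healthy(timeout=10.0):
    """Health check based on the command settled on in this thread."""
    return command_succeeds(["dagger", "core", "version"], timeout=timeout)
```

A wrapper script could then exit with 0 when `engine_is_healthy()` returns True and 1 otherwise, which is the convention most container health probes expect.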

:tada:

jedevc avatar Sep 05 '24 17:09 jedevc