Allow easy healthcheck for the dagger engine
Spun out from this discord discussion: https://discord.com/channels/707636530424053791/1253636253358755840/1253636253358755840
It's currently quite difficult to write a healthcheck for the dagger engine itself - how do you know if it's running, and ready to accept connections?
Previously, you used to be able to use the buildctl command to connect manually (but that was removed), or even use a dummy dagger query command (but that was also removed).
We should make it easy to run a command in the engine container to see if the engine is running and healthy. Ideally, this would mean shipping dagger inside the engine container, and maybe even providing a simple dagger ping or dagger health command to check communication (we could use this in our own CI as well).
Previously, you used to be able to use the buildctl command to connect manually (but that was removed), or even use a dummy dagger query command (but that was also removed).
Did we remove the buildctl CLI from the image? At the moment we are still using buildctl in our helm chart: https://github.com/dagger/dagger/blob/8aac1a62a204529ded0d9aebafdcdcb5df2397be/helm/dagger/templates/engine-daemonset.yaml#L71-L73
Oops. That's not right :scream:
https://github.com/dagger/dagger/blob/8aac1a62a204529ded0d9aebafdcdcb5df2397be/ci/build/builder.go#L257-L259
buildctl needs to exist, but it's a symlink to dial-stdio - this was changed ages ago, in https://github.com/dagger/dagger/pull/6100#issuecomment-1809271407.
Good to know!
What does dial-stdio right now do for us in the context of health check?
Uh, it hangs until stdin is closed :thinking:
I think the solution is:
- Bundle the CLI and the engine together into a single image (#6887)
- Add a simple ping() String! function in the dagger API
Then the healthcheck can simply be dagger query << '{ping}' or, in the future, dagger core ping
I'm planning on adding a whole engine API under query in our graphql API (to support cache query and control as part of https://github.com/dagger/dagger/pull/7646), at which point we can move version and/or add ping under there. So then this would be dagger core engine version/dagger core engine ping.
Bundle the CLI and the engine together into a single image
Still agree with doing this generally speaking, but I don't think it would be a pre-req for implementing this functionality. At least that I can see it's orthogonal.
I'm planning on adding a whole engine API under query in our graphql API (to support cache query and control as part of #7646), at which point we can move version and/or add ping under there. So then this would be dagger core engine version/dagger core engine ping.
Isn't the entire API the engine API already? What's the rule for what goes under engine and what doesn't?
Bundle the CLI and the engine together into a single image
Still agree with doing this generally speaking, but I don't think it would be a pre-req for implementing this functionality. At least that I can see it's orthogonal.
Isn't the issue that the dagger binary isn't available in the image, therefore there is no reliable way to query the API for a healthcheck? ie. you need dagger installed to run dagger query. Or am I missing something?
Isn't the entire API the engine API already? What's the rule for what goes under engine and what doesn't?
We call it the core API; the engine API would be for all the global state of the engine as a whole, so things like its version, its cache configuration, its current disk usage, manual pruning, etc.
Open to bikeshedding on the name as always but that's what I was imagining regardless of the name.
Isn't the issue that the dagger binary isn't available in the image, therefore there is no reliable way to query the API for a healthcheck? ie. you need dagger installed to run dagger query. Or am I missing something?
AFAIK it's fine to require the CLI to run a health check on the engine container. You need something to call to run a health check, may as well be the CLI for the general case in order to handle all the different drivers for connecting to the engine
- For the particular case of the engine being connected to directly over tcp/unix-sock, you could just implement this with curl though; you'd just be submitting a gql query.
Think we can close this out, since we've settled on using dagger core version as a health check, and this is now what we use for our own helm charts as well.
:tada: