conmon icon indicating copy to clipboard operation
conmon copied to clipboard

healthcheck feature

Open samifruit514 opened this issue 3 months ago β€’ 19 comments

context: We are running a multi-container application with docker-compose, but through podman (podman system service --log-level debug unix:///tmp/podman.sock). The apps definitions inside the docker-compose.yml file contains a bunch of health checks and dependencies. Since we run that in a CI, WITHOUT systemd, there is no healthchecks (no unit are created for healthchecks). Because of that, the multi-container app doesnt run.

According to podman people, conmon should handle health checks, or at least, conmon would be a great candidate to do the healthchecks.

This PR accepts --enable-healthcheck in conmon args, enabling the healthchecks from conmon to podman, through the unix pipe (the same one that sends the PID to link).

For more info on healthcheck handling by podman: https://github.com/containers/podman/pull/27033

samifruit514 avatar Sep 13 '25 21:09 samifruit514

I think this won't work for the reasons https://github.com/containers/conmon/pull/598 doesn't - no support for swapping timers to transition startup healthchecks to regular healthchecks.

mheon avatar Sep 13 '25 21:09 mheon

Ephemeral COPR build failed. @containers/packit-build please check.

Thanks for taking a look mheon! its really appreciated. I've made the changes to handle healthchecks in starting phase. Let me know your feedback! Thanks

samifruit514 avatar Sep 13 '25 22:09 samifruit514

I've made a PR on podman side: It would read from the conmon's pipe: https://github.com/containers/podman/pull/27067 Sorry if its too noisy. Set that PR to draft for now. Its to give a global overview on how it would work

samifruit514 avatar Sep 13 '25 23:09 samifruit514

I'm a bit hesitant about this one because we are strongly contemplating a Conmon rewrite to be released alongside Podman 6 in May - https://github.com/containers/podman/pull/27053 - so additional work on the existing, C conmon seems a bit wasted in light of that. But if this can be done for our November release of 5.7 and that's of use to you, I'm not entirely opposed?

mheon avatar Sep 13 '25 23:09 mheon

integration tests failed for cri-o. Would it be possible that its some race condition? I don't see how this is related to the changes in this PR πŸ€” AI suggests to put some sleep in the bats tests on cri-o side

samifruit514 avatar Sep 18 '25 13:09 samifruit514

I was able to run the "crio restore with missing config.json" test locally:

Running the test...
restore_test_cgroupfs.bats
 βœ“ crio restore with missing config.json

1 test, 0 failures

βœ“ Test passed successfully!
Cleaning up...
Test completed successfully!

its unclear what happened but I have the feeling that running it again could pass? Do we experience flaky in CI sometimes?

@jnovy Sorry to bother you again, could you run the integration test if you get a chance?

samifruit514 avatar Sep 19 '25 21:09 samifruit514

You are right @samifruit514 - integration failures don't seem related to the PR.

Now let's simplify the code as it is super-bloated in the current state - we can remove about two thirds of the code still maintaining the functionality and also drop the JSON parsing functionality completely:

  1. remove the hash-table - there is only one instance of conmon for one container - no hashtable required.

  2. allow podman to invoke heathcheck via CLI parameters instead of JSON parsing, e.g.

    --healthcheck-cmd "curl -f http://localhost:8080/health"
    --healthcheck-interval 30
    --healthcheck-timeout 5
    --healthcheck-retries 3
    --healthcheck-start-period 60
    
  3. Use GLib timer instead of pthreads:

 // Use GLib timeout source instead of pthread
 healthcheck_timer_id = g_timeout_add_seconds(healthcheck_interval,
                                             healthcheck_timer_callback, NULL);
  1. Use static string constants:

    static const char *status_strings[] = {
        "none", "starting", "healthy", "unhealthy"
    };
    

Then error handling and memory management can be significantly simplified. Also, there is no test coverage but let's cover it once the above is done - and code simplified. Does it make sense @samifruit514 ?

jnovy avatar Sep 22 '25 09:09 jnovy

Awsome. makes sense! I will do the changes

samifruit514 avatar Sep 22 '25 11:09 samifruit514

Thanks, there are merge conflicts in src/cli.[ch] - these need to be addressed first. Also I see the hash_table functions were not removed yet?

jnovy avatar Sep 23 '25 07:09 jnovy

Indeed, conflicts πŸ˜…. I still need to get more comfortable with PRs against upstreams. They are now resolved, thanks

As for the hash_table function, I cleaned it upβ€”it was just leftover code from when I mistakenly thought conmon handled multiple containers 🀦.

I also reviewed the rest of the code with my best effort (and my somewhat limited C knowledge, plus a little AI help πŸ˜‰) to make sure everything is as clean as possible.

samifruit514 avatar Sep 23 '25 12:09 samifruit514

Sorry to bother you again @jnovy , do you have the time to review the changes?

samifruit514 avatar Sep 26 '25 18:09 samifruit514

@jnovy :Sorry to bother you, once again! do you have the time to review the changes? I think we are close to have something. I also fixed the integration (bats) tests so they always pass

samifruit514 avatar Nov 03 '25 22:11 samifruit514

Personally I don't see the point for this, we are in the process rewriting in rust so I rather not add large features here that increase the scope of the rewrite.

But also more importantly I don't see how this can help podman here, from a quick look this is calling the oci runtime directly instead of the proper podman healthcheck run command. Podman does a lot more as it tracks the metadata, stores the stdout/err and so on. It also has two different timers startup and the regular healtcheck so I really don't see how this here will solve the podman problem at all.

enabling the healthchecks from conmon to podman, through the unix pipe (the same one that sends the PID to link).

Podman is not a daemon, podman run -d exists right away after container start so the pipe will be broken and useless. So overall I don't think the design makes sense for podman.

If we want to add healtchcheck to conmon then I think we first need a proper design discussion.

Luap99 avatar Nov 20 '25 10:11 Luap99

hey @Luap99 ! thanks for your feedback. Really appreciated! This healthcheck feature is designed for podman system service, I get the point. We need something universal.

How about this: We dont use the pipe. We simplify conmon's logic by just keeping the "interval" setting to have similar behavior as podman's systemd timers which is creating timers with just "interval" as input. The conmon's glib timers would behave like in systemd timers which is running the podman healthcheck run command, which would be passed in arg.

@jnovy a bit different to what we made so far, but it is decoupled from podman. what do you think?

samifruit514 avatar Nov 20 '25 15:11 samifruit514

I'm back @samifruit514 - I agree with @Luap99 that this feature deserves proper design discussion so that conmon and podman interaction is well thought through and functional. On the other hand I don't necessarily think that discussion and implementation can't happen here - involving @jankaluza - as the feature can land sooner than the time allows in the Rust reimplementation. Shall we start the design discussion now?

jnovy avatar Nov 24 '25 10:11 jnovy

Hello and apologies for butting in out of nowhere like this.

We've had company internal discussions about standardizing the Linux distributions to use for our deployments and have landed on Alpine - which in turn uses OpenRC, which means that Podman containers deployed here do not trigger healthchecks or livelyness probes. Which is... not fun to deal with, frankly speaking. I did find a workaround using fcron and setting a rather aggressive 1s interval, but I would much rather know that this service is part of how containers are ran in Podman - in particular, Conmon in this case.

Hence, I wanted to know what the current status of this PR is in terms of seing actual implementation? I do see that a Rust-rewrite is planned, so this will likely factor into the timeline quite a lot.

Thank you and kind regards!

senpro-ingwersenk avatar Dec 11 '25 15:12 senpro-ingwersenk

I will reiterate that we are actively migrating away from Conmon to a new version written in Rust, which will include significant architectural changes. I'd like for integrated healthchecks to be one of those changes. So a lack of activity on this PR does not indicate we've abandoned the idea - just that this particular project is in the process of being replaced, and new feature development is focused elsewhere as a result.

mheon avatar Dec 11 '25 21:12 mheon

@mheon Thank you very much for your elaboration - I fully understand.

In that case, what'd be the best way to observe the progress of the Rust port itself? Should I keep an eye out on a certain branch or is there some kind of milestone ticket? I would love to keep up to date with that implementation. :)

senpro-ingwersenk avatar Dec 15 '25 06:12 senpro-ingwersenk