
Add healthcheck on serve

Open ieaves opened this issue 2 months ago • 4 comments

Feature request description

Add a health check when running ramalama serve.

Suggest potential solution

@rhatdan you might have a better idea as to how this could best be implemented, but one route might be to add the health check on serve here. There is already an implementation of is_healthy and wait_for_healthy which can be reused.
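
A rough sketch of the kind of check that could run after serve launches the runtime, purely for illustration - the real is_healthy and wait_for_healthy in RamaLama may have different signatures and behavior, and the URL and field names below are assumptions:

import json
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str, model_name: str) -> bool:
    """Return True if the server answers /v1/models and lists the expected model."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=2) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
        return False
    return any(m.get("id") == model_name for m in payload.get("data", []))


def wait_for_healthy(base_url: str, model_name: str, timeout: float = 60.0) -> bool:
    """Poll is_healthy() until it succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_healthy(base_url, model_name):
            return True
        time.sleep(0.5)
    return False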

One optional extension might be switching from execvp in exec_cmd to popen so we hold onto the parent process until we've confirmed the server actually came up.
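
Something along these lines, with hypothetical names and arguments (exec_cmd here is a stand-in for illustration, not the actual RamaLama function):

import subprocess
import sys


def exec_cmd(args: list[str], base_url: str, model_name: str) -> int:
    # os.execvp(args[0], args) would replace this process entirely;
    # subprocess.Popen keeps the parent alive so it can verify startup.
    proc = subprocess.Popen(args)
    if not wait_for_healthy(base_url, model_name):  # see the sketch above
        proc.terminate()
        print("server failed its health check", file=sys.stderr)
        return 1
    return proc.wait()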

Have you considered any alternatives?

No response

Additional context

This is a follow-on from the discussion in #1524.

ieaves · Oct 17 '25 17:10

I like the idea of adding a health check to make sure the service comes up. One issue with a health check is that it would really need to be per-image, especially once we get into different Model Runtimes.

rhatdan · Oct 20 '25 14:10

AFAIK the current is_healthy doesn't perform a /health (or similar) API request, but checks if the expected model is listed in the /models API response. This is a special case explicitly useful for the ramalama run command, since you want to start chatting right away - it is supposed to block until the server and the model are available. @ieaves Do you have a health check for the server or for the specific model (as it currently is) in mind?

Yes, I think this depends fully on the model runtime / inference engine used. In addition to the above, the current check also only works for llama.cpp. Similar to assembling the command, we could specify the parameters of the health check in the inference spec and delegate to a generalized implementation in RamaLama. For example:

commands:
...
api-checks:
   # llama-server health check
   # see: https://github.com/ggml-org/llama.cpp/pull/9056
   health: "/health"
   
   # check if model xyz is served and running
   model-served: 
      path: "/models"
      name: "{{ model.model_name }}"
   ...
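
As a sketch of how RamaLama could consume such an api-checks section (the spec shape above is the proposal; the function below and its field handling are assumptions, not existing code):

import json
import urllib.error
import urllib.request


def run_api_checks(base_url: str, api_checks: dict, model_name: str) -> bool:
    """Evaluate the api-checks section of an inference spec against a running server."""
    try:
        if "health" in api_checks:  # e.g. llama-server's /health endpoint
            with urllib.request.urlopen(base_url + api_checks["health"], timeout=2) as resp:
                if resp.status != 200:
                    return False
        if "model-served" in api_checks:  # expected model must appear in the listing
            # the "{{ model.model_name }}" template in the spec would be rendered
            # into model_name before this call
            path = api_checks["model-served"]["path"]
            with urllib.request.urlopen(base_url + path, timeout=2) as resp:
                payload = json.load(resp)
            if not any(m.get("id") == model_name for m in payload.get("data", [])):
                return False
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
        return False
    return True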

engelmi · Oct 21 '25 08:10

AFAIK the current is_healthy doesn't perform a /health (or similar) API request, but checks if the expected model is listed in the /models API response.

I wanted this to be a good first issue for newcomers to the project, so my primary goal was to suggest a reasonable implementation path for someone not already intimately familiar with the project. Hence leveraging existing functionality as a first step. Runtime-specific /health endpoints would absolutely be preferable, though.

I believe all of our current runtimes serve over an OpenAI API-compatible interface at the moment, so shouldn't /v1/models be universally available? I know it works on mlx in addition to llama.cpp.

ieaves · Oct 27 '25 17:10

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] · Nov 27 '25 00:11

I am dealing with a weird issue: ramalama run starts quickly after the initial download, but ramalama serve can't come up as fast even when using the existing cached model data. A health checker would be useful when testing response time with curl.

TomLucidor · Dec 19 '25 05:12