agent-protocol
Proposal: Agent healthcheck endpoint to advance production-readiness of deployed agents
Is your feature request related to a problem? Please describe.
As it stands, there is no universal standard in the spec for the liveness/readiness of an Agent running in the context of a system. Until I receive a bad response to a task-related request from a vendor Agent, model, or custom Agent, I don't know there are problems with my Agent (within the context of the spec).
At a high level, I believe an open protocol that can be applied across all Agent implementations should treat an Agent like any other service within a stack and consider things like observability, events/metrics, shutdown, etc., but this proposal is limited in scope to a binary healthy/unhealthy discovery implementation.
Describe the solution you'd like
Kubernetes actually has a great solution in the form of two health check probes, liveness and readiness. In that context, the liveness healthcheck returns either a 200 or an unhealthy HTTP status like 400/500 and indicates whether the service is alive, and the readiness healthcheck does the same but also ensures all dependencies are alive (such as a connection to a database).
An Agent that depends on a connection to a single model arguably has no external dependencies in this sense, since it could simply be considered not alive/unhealthy if there is no connection to its single LLM, but I think we could easily see a future where a single orchestrator Agent facilitates interactions between multiple models, and a truly universally applied spec needs to consider such circumstances.
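To make the liveness/readiness distinction concrete for an orchestrator Agent with several model dependencies, here is a minimal sketch. The function names, the `DependencyCheck` type, and the choice of 503 as the unhealthy status are my own assumptions for illustration; none of them come from the spec.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type: each check returns True when its dependency responds.
DependencyCheck = Callable[[], bool]


def liveness() -> int:
    """The process is up and able to answer, so report 200."""
    return 200


def readiness(checks: Dict[str, DependencyCheck]) -> Tuple[int, Dict[str, bool]]:
    """Report 200 only if every dependency check passes, otherwise 503."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

With one reachable and one unreachable model, `readiness({"model_a": lambda: True, "model_b": lambda: False})` reports 503 while `liveness()` still reports 200, which is the gap a single combined check would hide.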
An endpoint /ap/v1/agent/health_check would be a good place to capture health-related inquiries, and I'd love to hear more discussion from there about:
- Whether liveness and readiness are both requirements
- Whether it needs to be a concern of the spec beyond providing a dedicated endpoint to query health and can be left as an Agent-implementation concern from there
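As a starting point for that discussion, a dedicated endpoint could be as small as the sketch below, which uses only Python's standard library. The path is the one proposed above; the `agent_is_healthy` hook, the `{"healthy": ...}` body, and the 200/503 pairing are assumptions of mine, not anything the spec defines.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Tuple

HEALTH_PATH = "/ap/v1/agent/health_check"  # path proposed in this issue


def agent_is_healthy() -> bool:
    """Hypothetical hook: replace with the Agent's own health logic."""
    return True


def health_response() -> Tuple[int, bytes]:
    """Build the status code and JSON body for a health query."""
    healthy = agent_is_healthy()
    body = json.dumps({"healthy": healthy}).encode("utf-8")
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != HEALTH_PATH:
            self.send_error(404)
            return
        status, body = health_response()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        # Silence per-request logging for this sketch.
        pass


def make_server(port: int = 0) -> ThreadingHTTPServer:
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
```

Running `make_server(8000).serve_forever()` and issuing a GET against the path would then answer the health query without touching any task-related endpoint.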
I think an implementation that cannot tell whether a non-200 response is due to my Agent failing to start or to a bug in my implementation is a poor developer experience, so Agent implementations will solve for this on their own and, without a protocol-driven spec, will undoubtedly differ in how they do it.
Describe alternatives you've considered
The alternative is mostly just not providing a way to query an Agent's health state, which is the current status of the Agent Protocol. Agent health then remains the responsibility of each Agent-specific implementation rather than of the protocol, which leads to a lack of consistency and will promote vendor lock-in should Agents evolve into 3rd-party SaaS tooling.
Additional context
It could be a worthwhile discussion to extend this conversation to things like Agent deployment versioning (/ap/v1/agent/version) and other deployment state/service context discovery as well, but again, this proposal is limited in scope to solely a health check.
I'm curious what your thoughts are on the info endpoint; I posted a potential schema for it here in Issue #39.
It didn't cover a health check for readiness and liveness, but I think that could technically be added there, or handled by some other type of standardized status message.
The schema we were considering does include a version for both the Agent Protocol and the Agent itself, which would be useful for clients.
That's an interesting proposal @jzanecook. I have some thoughts about that info proposal, which is distinctly different from this one; I believe they are, and should be, separate concerns. I think it's worthwhile deciding earlier rather than later how prescriptive the agent protocol needs to be. For what it's worth, I prefer open/extensible protocols rather than restrictive ones, and I also believe a mechanism in the protocol for Agent metadata is valuable but is a separate concern from liveness or readiness.
I would find a lot of value in being able to GET liveness, with a 200 response being good enough to detect liveness of an Agent. I think the readiness check is distinctly different because it indicates that dependent resources are also alive and responsive. Agent metadata is also valuable, but a GET for info/metadata should be different than a GET for a factor of health. My preference in RESTful architectures (since this protocol appears to assume HTTP so far) is always to keep separate endpoints for separate resources.
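A rough sketch of what "separate endpoints for separate resources" could look like as a routing table. All three paths, the handler names, and the placeholder version values are hypothetical; nothing here has been agreed on in this thread or in the spec.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, dict]

# Hypothetical paths, one per resource.
LIVENESS_PATH = "/ap/v1/agent/health/liveness"
READINESS_PATH = "/ap/v1/agent/health/readiness"
INFO_PATH = "/ap/v1/agent/info"


def get_liveness() -> Response:
    # Answering at all is the signal, so an empty 200 is enough.
    return 200, {}


def get_readiness(dependencies_ready: bool = True) -> Response:
    # 503 signals that the Agent is up but a dependency is not.
    return (200, {}) if dependencies_ready else (503, {})


def get_info() -> Response:
    # Placeholder metadata; real values would come from the Agent.
    return 200, {"agent_version": "0.0.0", "protocol_version": "v1"}


ROUTES: Dict[str, Callable[[], Response]] = {
    LIVENESS_PATH: get_liveness,
    READINESS_PATH: get_readiness,
    INFO_PATH: get_info,
}


def dispatch(path: str) -> Response:
    """Look up the handler for a GET path; unknown paths get a 404."""
    handler = ROUTES.get(path)
    if handler is None:
        return 404, {}
    return handler()
```

Keeping the table flat like this means a client polling liveness never pays for a metadata lookup, and the info schema can evolve without changing what a health probe sees.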