[Initiative]: Cloud-Native Foundations for Distributed Agentic Systems
Name
Cloud-Native Foundations for Distributed Agentic Systems
Short description
Formalise principles, reference patterns and ecosystem strategy for running massively distributed systems of collaborating AI agents on Kubernetes and other cloud-native substrates.
Responsible group
TOC
Does the initiative belong to a subproject?
Yes
Subproject name
TOC Artificial Intelligence Initiatives
Primary contact
Vincent Caldeira ([email protected])
Additional contacts
Ricardo Aravena ([email protected])
Initiative description
The purpose of this initiative is to establish a CNCF AI WG sub-stream that formalises principles, reference patterns and ecosystem gaps for running massively distributed systems of collaborating AI agents on Kubernetes and other cloud-native substrates.
Scope Definition
The group will focus on architectural guidance and identifying needs for standards definition, not on defining a new runtime model or deep-diving into frameworks and implementations.
- Protocol interoperability: Can the community converge on Model Context Protocol (MCP) as the default agent-tool and agent-agent wire spec, and how should it be integrated into cloud-native systems? What auth, discovery and streaming extensions are required for cluster and multi-cluster use?
- Agentic Gateway / Data Plane: Traditional REST-centric proxies can’t handle MCP & A2A session fan-out, bidirectional SSE, protocol negotiation or per-agent tenancy. How do we specify a gateway pattern that is session-aware, JSON-RPC–aware, secure and resource-efficient? What minimum behaviours (multiplexing, retries, streaming, auth, tracing) are required for conformance?
- Runtime abstraction: How should an agent be modelled in cloud-native terms (Pod? CRD? sidecar-less process)? What lifecycle hooks and retry semantics are necessary for autonomous, long-running tasks? (See the illustrative sketch after this list.)
- State & memory: Which back-ends (object store, vector DB, Redis) and API shapes are suitable for short-term and long-term agent memory? How do we ensure consistency and garbage collection across thousands of agents?
- Fault Tolerance: Patterns for handling faults in agentic systems that consider not only execution failures but also degraded output quality.
- Observability & Policy Management: Define OpenTelemetry spans and policy CRDs (Kyverno/Gatekeeper) so SREs can trace, limit and audit autonomous behaviour.
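To make the runtime abstraction question concrete, the sketch below shows one way an agent could be expressed as Go CRD types. Everything here is a hypothetical illustration for discussion: the group/version, kind and field names (modelRef, tools, memory, lifecycle phases) are assumptions, not a proposed or agreed schema.

```go
// Hypothetical sketch only: group, kind and field names are illustrative
// assumptions for discussion, not an agreed-upon schema.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// AgentSpec captures what an operator might need to run a long-lived,
// autonomous agent as a first-class cloud-native object.
type AgentSpec struct {
	// ModelRef points to the model or inference endpoint the agent uses.
	ModelRef string `json:"modelRef"`
	// Tools lists the MCP tool servers the agent is allowed to call.
	Tools []ToolBinding `json:"tools,omitempty"`
	// Memory selects a short- or long-term memory back-end and retention,
	// tying into the state & memory questions above.
	Memory *MemorySpec `json:"memory,omitempty"`
	// MaxRetries bounds automatic re-execution of failed steps.
	MaxRetries int32 `json:"maxRetries,omitempty"`
}

// ToolBinding scopes an agent to a named tool server with explicit auth.
type ToolBinding struct {
	Name      string `json:"name"`
	Endpoint  string `json:"endpoint"`
	SecretRef string `json:"secretRef,omitempty"`
}

// MemorySpec is a placeholder for back-end selection and garbage collection.
type MemorySpec struct {
	Backend          string `json:"backend"` // e.g. "redis", "milvus", "s3"
	RetentionSeconds int64  `json:"retentionSeconds,omitempty"`
}

// AgentStatus exposes lifecycle phases so controllers and SREs can reason
// about autonomous, long-running tasks.
type AgentStatus struct {
	Phase      string             `json:"phase"` // e.g. Pending, Running, Waiting, Failed, Completed
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// Agent is the top-level object a controller could reconcile into Pods,
// gateway routes and memory resources.
type Agent struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   AgentSpec   `json:"spec,omitempty"`
	Status AgentStatus `json:"status,omitempty"`
}
```

Whether an agent warrants a dedicated CRD at all, rather than a plain Deployment plus conventions, is exactly the question the initiative leaves open.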
Why it matters to CNCF
- Next wave of workloads: Agentic AI shifts compute from monolithic LLM calls to dynamic swarms of small, interactive tasks. This stresses the very areas (scalability, resilience, observability) where Kubernetes and cloud-native projects excel.
- Avoid one-off silos: Vendors are already shipping proprietary agent platforms. A neutral CNCF framework guiding and normalising these different approaches can prevent fragmentation and foster portability, just as OCI normalised container images.
- Fills an enterprise adoption gap: Agentic workloads may require a purpose-built approach at the gateway layer because agent-aware auth, tenancy and traffic shaping are largely missing today. Providing a CNCF-blessed spec/blueprint could support standard ways of addressing this through cloud-native traffic management.
- Leverages Envoy heritage while staying protocol-neutral: A common spec lets Envoy-style extensions, Rust-based proxies, or service-mesh data paths compete while preserving interoperability.
- Attracts new contributors: Identifying gaps (e.g., agent memory APIs, MCP-K8s discovery) invites fresh projects to join the landscape and advances CNCF’s leadership in AI infrastructure.
Key technologies & projects involved
- Communication Protocols: Model Context Protocol (MCP), gRPC, CloudEvents
- Agent-to-Agent Gateway: Agent Gateway, Envoy-MCP filter POC, Gloo AI Gateway, A2A protocol
- Runtime coordination: Dapr Agents, Kagent
- Scheduling / scaling: Kubernetes Scheduler, KEDA, Kueue, Dynamic Resource Allocation (DRA)
- State / memory: Dapr state components, vector-DB operators (Chroma, Milvus), S3/GCS
- Eventing & workflows: Knative Eventing, Argo Workflows, Temporal
- Observability & Policy Management: OpenTelemetry, Kyverno/Gatekeeper, SPIFFE/SPIRE
Deliverable(s) or exit criteria
- Publish “Foundations for Distributed AI Agents” whitepaper (≤ 12 pp): Describes protocol, runtime, state, scheduling and safety patterns; maps research challenges.
- Produce reference architecture & pattern catalogue
- Standards & API proposals, including a draft enhancement for “MCP-for-Clusters” (auth, discovery, streaming) and a high-level sketch of an "Agent CRD" schema and lifecycle states for WG App-Delivery & SIG-Apps review.
- Gap analysis & incubation map identifying where new projects (e.g., AgentMemory API, AgentBench-CN) or SIG plugins are needed.
- Cross-WG alignment establishing formal liaisons with WG Serving (routing/benchmarks), Device-Management (GPU partitioning for agents), TAG Security (tool-scope policy) and SIG Autoscaling (agent-aware HPA) around agentic topics.
- Investigate an approach for a conformance/observability spec defining a minimal OpenTelemetry schema for agent spans and cost/energy labels (an illustrative span sketch follows this list).
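As a purely illustrative companion to the conformance/observability item above, the snippet below shows what emitting an agent tool-call span with cost/energy labels could look like. The attribute keys (agent.id, agent.tool, agent.cost.usd, agent.energy.wh) are assumptions for discussion, not existing OpenTelemetry semantic conventions.

```go
// Illustrative only: the attribute keys below are assumptions for
// discussion, not part of any agreed OpenTelemetry semantic convention.
package agentobs

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// TraceToolCall wraps a single agent tool invocation in a span so SREs can
// correlate autonomous actions with cost, energy and policy decisions.
func TraceToolCall(ctx context.Context, agentID, tool string, costUSD, energyWh float64, call func(context.Context) error) error {
	tracer := otel.Tracer("agent-runtime")
	ctx, span := tracer.Start(ctx, "agent.tool_call",
		trace.WithAttributes(
			attribute.String("agent.id", agentID), // hypothetical key
			attribute.String("agent.tool", tool),  // hypothetical key
		))
	defer span.End()

	err := call(ctx)
	if err != nil {
		span.RecordError(err)
	}

	// Hypothetical cost/energy labels a conformance spec might require.
	span.SetAttributes(
		attribute.Float64("agent.cost.usd", costUSD),
		attribute.Float64("agent.energy.wh", energyWh),
	)
	return err
}
```

A conformance check could then simply assert that every autonomous action produces such a span carrying the required labels.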
@caldeirav there are a number of deliverables -- perhaps this could be broken up into multiple Initiatives?
@riaankleinhans okay to move to vote
+1
I love this proposal: cloud-native agents! I think kagent is lacking a layer for agent orchestration, e.g. registering a new agent, unregistering an agent, agent auto-discovery via registration, routing requests to different agents, etc. I have a dirty PoC here: https://github.com/gyliu513/aichestra, and also a blog post: https://gyliu513.medium.com/building-an-intelligent-multi-agent-orchestration-system-with-langgraph-a2a-and-mcp-674efdf666f7
I have been doing quite a bit of work on patterns recently. What is really interesting is that many patterns are similar to those used when building microservice-based systems; however, quite often there are minor nuances that need modifications to existing software. Then there are things like prompt injection which, while similar to input validation in microservices, would require completely new software. I think it could be really interesting to look at how a pattern catalog could also highlight these links to the current CNCF landscape, or even highlight gaps in the landscape.
This initiative has been approved by the TOC and is ready to be worked on with the appropriate TAG and TOC liaison.
I like the initiative, Victor, great input. Does it make sense to discuss in the next TOC call how each deliverable should be handled (concurrently / sequentially), and the timelines for delivery?
One separate comment on the proposal, regarding technologies and projects: I didn't see WasmEdge mentioned; is it worth considering for inclusion?
@caldeirav following on from the [TOC Artificial Intelligence Initiatives Meeting] 2025-08-01 #1686 we discussed getting started with your first listed deliverable:
Protocol interoperability
To avoid the deliverable being open-ended, @raravena80 indicated that we should attempt to keep the scope of the content limited to a low number (~6) of pages.
As a next step, can we get the ball rolling with this deliverable, starting with a sharpened scope:
- When is the start date and end date?
- How many contributors are you looking for to support this?
- What is your target page count?
- Do you already have a Google Docs placeholder to begin task/section assignment for this deliverable?
Drive-by opinion: I think a lot of what's written here sounds sensible. My impression is indeed to keep the scope contained, and also to draw on experience from modeling efforts regardless of whether they happen in CNCF or not. I'm happy to participate as I can, though I likely can't commit to routine meetings.
Here are some notes. Basically, we could have a process to use Perplexity Pro or whatever to ensure we don't overfit to a narrow set of experiences.
One example is that there is another LF agentic system, and that isn't a problem; rather, it's a good way to swap notes and see if patterns apply outside one or two technical areas. I'm thinking of ACP here: https://agentcommunicationprotocol.dev/introduction/welcome
On the same note, there are landscape initiatives that have different views of agentic positioning, many of which are tied to cloud native. I'm thinking of the one from Ant Group: https://github.com/antgroup/llm-oss-landscape
Another is that I don't believe framework devs naturally congregate in specification working groups. You could say there is sometimes a tension here, and I think questions like how frameworks map, etc., require drawing some experts in, without any need to stick around on all the Zoom calls ;)
Finally, and on the same point, there are efforts inside CNCF to draw more developers forward. I think @salaboy would be a great mind to pick on this topic.
@codefromthecrypt, thanks for the mention. We are looking into this as the TAG Developer Experience. I do agree that we should be looking also into stuff like ACP. Your feedback and input is highly appreciated here, so keep it coming.
I would like to contribute
I would like to contribute
I would like to contribute. We are actively working on something in this area at Mirantis.
Same, I'm definitely interested in contributing and working on this.
Same, would like to contribute
Already mentioned by @codefromthecrypt and @salaboy above, I second (or third?) that ACP should be the first one to be looked at, and not just because LF already governs it. I would like to contribute, as we are also exploring this.
I also would like to contribute on this initiative.
We discussed a security related requirement that is specific to agentic in the last K8s AI Conformance WG community meeting. It would be great if the community can help identify any emerging standards to be considered for future versions of the AI Conformance Program.
I would like to contribute too.
@terrytangyuan , where can I find the meeting minutes for the last K8s AI Conformance WG community meeting please? I looked at https://www.cncf.io/training/certification/software-conformance/, https://github.com/kubernetes/community/tree/master/wg-ai-conformance, and https://groups.google.com/a/kubernetes.io/g/wg-ai-conformance, but didn't find it.
On the ACP vs A2A aspect, AFAICT there is no reason for us to be directly concerned, now that the lead of ACP is also on the A2A team. https://www.linkedin.com/feed/update/urn:li:activity:7367264720183062528/
This initiative will be presented and discussed at the first CNCF TAG Workloads Foundation Community Meeting.
Next: Wed, Sep 3, 2025 at 13:00 UTC (21:00 SGT / 15:00 CEST / 09:00 EDT). Join (Zoom): Link. Repeats: first Wednesday of each month at 13:00 UTC.
@pacoxu @joshhalley
@fcanogab See https://github.com/kubernetes/community/tree/master/wg-ai-conformance#meetings
Hi @caldeirav we would like to join this discussion and contribute. We have been working on related topics in https://github.com/kagenti/kagenti
The TAG Workloads Foundation is a good home for this initiative.
Same, would like to contribute
This could also be interesting for the Developer Experience TAG, as there is a runtime aspect to it but also architecture and developer tooling.