[Initiative]: Cloud-Native Foundations for Distributed Agentic Systems

Open caldeirav opened this issue 6 months ago • 25 comments

Name

Cloud-Native Foundations for Distributed Agentic Systems

Short description

Formalise principles, reference patterns and ecosystem strategy for running massively distributed systems of collaborating AI agents on Kubernetes and other cloud-native substrates.

Responsible group

TOC

Does the initiative belong to a subproject?

Yes

Subproject name

TOC Artificial Intelligence Initiatives

Primary contact

Vincent Caldeira ([email protected])

Additional contacts

Ricardo Aravena ([email protected])

Initiative description

The purpose of this initiative is to have a CNCF AI WG sub-stream that formalises principles, reference patterns and ecosystem gaps for running massively distributed systems of collaborating AI agents on Kubernetes and other cloud-native substrates.

Scope Definition

The group will focus on architectural guidance and identifying needs for standards definition, not on defining a new runtime model or deep-diving into framework & implementation.

  • Protocol interoperability: Can the community converge on Model Context Protocol (MCP) as the default agent-tool and agent-agent wire spec, and how should it be integrated into cloud-native systems? What auth, discovery and streaming extensions are required for cluster and multi-cluster use?
  • Agentic Gateway / Data Plane: Traditional REST-centric proxies can’t handle MCP & A2A session fan-out, bidirectional SSE, protocol negotiation or per-agent tenancy. How do we specify a gateway pattern that is session-aware, JSON-RPC–aware, secure and resource-efficient? What minimum behaviours (multiplexing, retries, streaming, auth, tracing) are required for conformance?
  • Runtime abstraction: How should an Agent be modelled in cloud-native terms (Pod? CRD? Sidecar-less process)? What lifecycle hooks and retry semantics are necessary for autonomous, long-running tasks?
  • State & memory: Which back-ends (object store, vector DB, Redis) and API shapes are suitable for short-term and long-term agent memory? How do we ensure consistency and garbage-collection across thousands of agents?
  • Fault Tolerance: Patterns for handling fault-tolerance in agentic systems based not solely on execution but also on output quality.
  • Observability & Policy Management: Define OpenTelemetry spans and policy CRDs (Kyverno/Gatekeeper) so SREs can trace, limit and audit autonomous behaviour.
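As a rough illustration of what "session-aware, JSON-RPC-aware" could mean for the gateway question above, here is a minimal Python sketch. MCP messages are JSON-RPC 2.0, so the gateway can classify traffic before routing it. The `SessionRouter` class, the backend names and the hash-based pinning are hypothetical assumptions for illustration only; nothing here is a proposed conformance behaviour, and `session_id` stands for whatever session identifier the transport exposes.

```python
import hashlib
from typing import Optional


def classify_jsonrpc(message: dict) -> str:
    """Classify a JSON-RPC 2.0 message as 'request', 'notification' or 'response'."""
    if message.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    if "method" in message:
        # A request carries an id; a notification does not expect a reply.
        return "request" if "id" in message else "notification"
    if "result" in message or "error" in message:
        return "response"
    raise ValueError("message has neither a method nor a result/error")


class SessionRouter:
    """Hypothetical router that pins every session to a single backend."""

    def __init__(self, backends: list[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._pinned: dict[str, str] = {}

    def route(self, session_id: Optional[str], message: dict) -> str:
        classify_jsonrpc(message)  # reject malformed traffic before routing
        if session_id is None:
            # No session yet (assumption: send it to the first backend).
            return self._backends[0]
        if session_id not in self._pinned:
            # Hash the session id so the first choice is deterministic.
            digest = hashlib.sha256(session_id.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % len(self._backends)
            self._pinned[session_id] = self._backends[index]
        return self._pinned[session_id]


# Usage: two messages from the same session land on the same backend.
router = SessionRouter(["agent-a:8080", "agent-b:8080"])
call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "search"}}
first = router.route("session-123", call)
second = router.route("session-123", call)
```

A plain REST proxy would balance each request independently; pinning by session is the smallest change that keeps a long-lived agent conversation on one backend. Multiplexing, retries, auth and tracing are deliberately left out of this sketch.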

Why it matters to CNCF

  • Next wave of workloads: Agentic AI shifts compute from monolithic LLM calls to dynamic swarms of small, interactive tasks. This stresses the very areas (scalability, resilience, observability) where Kubernetes and cloud-native projects excel.
  • Avoid one-off silos: Vendors are already shipping proprietary agent platforms. A neutral CNCF framework guiding and normalising these different approaches can prevent fragmentation and foster portability, just as OCI normalised container images.
  • Fills an enterprise adoption gap: A purpose-built approach for agentic workloads at the gateway layer may be required because auth, tenancy and traffic shaping are missing. Providing a CNCF-blessed spec/blueprint could support standard ways of addressing this through cloud-native traffic management.
  • Leverages Envoy heritage while staying protocol-neutral: A common spec lets Envoy-style extensions, Rust-based proxies, or service-mesh datapaths compete while preserving interoperability.
  • Attracts new contributors: Identifying gaps (e.g., agent memory APIs, MCP-K8s discovery) invites fresh projects to join the landscape and advances CNCF’s leadership in AI infrastructure.

Key technologies & projects involved

  • Communication Protocols: Model Context Protocol (MCP), gRPC, CloudEvents
  • Agent-to-Agent Gateway: Agent Gateway, Envoy-MCP filter POC, Gloo AI Gateway, A2A protocol
  • Runtime coordination: Dapr Agents, Kagent
  • Scheduling / scaling: Kubernetes Scheduler, KEDA, Kueue, Dynamic Resource Allocation (DRA)
  • State / memory: Dapr state components, vector-DB operators (Chroma, Milvus), S3/GCS
  • Eventing & workflows: Knative Eventing, Argo Workflows, Temporal
  • Observability & Policy Management: OpenTelemetry, Kyverno/Gatekeeper, SPIFFE/SPIRE

Deliverable(s) or exit criteria

  1. Publish “Foundations for Distributed AI Agents” whitepaper (≤ 12 pp): Describes protocol, runtime, state, scheduling and safety patterns; maps research challenges.
  2. Produce reference architecture & pattern catalogue.
  3. Standards & API proposals including draft enhancement for “MCP-for-Clusters” (auth, discovery, streaming) and high-level sketch of "Agent CRD" schema and lifecycle states for WG App-Delivery & SIG-Apps review.
  4. Gap analysis & incubation map identifying where new projects (e.g., AgentMemory API, AgentBench-CN) or SIG plugins are needed.
  5. Cross-WG alignment providing formal liaisons with WG Serving (routing/benchmarks), Device-Management (GPU partitioning for agents), TAG Security (tool-scope policy), SIG Autoscaling (agent-aware HPA) around agentic topics.
  6. Look into approach for a conformance/observability spec defining a minimal OpenTelemetry schema for agent spans and cost/energy labels.
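
One way to picture the "Agent CRD" lifecycle states mentioned in deliverable 3 is a small state machine. The sketch below is in Python for brevity; the phase names and allowed transitions are illustrative assumptions, not a proposed schema, and the retry edge from `Failed` back to `Pending` is just one possible policy.

```python
from enum import Enum


class AgentPhase(str, Enum):
    """Hypothetical lifecycle phases for an agent resource."""
    PENDING = "Pending"
    RUNNING = "Running"
    WAITING = "Waiting"      # blocked on a tool call or on another agent
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


# Allowed transitions (assumption: only failed agents may be retried).
ALLOWED = {
    AgentPhase.PENDING: {AgentPhase.RUNNING, AgentPhase.FAILED},
    AgentPhase.RUNNING: {AgentPhase.WAITING, AgentPhase.SUCCEEDED, AgentPhase.FAILED},
    AgentPhase.WAITING: {AgentPhase.RUNNING, AgentPhase.FAILED},
    AgentPhase.SUCCEEDED: set(),
    AgentPhase.FAILED: {AgentPhase.PENDING},
}


def transition(current: AgentPhase, target: AgentPhase) -> AgentPhase:
    """Return the new phase, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


# Usage: a long-running task that pauses on a tool call and then completes.
phase = AgentPhase.PENDING
for step in (AgentPhase.RUNNING, AgentPhase.WAITING, AgentPhase.RUNNING, AgentPhase.SUCCEEDED):
    phase = transition(phase, step)
```

Making the transitions explicit is what would let a controller, or a reviewer in WG App-Delivery or SIG-Apps, reason about retries and terminal states for autonomous tasks.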

caldeirav avatar Jun 06 '25 03:06 caldeirav

@caldeirav there are a number of deliverables -- perhaps this could be broken up into multiple Initiatives?

@riaankleinhans okay to move to vote

angellk avatar Jun 24 '25 04:06 angellk

@caldeirav there are a number of deliverables -- perhaps this could be broken up into multiple Initiatives?

+1

raravena80 avatar Jun 24 '25 13:06 raravena80

I love this proposal, cloud native agent! I think kagent is lacking a layer for agent orchestration: registering a new agent, unregistering an agent, agent auto-discovery via registration, routing requests to different agents, etc. I have a dirty PoC here https://github.com/gyliu513/aichestra, and also a blog post: https://gyliu513.medium.com/building-an-intelligent-multi-agent-orchestration-system-with-langgraph-a2a-and-mcp-674efdf666f7

gyliu513 avatar Jul 01 '25 14:07 gyliu513

I have been doing quite a bit of work on patterns recently. What is really interesting is that many patterns are similar to those used when building microservice-based systems; however, quite often there are minor nuances that need modifications to existing software. Then there are things like Prompt Injection which, while similar to the input validation used in microservices, would require completely new software. I think it could be really interesting to look at how a pattern catalog could also highlight these links to the current CNCF landscape, or even highlight gaps in the landscape.

nicholasjackson avatar Jul 01 '25 14:07 nicholasjackson

This initiative has been approved by the TOC and is ready to be worked on with the appropriate TAG and TOC liaison.

riaankleinhans avatar Jul 07 '25 15:07 riaankleinhans

I like the initiative, Vincent, great input. Does it make sense to discuss in the next TOC call how each deliverable should be handled (concurrently / sequentially), and the timelines for delivery?

One separate comment on the proposal: regarding technologies and projects, I didn't see WasmEdge mentioned. Is it worth considering for inclusion?

joshhalley avatar Jul 07 '25 15:07 joshhalley

@caldeirav following on from the [TOC Artificial Intelligence Initiatives Meeting] 2025-08-01 #1686 we discussed getting started with your first listed deliverable:

Protocol interoperability

To avoid the deliverable being open-ended, @raravena80 indicated that we should attempt to keep the scope of the content limited to a low number of pages (~6).

As a next step, can we get the ball rolling with this deliverable, starting with a sharpened scope:

  • When is the start date and end date?
  • How many contributors are you looking for to support this?
  • What is your target page count?
  • Do you already have a Google Docs placeholder to begin task/section assignment for this deliverable?

joshhalley avatar Aug 06 '25 03:08 joshhalley

Drive-by opinion: I think a lot of what's written here sounds sensible. My impression is indeed to keep the scope contained, and also to draw on experiences from modeling efforts regardless of whether they happen in CNCF or not. I'm happy to participate as I can, though likely can't commit to routine meetings.

Here are some notes. Basically, we could have a process to use Perplexity Pro or whatever to ensure we don't overfit to a narrow set of experiences.

One example is that there is another LF agentic system and that isn't a problem rather a good way to swap notes and see if patterns apply outside one or two technical areas. I'm thinking of ACP here https://agentcommunicationprotocol.dev/introduction/welcome

On the same note there are landscape initiatives that have different views of agentic positioning many of which are cloud native tied. I'm thinking of the one from ant group https://github.com/antgroup/llm-oss-landscape

Another is that I don't believe framework devs naturally congregate in specification working groups. You could say sometimes there is a tension here, and I think working out how frameworks map, etc., requires drawing some experts in, without any need to stick around on all the zoom calls ;)

Finally, and on the same point, there are efforts inside CNCF to draw more developers forward. I think @salaboy would be a great mind to pick on this topic.

codefromthecrypt avatar Aug 07 '25 08:08 codefromthecrypt

@codefromthecrypt, thanks for the mention. We are looking into this as the TAG Developer Experience. I do agree that we should be looking also into stuff like ACP. Your feedback and input is highly appreciated here, so keep it coming.

salaboy avatar Aug 07 '25 09:08 salaboy

I would like to contribute

billakantisandeep avatar Aug 07 '25 10:08 billakantisandeep

I would like to contribute

amitpiplapure avatar Aug 07 '25 11:08 amitpiplapure

I would like to contribute. We are actively working on something in this area at Mirantis.

randybias avatar Aug 07 '25 12:08 randybias

Same, I'm definitely interested in contributing and working on this.

justinlevi avatar Aug 07 '25 12:08 justinlevi

Same, would like to contribute

ivc avatar Aug 07 '25 13:08 ivc

Already mentioned by @codefromthecrypt and @salaboy above, I second (or third?) that ACP should be the first one to be looked at, and not just because LF already governs it. I would like to contribute, as we are also exploring this.

Waterfox83 avatar Aug 09 '25 06:08 Waterfox83

I also would like to contribute on this initiative.

payamohajeri avatar Aug 13 '25 20:08 payamohajeri

We discussed a security-related requirement that is specific to agentic systems in the last K8s AI Conformance WG community meeting. It would be great if the community can help identify any emerging standards to be considered for future versions of the AI Conformance Program.

terrytangyuan avatar Aug 19 '25 14:08 terrytangyuan

I would like to contribute too.

@terrytangyuan , where can I find the meeting minutes for the last K8s AI Conformance WG community meeting please? I looked at https://www.cncf.io/training/certification/software-conformance/, https://github.com/kubernetes/community/tree/master/wg-ai-conformance, and https://groups.google.com/a/kubernetes.io/g/wg-ai-conformance, but didn't find it.

fcanogab avatar Aug 29 '25 08:08 fcanogab

on the ACP vs A2A aspect, AFAICT there is no reason for us to be directly concerned now that the lead of ACP is on the A2A team also. https://www.linkedin.com/feed/update/urn:li:activity:7367264720183062528/

codefromthecrypt avatar Aug 30 '25 01:08 codefromthecrypt

This initiative will be presented and discussed at the first CNCF TAG Workloads Foundation Community Meeting.

Next: Wed, Sep 3, 2025 at 13:00 UTC (21:00 SGT / 15:00 CEST / 09:00 EDT)
Join (Zoom): Link
Repeats: First Wednesday of each month at 13:00 UTC

@pacoxu @joshhalley

caldeirav avatar Sep 02 '25 07:09 caldeirav

@fcanogab See https://github.com/kubernetes/community/tree/master/wg-ai-conformance#meetings

terrytangyuan avatar Sep 03 '25 02:09 terrytangyuan

Hi @caldeirav we would like to join this discussion and contribute. We have been working on related topics in https://github.com/kagenti/kagenti

pdettori avatar Sep 19 '25 14:09 pdettori

The TAG Workloads Foundation is a good home for this initiative.

angellk avatar Sep 22 '25 19:09 angellk

Same, would like to contribute

ChaoyiHuang avatar Dec 09 '25 00:12 ChaoyiHuang

This can also be interesting for the dev experience TAG as there is a runtime aspect to it but also architecture and developer tooling

salaboy avatar Dec 09 '25 10:12 salaboy