
[FEATURE] Multi-agent pattern: `Arena`

Open · stefanoamorelli opened this issue 1 month ago

Problem Statement

AFAIK we don't have a pattern for: "I'm not sure which approach is best at performing a task beforehand, so try several at the same time and let a judge decide."

The Agent-as-a-Judge paradigm (which builds on the more common LLM-as-a-Judge) would fit naturally here.

Proposed Solution

from strands.multiagent import Arena
from strands import Agent

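# Note: run_code and verify_facts stand in for whatever tools the judge
# uses to check the candidates' claims; agent_a, agent_b, and agent_c are
# pre-configured competing agents (placeholders in this sketch).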
judge_agent = Agent(
    system_prompt="""You are a judge. Evaluate the solutions provided and pick the best one.
    Use your tools to verify claims, run code, check facts.
    Return your verdict with reasoning.""",
    tools=[run_code, verify_facts],
)

result = Arena(
    agents=[agent_a, agent_b, agent_c],
    judge=judge_agent,
).run("Design an API for user authentication")

The agents run in parallel and the judge evaluates their outputs, making the winner "emerge".
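
A rough sketch of what could sit behind Arena.run, assuming only the existing Agent call syntax `agent(prompt)` and a thread pool for the parallel fan-out (nothing below exists in the SDK yet, and the judge prompt format is made up):

from concurrent.futures import ThreadPoolExecutor

class Arena:
    """Hypothetical implementation sketch; not part of the SDK."""

    def __init__(self, agents, judge):
        self.agents = agents
        self.judge = judge

    def run(self, task):
        # Fan out: every candidate agent gets the same task, in parallel.
        with ThreadPoolExecutor(max_workers=len(self.agents)) as pool:
            results = list(pool.map(lambda agent: agent(task), self.agents))

        # Fan in: hand the judge the task plus every candidate's answer;
        # the judge can use its own tools to verify and pick a winner.
        submissions = "\n\n".join(
            f"Solution {i + 1}:\n{result}" for i, result in enumerate(results)
        )
        return self.judge(f"Task: {task}\n\nCandidate solutions:\n\n{submissions}")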

Use Case

I have a few different agent/multi-agent configurations and I want to know which one works best. I'm trying different prompts, comparing models, testing whether adding a tool actually helps, etc. I don't know which one will perform better on this task beforehand, so I want to run them all and pick the one that performs best.

The "Judge" agent can verify the outputs and pick the one it considers best. Because it's an agent, it can use tools (and all the agent functionality) to validate results rather than just comparing text.

Alternative Solutions

No response

Additional Context

Agent-as-a-Judge: Evaluate Agents with Agents
When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs

stefanoamorelli · Dec 07 '25 19:12

I've been working on this use case internally and would be happy to contribute.

stefanoamorelli · Dec 07 '25 19:12

@stefanoamorelli Love to collaborate on this

dk67604 · Dec 20 '25 05:12