feat: add Claude Messages API and OpenAI Responses API support
## Motivation
Add support for Claude Messages API and OpenAI Responses API to allow users to interact with exo using these popular API formats. This enables broader compatibility with existing tooling and SDKs that expect these API formats.
## Architecture

The API layer uses the OpenAI Responses API as the canonical internal format, with Chat Completions and Claude Messages implemented as adapters on top. The Responses API is the most featureful of the three, making it the natural choice for the internal format.
Responses Request → [native] → InternalParams → Runner → TokenChunk → ResponsesResponse
Chat Completions → [adapter] → InternalParams → Runner → TokenChunk → ChatCompletionResponse
Claude Messages → [adapter] → InternalParams → Runner → TokenChunk → ClaudeMessagesResponse
All three endpoints now follow the same uniform pattern using adapters.
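To illustrate the adapter pattern, here is a minimal sketch of inbound conversion. The type and field names (`InternalParams`, `prompt_messages`) are hypothetical stand-ins, not the actual types in exo; the point is only that each API format maps into one shared internal shape:

```python
from dataclasses import dataclass


# Hypothetical internal parameter type; the real internal types in exo
# may differ in fields and naming.
@dataclass
class InternalParams:
    model: str
    prompt_messages: list
    max_tokens: int


def from_chat_completions(body: dict) -> InternalParams:
    """Adapter: Chat Completions request body -> internal params."""
    return InternalParams(
        model=body["model"],
        prompt_messages=body["messages"],
        max_tokens=body.get("max_tokens", 512),
    )


def from_claude_messages(body: dict) -> InternalParams:
    """Adapter: Claude Messages request body -> internal params.

    Claude makes max_tokens required and carries the system prompt in a
    top-level "system" field rather than as a message.
    """
    messages = list(body["messages"])
    if "system" in body:
        messages.insert(0, {"role": "system", "content": body["system"]})
    return InternalParams(
        model=body["model"],
        prompt_messages=messages,
        max_tokens=body["max_tokens"],
    )


chat_req = {
    "model": "llama-3.2-1b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 20,
}
claude_req = {
    "model": "llama-3.2-1b",
    "max_tokens": 50,
    "system": "Be brief.",
    "messages": [{"role": "user", "content": "Hello"}],
}

print(from_chat_completions(chat_req).max_tokens)                    # 20
print(from_claude_messages(claude_req).prompt_messages[0]["role"])   # system
```

Because both adapters produce the same internal type, everything downstream of the endpoint handler is format-agnostic.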
## Changes

### New Files

- `src/exo/shared/types/claude_api.py` - Pydantic types for the Claude Messages API
- `src/exo/shared/types/openai_responses.py` - Pydantic types for the OpenAI Responses API
- `src/exo/master/adapters/chat_completions.py` - Chat Completions adapter (streaming and non-streaming)
- `src/exo/master/adapters/claude.py` - Claude Messages adapter (streaming and non-streaming)
- `src/exo/master/adapters/responses.py` - OpenAI Responses adapter (streaming and non-streaming)
### Modified Files

- `src/exo/master/api.py` - Refactored to use adapters uniformly for all endpoints
### New Endpoints

- `POST /v1/messages` - Claude Messages API (streaming and non-streaming)
- `POST /v1/responses` - OpenAI Responses API (streaming and non-streaming)
## Why It Works

All three APIs are implemented as pure conversion adapters:

- Incoming requests are converted to internal `ChatCompletionTaskParams`
- The existing `ChatCompletion` command flow is used unchanged
- `TokenChunk` events are converted back to API-specific response formats via adapters

This approach supports the different API conventions without any changes to the core inference logic.
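The outbound half of the adapters can be sketched the same way. `TokenChunk` here is a hypothetical stand-in with only a `text` field (the real event type in exo carries more); the two functions show how one stream of token events collapses into differently shaped responses:

```python
from dataclasses import dataclass


# Hypothetical token event; the real TokenChunk in exo carries more fields.
@dataclass
class TokenChunk:
    text: str
    is_final: bool = False


def to_chat_completion(chunks: list, model: str) -> dict:
    """Collapse token events into a Chat Completions-shaped response."""
    text = "".join(c.text for c in chunks)
    return {
        "object": "chat.completion",
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }


def to_claude_message(chunks: list, model: str) -> dict:
    """Collapse the same token events into a Claude Messages-shaped response."""
    text = "".join(c.text for c in chunks)
    return {
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
    }


chunks = [TokenChunk("Hel"), TokenChunk("lo"), TokenChunk("!", is_final=True)]
print(to_chat_completion(chunks, "llama-3.2-1b")["choices"][0]["message"]["content"])  # Hello!
print(to_claude_message(chunks, "llama-3.2-1b")["content"][0]["text"])                 # Hello!
```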
## Streaming Formats

- **Chat Completions**: Uses `data: {...}\n\n` frames with a `data: [DONE]` terminator
- **Claude**: Uses named event types `message_start`, `content_block_start`, `content_block_delta`, `content_block_stop`, `message_delta`, `message_stop`
- **OpenAI Responses**: Uses named event types `response.created`, `response.in_progress`, `response.output_item.added`, `response.content_part.added`, `response.output_text.delta`, `response.output_text.done`, `response.content_part.done`, `response.output_item.done`, `response.completed`
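The wire-level difference between the three formats boils down to SSE frames with or without a named `event:` line. A minimal sketch (the payload shapes shown are illustrative, not exo's exact serializations):

```python
import json
from typing import Optional


def sse_chat(data: Optional[dict]) -> str:
    """Chat Completions style: bare `data:` lines, `[DONE]` sentinel at the end."""
    if data is None:
        return "data: [DONE]\n\n"
    return f"data: {json.dumps(data)}\n\n"


def sse_named(event: str, data: dict) -> str:
    """Claude / Responses style: a named `event:` line precedes each payload."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


# Illustrative payloads for a single streamed token.
print(sse_chat({"choices": [{"delta": {"content": "Hi"}}]}))
print(sse_named("content_block_delta",
                {"type": "content_block_delta", "index": 0,
                 "delta": {"type": "text_delta", "text": "Hi"}}))
print(sse_named("response.output_text.delta",
                {"type": "response.output_text.delta", "delta": "Hi"}))
print(sse_chat(None), end="")
```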
## Test Plan

### Manual Testing
Hardware: MacBook Pro M3 Max
Non-streaming tests:
```bash
# Chat Completions API
curl -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 20}'

# Claude Messages API
curl -X POST http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}]}'

# OpenAI Responses API
curl -X POST http://localhost:52415/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "input": "Hello", "max_output_tokens": 20}'
```
Streaming tests:
```bash
# Chat Completions API (streaming)
curl -N -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "stream": true, "max_tokens": 20}'

# Claude Messages API (streaming)
curl -N -X POST http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}], "stream": true}'

# OpenAI Responses API (streaming)
curl -N -X POST http://localhost:52415/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "input": "Hello", "stream": true, "max_output_tokens": 20}'
```
All endpoints tested successfully with proper response formats and streaming events.
### Automated Testing

- All 74 tests in `src/exo/master/tests/` pass
- Type checker (basedpyright) passes with 0 errors
🤖 Generated with Claude Code
do we really want to funnel everything through ChatCompletions?
> do we really want to funnel everything through ChatCompletions?

Maybe not. Would the alternative be to introduce a new intermediary type and have adapters for the OpenAI Completions, OpenAI Responses, and Claude APIs?
After thinking about it a bit, the Responses API is probably the thing to centralise on, since it's the most featureful.

> After thinking about it a bit, the Responses API is probably the thing to centralise on, since it's the most featureful.

Agreed. I've changed it now so we use Responses as the canonical API / types.
Just some thoughts on its current state: this feels like a lot of boilerplate for adding features like #1181. We could perhaps reduce that boilerplate with some dedicated from() and to() conversion methods, rather than reimplementing the streams themselves. I'd like to think on this some more before we ship it; perhaps something like leo's old proxy server would be more apt in the short term.
Have you had a chance to think about this? As far as I can tell, it's the bottleneck. @Evanev7