feat: add Claude Messages API and OpenAI Responses API support
## Motivation
Add support for Claude Messages API and OpenAI Responses API to allow users to interact with exo using these popular API formats. This enables broader compatibility with existing tooling and SDKs that expect these API formats.
## Architecture

The API layer uses the OpenAI Responses API as the canonical internal format, with Chat Completions and Claude Messages implemented as adapters on top. The Responses API is the most featureful of the three, making it the natural choice for the internal format.
Responses Request → [native] → InternalParams → Runner → TokenChunk → ResponsesResponse
Chat Completions → [adapter] → InternalParams → Runner → TokenChunk → ChatCompletionResponse
Claude Messages → [adapter] → InternalParams → Runner → TokenChunk → ClaudeMessagesResponse
All three endpoints now follow the same uniform pattern using adapters.
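To illustrate the adapter pattern, here is a minimal sketch of inbound conversion. The type and field names (`InternalParams`, `prompt_messages`) are hypothetical stand-ins, not the actual types in exo; the point is only that each API format maps into one shared internal shape:

```python
from dataclasses import dataclass


# Hypothetical internal parameter type; the real internal types in exo
# may differ in fields and naming.
@dataclass
class InternalParams:
    model: str
    prompt_messages: list
    max_tokens: int


def from_chat_completions(body: dict) -> InternalParams:
    """Adapter: Chat Completions request body -> internal params."""
    return InternalParams(
        model=body["model"],
        prompt_messages=body["messages"],
        max_tokens=body.get("max_tokens", 512),
    )


def from_claude_messages(body: dict) -> InternalParams:
    """Adapter: Claude Messages request body -> internal params.

    Claude makes max_tokens required and carries the system prompt in a
    top-level "system" field rather than as a message.
    """
    messages = list(body["messages"])
    if "system" in body:
        messages.insert(0, {"role": "system", "content": body["system"]})
    return InternalParams(
        model=body["model"],
        prompt_messages=messages,
        max_tokens=body["max_tokens"],
    )


chat_req = {
    "model": "llama-3.2-1b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 20,
}
claude_req = {
    "model": "llama-3.2-1b",
    "max_tokens": 50,
    "system": "Be brief.",
    "messages": [{"role": "user", "content": "Hello"}],
}

print(from_chat_completions(chat_req).max_tokens)                    # 20
print(from_claude_messages(claude_req).prompt_messages[0]["role"])   # system
```

Because both adapters produce the same internal type, everything downstream of the endpoint handler is format-agnostic.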
## Changes

### New Files

- `src/exo/shared/types/claude_api.py` - Pydantic types for the Claude Messages API
- `src/exo/shared/types/openai_responses.py` - Pydantic types for the OpenAI Responses API
- `src/exo/master/adapters/chat_completions.py` - Chat Completions adapter (streaming and non-streaming)
- `src/exo/master/adapters/claude.py` - Claude Messages adapter (streaming and non-streaming)
- `src/exo/master/adapters/responses.py` - OpenAI Responses adapter (streaming and non-streaming)
### Modified Files

- `src/exo/master/api.py` - Refactored to use adapters uniformly for all endpoints
### New Endpoints

- `POST /v1/messages` - Claude Messages API (streaming and non-streaming)
- `POST /v1/responses` - OpenAI Responses API (streaming and non-streaming)
## Why It Works

All three APIs are implemented as pure conversion adapters:

- Incoming requests are converted to internal `ChatCompletionTaskParams`
- The existing `ChatCompletion` command flow is used unchanged
- `TokenChunk` events are converted back to API-specific response formats via adapters

This approach supports the different API conventions without any changes to the core inference logic.
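The outbound half of the adapters can be sketched the same way. `TokenChunk` here is a hypothetical stand-in with only a `text` field (the real event type in exo carries more); the two functions show how one stream of token events collapses into differently shaped responses:

```python
from dataclasses import dataclass


# Hypothetical token event; the real TokenChunk in exo carries more fields.
@dataclass
class TokenChunk:
    text: str
    is_final: bool = False


def to_chat_completion(chunks: list, model: str) -> dict:
    """Collapse token events into a Chat Completions-shaped response."""
    text = "".join(c.text for c in chunks)
    return {
        "object": "chat.completion",
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }


def to_claude_message(chunks: list, model: str) -> dict:
    """Collapse the same token events into a Claude Messages-shaped response."""
    text = "".join(c.text for c in chunks)
    return {
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
    }


chunks = [TokenChunk("Hel"), TokenChunk("lo"), TokenChunk("!", is_final=True)]
print(to_chat_completion(chunks, "llama-3.2-1b")["choices"][0]["message"]["content"])  # Hello!
print(to_claude_message(chunks, "llama-3.2-1b")["content"][0]["text"])                 # Hello!
```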
## Streaming Formats

- **Chat Completions**: Uses `data: {...}\n\n` frames with a `data: [DONE]` terminator
- **Claude**: Uses named event types `message_start`, `content_block_start`, `content_block_delta`, `content_block_stop`, `message_delta`, `message_stop`
- **OpenAI Responses**: Uses named event types `response.created`, `response.in_progress`, `response.output_item.added`, `response.content_part.added`, `response.output_text.delta`, `response.output_text.done`, `response.content_part.done`, `response.output_item.done`, `response.completed`
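The wire-level difference between the three formats boils down to SSE frames with or without a named `event:` line. A minimal sketch (the payload shapes shown are illustrative, not exo's exact serializations):

```python
import json
from typing import Optional


def sse_chat(data: Optional[dict]) -> str:
    """Chat Completions style: bare `data:` lines, `[DONE]` sentinel at the end."""
    if data is None:
        return "data: [DONE]\n\n"
    return f"data: {json.dumps(data)}\n\n"


def sse_named(event: str, data: dict) -> str:
    """Claude / Responses style: a named `event:` line precedes each payload."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


# Illustrative payloads for a single streamed token.
print(sse_chat({"choices": [{"delta": {"content": "Hi"}}]}))
print(sse_named("content_block_delta",
                {"type": "content_block_delta", "index": 0,
                 "delta": {"type": "text_delta", "text": "Hi"}}))
print(sse_named("response.output_text.delta",
                {"type": "response.output_text.delta", "delta": "Hi"}))
print(sse_chat(None), end="")
```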
## Test Plan

### Manual Testing
Hardware: MacBook Pro M3 Max
Non-streaming tests:
```bash
# Chat Completions API
curl -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 20}'

# Claude Messages API
curl -X POST http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}]}'

# OpenAI Responses API
curl -X POST http://localhost:52415/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "input": "Hello", "max_output_tokens": 20}'
```
Streaming tests:
```bash
# Chat Completions API (streaming)
curl -N -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "stream": true, "max_tokens": 20}'

# Claude Messages API (streaming)
curl -N -X POST http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}], "stream": true}'

# OpenAI Responses API (streaming)
curl -N -X POST http://localhost:52415/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "input": "Hello", "stream": true, "max_output_tokens": 20}'
```
All endpoints tested successfully with proper response formats and streaming events.
### Automated Testing

- All 74 tests in `src/exo/master/tests/` pass
- Type checker (basedpyright) passes with 0 errors
🤖 Generated with Claude Code
do we really want to funnel everything through ChatCompletions?
> do we really want to funnel everything through ChatCompletions?

Maybe not. Would the alternative be to introduce a new intermediary type and have adapters for the OpenAI Completions, OpenAI Responses, and Claude APIs?
After thinking about it a bit, the Responses API is probably the thing to centralise on, since it's the most featureful.

> After thinking about it a bit, the Responses API is probably the thing to centralise on, since it's the most featureful.

Agreed. I've changed it now so we use Responses as the canonical API / types.
Just some thoughts on its current state: this feels like a lot of boilerplate for adding features like #1181. We could perhaps reduce that boilerplate with some dedicated from() and to() conversion methods, rather than reimplementing the streams themselves. I'd like to think on this some more before we ship it; perhaps something like leo's old proxy server would be more apt in the short term.
Have you had a chance to think about this? As far as I can tell, it's the bottleneck. @Evanev7