
feat: add Claude Messages API and OpenAI Responses API support

Open AlexCheema opened this issue 1 month ago • 3 comments

Motivation

Add support for Claude Messages API and OpenAI Responses API to allow users to interact with exo using these popular API formats. This enables broader compatibility with existing tooling and SDKs that expect these API formats.

Architecture

The API layer uses OpenAI Responses API as the canonical internal format, with Chat Completions and Claude Messages as adapters on top. The Responses API is the most featureful, making it the natural choice for the internal format.

Responses Request → [native] → InternalParams → Runner → TokenChunk → ResponsesResponse
Chat Completions → [adapter] → InternalParams → Runner → TokenChunk → ChatCompletionResponse  
Claude Messages  → [adapter] → InternalParams → Runner → TokenChunk → ClaudeMessagesResponse

All three endpoints now follow the same uniform pattern using adapters.
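The adapter layer described above can be sketched roughly as follows. This is a hedged illustration, not the actual exo code: the `InternalParams` dataclass and the `*_to_internal` helper names are hypothetical stand-ins for the real types in `src/exo/shared/types/` and the adapters in `src/exo/master/adapters/`.

```python
from dataclasses import dataclass

# Hypothetical simplified canonical type; the real internal types live
# in src/exo/shared/types/ and are richer than this.
@dataclass
class InternalParams:
    model: str
    prompt_messages: list[dict]
    max_tokens: int
    stream: bool = False

def chat_completions_to_internal(body: dict) -> InternalParams:
    """Adapter: Chat Completions request -> canonical internal params."""
    return InternalParams(
        model=body["model"],
        prompt_messages=body["messages"],
        max_tokens=body.get("max_tokens", 512),  # assumed default
        stream=body.get("stream", False),
    )

def claude_messages_to_internal(body: dict) -> InternalParams:
    """Adapter: Claude Messages request -> canonical internal params.
    The Claude API makes max_tokens required, so no default is applied."""
    return InternalParams(
        model=body["model"],
        prompt_messages=body["messages"],
        max_tokens=body["max_tokens"],
        stream=body.get("stream", False),
    )
```

Each incoming format only needs one such conversion function on the request side (and a mirror-image one on the response side), which is what keeps the three endpoints uniform.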

Changes

New Files

  • src/exo/shared/types/claude_api.py - Pydantic types for Claude Messages API
  • src/exo/shared/types/openai_responses.py - Pydantic types for OpenAI Responses API
  • src/exo/master/adapters/chat_completions.py - Chat Completions adapter (streaming/non-streaming)
  • src/exo/master/adapters/claude.py - Claude Messages adapter (streaming/non-streaming)
  • src/exo/master/adapters/responses.py - OpenAI Responses adapter (streaming/non-streaming)

Modified Files

  • src/exo/master/api.py - Refactored to use adapters uniformly for all endpoints

New Endpoints

  • POST /v1/messages - Claude Messages API (streaming and non-streaming)
  • POST /v1/responses - OpenAI Responses API (streaming and non-streaming)

Why It Works

All APIs are implemented as pure conversion adapters:

  1. Incoming requests are converted to internal ChatCompletionTaskParams
  2. The existing ChatCompletion command flow is used unchanged
  3. TokenChunk events are converted back to API-specific response formats via adapters

This approach leaves the core inference logic untouched while supporting different API conventions.
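Step 3 above, converting generated text back into each API's response shape, might look roughly like this. These response dicts are trimmed sketches of the public wire formats; the field subsets shown are assumptions, and the helper names are hypothetical, not exo's actual adapter functions.

```python
def to_chat_completion_response(model: str, text: str) -> dict:
    """Sketch: wrap generated text in a Chat Completions response."""
    return {
        "object": "chat.completion",
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }

def to_claude_messages_response(model: str, text: str) -> dict:
    """Sketch: wrap the same text in a Claude Messages response."""
    return {
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
    }
```

Because both functions consume the same `(model, text)` pair, the inference path never needs to know which API the request arrived on.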

Streaming Formats

  • Chat Completions: Uses data: {...}\n\n with [DONE] terminator
  • Claude: Uses event types message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop
  • OpenAI Responses: Uses event types response.created, response.in_progress, response.output_item.added, response.content_part.added, response.output_text.delta, response.output_text.done, response.content_part.done, response.output_item.done, response.completed

Test Plan

Manual Testing

Hardware: MacBook Pro M3 Max

Non-streaming tests:

# Chat Completions API
curl -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 20}'

# Claude Messages API
curl -X POST http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}]}'

# OpenAI Responses API
curl -X POST http://localhost:52415/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "input": "Hello", "max_output_tokens": 20}'

Streaming tests:

# Chat Completions API (streaming)
curl -N -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "stream": true, "max_tokens": 20}'

# Claude Messages API (streaming)
curl -N -X POST http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}], "stream": true}'

# OpenAI Responses API (streaming)
curl -N -X POST http://localhost:52415/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "input": "Hello", "stream": true, "max_output_tokens": 20}'

All endpoints tested successfully with proper response formats and streaming events.

Automated Testing

  • 74 tests in src/exo/master/tests/ all pass
  • Type checker (basedpyright) passes with 0 errors

🤖 Generated with Claude Code

AlexCheema avatar Jan 16 '26 12:01 AlexCheema

do we really want to funnel everything through ChatCompletions?

Evanev7 avatar Jan 16 '26 12:01 Evanev7

do we really want to funnel everything through ChatCompletions?

Maybe not. Would the alternative be to introduce a new intermediary type and have adapters for OpenAI completions, OpenAI responses and Claude APIs?

AlexCheema avatar Jan 16 '26 15:01 AlexCheema

After thinking about it a bit, the responses API is probably the thing to centralise on since it's the most featureful

Evanev7 avatar Jan 16 '26 15:01 Evanev7

After thinking about it a bit, the responses API is probably the thing to centralise on since it's the most featureful

Agreed. I've changed it now so we use Responses as the canonical API / types.

AlexCheema avatar Jan 19 '26 11:01 AlexCheema

Just some thoughts in its current state: this feels like a lot of boilerplate to add features like #1181. We could perhaps reduce that boilerplate with dedicated from() and to() methods, rather than reimplementing the streams themselves. I'd like to think on this some more before we ship this; something like leo's old proxy server may be more apt in the short term.

Evanev7 avatar Jan 19 '26 21:01 Evanev7

Have you had a chance to think about this? As far as I can tell, that's the bottleneck. @Evanev7

AlexCheema avatar Jan 22 '26 13:01 AlexCheema