LFX mentorship (2025/term3): Support the Responses API in Llama Nexus
Project Title
Support the Responses API in Llama Nexus
Description
The llama nexus project is an API proxy that provides OpenAI-compatible, unified API endpoints for multiple downstream API servers, including LlamaEdge API servers running open-source LLMs.
https://github.com/LlamaEdge/llama-nexus
Currently, Llama Nexus supports the stateless /chat/completions API endpoint for LLMs. We would like to expand this to also support OpenAI's stateful /responses API.
https://platform.openai.com/docs/api-reference/responses
https://platform.openai.com/docs/guides/responses-vs-chat-completions
In particular, we aim to implement support for
- MCP
- Code interpreter
- Web search
- File search
- Browser use (optional)
Expected Outcome
New features for the Llama Nexus proxy server.
Recommended skills
- Rust
- OpenAI API
- MCP Rust SDK
Pre-tests
1. Fork the llama nexus project. If you wish to make the forked repo private, add @juntao as a collaborator.
2. Implement the simplest support for a /responses API -- that is, construct the complete system prompt and chat history inside llama nexus for every user request. You could use any database to store the history and context for each chat session (a rough sketch follows this list).
3. Provide docs and a demo to show it works.
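For illustration only, the core of item 2 could look roughly like the sketch below. The struct and function names and the in-memory store are placeholders, not a prescribed design; any real database is fine.

```rust
// Illustrative pre-test sketch, not part of llama-nexus: rebuild the full
// prompt (system prompt + stored history + new input) for every request.
use std::collections::HashMap;

#[derive(Clone)]
struct StoredMessage {
    role: String, // "system" | "user" | "assistant"
    content: String,
}

// Any database is acceptable for the pre-test; an in-memory map stands in here.
#[derive(Default)]
struct SessionStore {
    sessions: HashMap<String, Vec<StoredMessage>>,
}

impl SessionStore {
    // Construct the complete chat history for this session, append the new
    // user turn, and return everything to be forwarded downstream.
    fn build_prompt(&mut self, session_id: &str, user_input: &str) -> Vec<StoredMessage> {
        let history = self.sessions.entry(session_id.to_string()).or_insert_with(|| {
            vec![StoredMessage {
                role: "system".into(),
                content: "You are a helpful assistant.".into(),
            }]
        });
        history.push(StoredMessage {
            role: "user".into(),
            content: user_input.into(),
        });
        history.clone() // forwarded to a downstream /chat/completions backend
    }
}

fn main() {
    let mut store = SessionStore::default();
    let prompt = store.build_prompt("resp_demo", "Hello!");
    assert_eq!(prompt.len(), 2); // system prompt + first user turn
}
```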
Mentor(s)
Michael Yuan, @juntao [email protected]
Sam Liu, @apepkuss [email protected]
Apply Link
https://mentorship.lfx.linuxfoundation.org/project/31044818-fe9d-478d-b740-5d4c8a4c49c2
Appendix
No response
Hello @juntao, I am interested in working on this project as part of the LFX mentorship programme. I am currently interning at Open Science Labs, where I co-maintain ArxLang and work on IRx, a compiler that translates ASTx to LLVM IR and involves type-system extensions and transformation pipelines.
Hi @juntao @apepkuss,
I'm interested in this project. Could you share some resources for learning what exactly llama nexus and LlamaEdge are? A better understanding would help me grasp the project expectations and outcomes.
Hi @juntao @apepkuss, from the description above I understand that we have to build a /responses endpoint in llama nexus for using MCP, web search, file search, etc., which cannot be done through the /chat/completions API. Should I create a demo API structure and send you my strategy for guidance? I know MCP and agentic AI and would like to use that knowledge to help with this project. Should I also contribute to this repo to stand out, or just concentrate on my demo API structure/proposal?
Hi @yuvi-mittal @ashish-dalal and @alokdangre, please see the updated "pretest" section. Thank you.
Hi @juntao,
I have completed the pretest. I have expanded Llama Nexus to support the /responses API request from OpenAI.
Below is a screenshot of the feature working. I will soon upload a demo video of the /responses feature in action.
Hi @juntao, I want to ask about the deliverables: we have to expand llama-nexus support for /responses to include a code interpreter. As per my understanding, are you referring to cardea-github ( https://github.com/cardea-mcp/cardea-mcp-servers/tree/main/cardea-github ) and github pr review ( https://github.com/flows-network/github-pr-review ) as the code interpreter? For file search, are you referring to cardea-tidb ( https://github.com/cardea-mcp/cardea-mcp-servers/tree/main/cardea-tidb ), cardea-agentic-search ( https://github.com/cardea-mcp/cardea-mcp-servers/tree/main/cardea-agentic-search ), and other MCP servers used for searching a knowledge base? And for web search, are you referring to cardea-web-search ( https://github.com/cardea-mcp/cardea-mcp-servers/tree/main/cardea-web-search )? Could you also share some information about the browser-use item? Am I going in the right direction?
Hi @juntao
I have completed the pre-test and I'd like to share how I implemented the stateful /responses endpoint: https://drive.google.com/file/d/1b7VN97B4p9N0QpF4t-qhoAhDNvo0ka4K/view?usp=sharing
Overview
I built a stateful conversation management endpoint, /responses, in a forked llama-nexus repo. Conversation history is persisted in a database, so the client does not need to resend the entire conversation with every request as it does with the /chat/completions endpoint. I have also ensured compatibility with the downstream LLM backend servers.
Directory Structure
The implementation follows a clean modular architecture under the sub-directory src/responses/:
src/responses/
├── mod.rs # for module exports and public api
├── models.rs # this has data structures for requests/responses/sessions
├── db.rs # I used a SQLite database operation along with thread safety
└── handlers.rs # this has HTTP request handlers and backend integration
File Breakdown
src/responses/models.rs
- OpenAI-compatible data structures such as ResponseRequest, ResponseReply, OutputItem, ContentItem, etc.
- Session and SessionMessage structs for internal conversation management; I used a HashMap for message storage, and the module also takes care of reconstructing the conversation thread (a simplified sketch follows this list)
- primitive token estimation that assumes 4 characters ≈ 1 token
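In simplified form, the core shapes look roughly like this (the exact fields in my code differ slightly):

```rust
// Simplified sketch of the models.rs shapes; not the literal structs.
use std::collections::HashMap;

pub struct SessionMessage {
    pub role: String, // "system" | "user" | "assistant"
    pub content: String,
}

pub struct Session {
    pub response_id: String,
    pub previous_response_id: Option<String>,
    // Messages keyed by insertion index so the conversation thread can be
    // reconstructed in order.
    pub messages: HashMap<usize, SessionMessage>,
}

impl Session {
    // Rebuild the conversation thread in insertion order.
    pub fn ordered_messages(&self) -> Vec<&SessionMessage> {
        let mut indices: Vec<usize> = self.messages.keys().copied().collect();
        indices.sort_unstable();
        indices.iter().map(|i| &self.messages[i]).collect()
    }
}

// Primitive token estimation: roughly 4 characters per token.
pub fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}
```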
src/responses/db.rs
- Mutex<Connection> for thread-safe concurrent access
- JSON serialization of conversation sessions for persistence
- previous sessions can be looked up by response_id for conversation chaining
- complete CRUD-based lifecycle management (condensed sketch below)
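A condensed sketch of this approach (the real db.rs has more methods and proper error propagation):

```rust
// Condensed db.rs sketch: a Mutex-wrapped SQLite connection storing each
// session as a JSON blob keyed by response_id.
use rusqlite::{params, Connection, OptionalExtension};
use std::sync::Mutex;

struct SessionDb {
    conn: Mutex<Connection>,
}

impl SessionDb {
    fn open(path: &str) -> rusqlite::Result<Self> {
        let conn = Connection::open(path)?;
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions (
                 response_id TEXT PRIMARY KEY,
                 data        TEXT NOT NULL
             )",
            [],
        )?;
        Ok(Self { conn: Mutex::new(conn) })
    }

    // Create or update the JSON-serialized session.
    fn save(&self, response_id: &str, session_json: &str) -> rusqlite::Result<()> {
        self.conn.lock().unwrap().execute(
            "INSERT OR REPLACE INTO sessions (response_id, data) VALUES (?1, ?2)",
            params![response_id, session_json],
        )?;
        Ok(())
    }

    // Look up a previous session by response_id for conversation chaining.
    fn load(&self, response_id: &str) -> rusqlite::Result<Option<String>> {
        self.conn
            .lock()
            .unwrap()
            .query_row(
                "SELECT data FROM sessions WHERE response_id = ?1",
                params![response_id],
                |row| row.get(0),
            )
            .optional()
    }
}

fn main() -> rusqlite::Result<()> {
    let db = SessionDb::open("sessions.db")?;
    db.save("resp_demo", r#"{"messages":[]}"#)?;
    assert!(db.load("resp_demo")?.is_some());
    Ok(())
}
```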
src/responses/handlers.rs
- the main endpoint logic, responses_handler, which processes stateful conversation requests
- reuses the existing llama nexus server selection and routing to stay aligned with the existing structure and modules
- converts the conversation history to the OpenAI chat completion format
- error handling
src/responses/mod.rs
- exports only the public API items that are needed
- encapsulation that keeps internal details private
API Flow & Architecture
Request Processing Flow
- load the existing conversation or create a new session
- add the user's input to the session's conversation history
- convert the session history to chat completion messages
- route to a backend using the existing llama nexus server selection
- append the LLM response to the session and persist it in the database
- return the response in OpenAI's /responses format
Key Architectural Decisions
Hybrid Architecture
- the new /responses endpoint works alongside the existing /chat/completions endpoint
- no disruption to the existing stateless functionality
Backend Agnostic Design
- works with any registered llama nexus backend
- reuses the existing server selection, load balancing, and failover
Smart State Management
- schema-less storage via JSON serialization for flexibility
- thread safety using a mutex
- response chaining enabled by response_id
- an indexed HashMap preserves conversation order
LLM Backend Integration
Backend Compatibility: it works with any LLM server registered in llama nexus, for example:
- local models: Ollama, llamafile for privacy-focused use cases (I used a local Ollama model myself)
- cloud APIs: OpenAI, Anthropic, OpenRouter for high-performance requirements
- custom deployments
Request Routing Process
// convert conversation -> chat completion format -> existing routing
let chat_request = ChatCompletionRequest {
    model: Some(model.clone()),
    messages: converted_conversation,
    user: Some("responses-api".to_string()),
    stream: Some(false),
    ..Default::default()
};
// route through the existing llama nexus server selection
let chat_result = call_chat_backend(&state.main_state, chat_request).await?;
Compatibility with OpenAI
API Matching: this implementation produces a JSON structure matching OpenAI's Responses API:
{
  "id": "resp_67ccd2bed1ec8190b14f964abc0542670bb6a6b452d3795b",
  "object": "response",
  "created_at": 1741476542,
  "status": "completed",
  "model": "llama3.1-8b",
  "output": [...],
  "usage": {
    "input_tokens": 36,
    "output_tokens": 87,
    "total_tokens": 123
  },
  "previous_response_id": "resp_previous_response_id"
}
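For reference, the serde shape behind this JSON is roughly the following (the real models.rs has more fields; output items are simplified to raw JSON values here):

```rust
// Approximate serde model of the /responses reply shown above.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Usage {
    input_tokens: u32,
    output_tokens: u32,
    total_tokens: u32,
}

#[derive(Serialize, Deserialize)]
struct ResponseReply {
    id: String,
    object: String, // always "response"
    created_at: u64,
    status: String, // e.g. "completed"
    model: String,
    output: Vec<serde_json::Value>, // output items, simplified here
    usage: Usage,
    #[serde(skip_serializing_if = "Option::is_none")]
    previous_response_id: Option<String>,
}
```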
Technical Implementation Highlights
Database Schema
- flexible JSON blob storage for schema flexibility
- primary-key indexing on response_id for faster lookups
- conversation search across sessions by response_id
Error Handling
- covers database errors, backend failures, and parsing issues
- HTTP status codes with descriptive error messages (roughly sketched below)
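Roughly how these errors are surfaced over HTTP; the error variants below are illustrative, not the exact enum in my code:

```rust
// Illustrative error-to-response mapping for the /responses handlers.
use axum::{
    http::StatusCode,
    response::{IntoResponse, Response},
    Json,
};
use serde_json::json;

enum ResponsesError {
    Database(String),
    Backend(String),
    Parse(String),
}

impl IntoResponse for ResponsesError {
    fn into_response(self) -> Response {
        let (status, message) = match self {
            ResponsesError::Database(e) => (StatusCode::INTERNAL_SERVER_ERROR, e),
            ResponsesError::Backend(e) => (StatusCode::BAD_GATEWAY, e),
            ResponsesError::Parse(e) => (StatusCode::BAD_REQUEST, e),
        };
        (status, Json(json!({ "error": { "message": message } }))).into_response()
    }
}
```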
Integration & Deployment
Main Application Integration: the endpoint is registered in main.rs alongside the existing routes:
let responses_router = Router::new()
    .route("/v1/responses", post(responses::responses_handler))
    .route("/health", get(responses::health_handler))
    .with_state(responses_state);
Database Initialization: a SQLite database (sessions.db) is created automatically on startup with the proper schema.
State Management: a dedicated AppState for /responses that also shares the main application state for backend routing.
Future Enhancements
This code can be further refined in the following ways:
- MCP tool integration: code interpreter, web search, and file search capabilities (as mentioned in the project outcomes/expectations)
- performance optimization: conversation compression, LRU caching (a few ideas on this are discussed in my proposal)
- advanced features: conversation branching (enabled through previous_response_id), context window management, export/import
- scalability: PostgreSQL migration
Current progress
- core /v1/responses endpoint implemented
- SQLite conversation persistence tested and working
- OpenAI API compatibility
- integrated with the existing llama nexus infrastructure and tested on a local Ollama phi3:mini model
- thread-safe concurrent operations implemented
- error handling in place
This implementation bridges stateless chat completions with stateful conversation management while maintaining llama nexus's backend-agnostic flexibility.
Looking forward to your feedback!
I have added you as a collaborator, @juntao, to my private forked repo of llama-nexus. You can find the code there. Please let me know any feedback you might have for me to improve upon: https://github.com/ashish-dalal/llama-nexus-lfx
Hi @juntao
I have added code interpreter functionality as well. Currently it includes the following:
- a Docker-based sandbox to execute the code (currently only Python is supported, but it can be extended to other languages by including the appropriate Docker images)
- the LLM writes code in ```python code blocks, which is detected using a regex (sketched below) and executed in the Docker sandbox
- the entire conversation, along with tool use, is tracked within the conversation thread
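The detection step boils down to a regex pass over the LLM reply, roughly like this (simplified from the actual code_interpreter module):

```rust
// Extract ```python blocks from the LLM reply before sandbox execution.
use regex::Regex;

fn extract_python_blocks(llm_reply: &str) -> Vec<String> {
    // (?s) lets `.` match newlines so multi-line code bodies are captured.
    let re = Regex::new(r"(?s)```python\s*(.*?)```").unwrap();
    re.captures_iter(llm_reply)
        .map(|cap| cap[1].trim().to_string())
        .collect()
}

fn main() {
    let reply = "Sure:\n```python\nprint(sum(range(10)))\n```";
    assert_eq!(extract_python_blocks(reply), vec!["print(sum(range(10)))"]);
}
```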
Directory Structure
src/
├── responses/
│ ├── mod.rs
│ ├── models.rs
│ ├── db.rs
│ ├── handlers.rs
│ └── code_interpreter/
│ ├── mod.rs
│ ├── python_session.rs # Manages Docker containers per response
│ ├── executor.rs # Code execution + result capture
│ └── resource_manager.rs # Config-driven resource limits
├── config.toml # Resource limits configuration
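The execution step in executor.rs essentially shells out to the docker CLI along these lines; the image, resource limits, and timeout in the real code come from config.toml, so the values below are only examples:

```rust
// Run extracted Python in a short-lived, network-isolated container.
use std::process::Command;

fn run_python_in_sandbox(code: &str) -> std::io::Result<String> {
    let output = Command::new("docker")
        .args([
            "run", "--rm",
            "--network", "none", // no network access inside the sandbox
            "--memory", "256m",  // example limits; the real values come from config.toml
            "--cpus", "0.5",
            "python:3.11-slim",
            "python", "-c", code,
        ])
        .output()?;

    // Return stdout on success, stderr otherwise, so the tool result can be
    // appended to the conversation thread either way.
    if output.status.success() {
        Ok(String::from_utf8_lossy(&output.stdout).into_owned())
    } else {
        Ok(String::from_utf8_lossy(&output.stderr).into_owned())
    }
}

fn main() -> std::io::Result<()> {
    let result = run_python_in_sandbox("print(2 ** 10)")?;
    println!("{result}"); // expected: 1024
    Ok(())
}
```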
I'll share more comprehensive documentation and a video demonstration of it by this evening. Please share your feedback.
Here are the screenshots:
- API request:
- code that is identified for execution:
Hi @juntao
I have made a video demonstration of the code interpreter. It can detect Python code enclosed in ```python code blocks. Demo video: https://drive.google.com/file/d/1YhPrzxTN_W0Yp1CFt1BbrdBuJs_4Yok3/view?usp=sharing
Below I have attached some screenshots,
The LLM is given a task in a prompt and is instructed to use Python to determine the answer. It then invokes the code interpreter, whose output is fed back to the LLM. All of these exchanges are recorded in the conversation thread.
Prompt-1: calculate prime numbers between 1 and 50 using python.
Prompt-2: calculate the first 10 fibonacci numbers using python
Prompt-3: calculate exp(5) upto 5 decimal places using python
Execution Flow
Below is a flowchart showing the back-and-forth execution flow between the LLM and the code interpreter:
flowchart TD
A[User Request] --> B[LLM Response]
B --> C{Code Detected?}
C -->|No| D[Final Response]
C -->|\```python| E[Docker Execute]
E --> F[Tool Result]
F --> G[LLM + Tool Context]
G --> H{More Code Needed?}
H -->|Yes| E
H -->|No| D
D --> I[(Database)]
style A fill:#6200ea,color:#ffffff
style B fill:#00bcd4,color:#000000
style E fill:#ff5722,color:#ffffff
style F fill:#4caf50,color:#ffffff
style G fill:#ff9800,color:#000000
style D fill:#9c27b0,color:#ffffff
style I fill:#424242,color:#ffffff
Hi @juntao,
I have completed the pretest and expanded llama-nexus to support the /responses API. Here’s a video demonstration of it working: Video demonstration
Currently, the /responses endpoint supports persistent conversation history with full OpenAI-compatible response structure and integrates seamlessly with registered downstream servers.
I’m now working on implementing browser-use functionality to extend tool capabilities within the /responses flow. I’ll share updates and demos as I make progress.
Moved to #4374