LFX mentorship (2025/term3): Support the Responses API in Llama Nexus
Project Title
Support the Responses API in Llama Nexus
Description
The llama nexus project is an API proxy that provides OpenAI-compatible, unified API endpoints for multiple downstream API servers, including LlamaEdge API servers running open-source LLMs.
https://github.com/LlamaEdge/llama-nexus
Currently, Llama Nexus supports the stateless /chat/completions API endpoint for LLMs. We would like to expand this to also support OpenAI's stateful /responses API.
https://platform.openai.com/docs/api-reference/responses
https://platform.openai.com/docs/guides/responses-vs-chat-completions
In particular, we aim to implement support for
- MCP
- Code interpreter
- Web search
- File search
- Browser use (optional)
Expected Outcome
New features for the Llama Nexus proxy server.
Recommended skills
- Rust
- OpenAI API
- MCP Rust SDK
Pre-tests
1. Fork the llama nexus project. If you wish to make the forked repo private, add @juntao as a collaborator.
2. Implement the simplest support for a /responses API -- that is, construct the complete system prompt and chat history inside llama nexus for every user request. You could use any database to store the history and context for each chat session (a rough sketch follows this list).
3. Provide docs and a demo to show it works.
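For illustration only, the core of item 2 could look roughly like the sketch below. The struct and function names and the in-memory store are placeholders, not a prescribed design; any real database is fine.

```rust
// Illustrative pre-test sketch, not part of llama-nexus: rebuild the full
// prompt (system prompt + stored history + new input) for every request.
use std::collections::HashMap;

#[derive(Clone)]
struct StoredMessage {
    role: String, // "system" | "user" | "assistant"
    content: String,
}

// Any database is acceptable for the pre-test; an in-memory map stands in here.
#[derive(Default)]
struct SessionStore {
    sessions: HashMap<String, Vec<StoredMessage>>,
}

impl SessionStore {
    // Construct the complete chat history for this session, append the new
    // user turn, and return everything to be forwarded downstream.
    fn build_prompt(&mut self, session_id: &str, user_input: &str) -> Vec<StoredMessage> {
        let history = self.sessions.entry(session_id.to_string()).or_insert_with(|| {
            vec![StoredMessage {
                role: "system".into(),
                content: "You are a helpful assistant.".into(),
            }]
        });
        history.push(StoredMessage {
            role: "user".into(),
            content: user_input.into(),
        });
        history.clone() // forwarded to a downstream /chat/completions backend
    }
}

fn main() {
    let mut store = SessionStore::default();
    let prompt = store.build_prompt("resp_demo", "Hello!");
    assert_eq!(prompt.len(), 2); // system prompt + first user turn
}
```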
Mentor(s)
Michael Yuan, @juntao [email protected]
Sam Liu, @apepkuss [email protected]
Apply Link
https://mentorship.lfx.linuxfoundation.org/project/31044818-fe9d-478d-b740-5d4c8a4c49c2
Appendix
No response
Hello @juntao, I am interested in working on this project as part of the LFX mentorship programme. I am currently interning at Open Science Labs, where I co-maintain ArxLang and work on IRx, a compiler that translates ASTx to LLVM IR and involves type-system extensions and transformation pipelines.
Hi @juntao @apepkuss,
I'm interested in this project. Could you share some resources for learning what exactly llama nexus and LlamaEdge are? A better understanding would help me grasp the project expectations and outcomes.
Hi @juntao @apepkuss, from the description above I understand that we have to build a /responses endpoint in llama nexus for using MCP, web search, file search, etc., which cannot be done through the /chat/completions API. Should I create a demo API structure and send you my strategy for guidance? I know MCP and agentic AI and would like to use that knowledge to help with this project. Should I also contribute to this repo to stand out, or just concentrate on my demo API structure/proposal?
Hi @yuvi-mittal @ashish-dalal and @alokdangre, please see the updated "pretest" section. Thank you.
Hi @juntao,
I have completed the pretest. I have expanded Llama Nexus to support the /responses API request from OpenAI.
Below is a screenshot of the feature working. I will soon upload a demo video of the /responses feature in action.
Hi @juntao, I want to ask about the deliverables: we have to expand llama-nexus support for /responses to include a code interpreter. As per my understanding, are you referring to cardea-github ( https://github.com/cardea-mcp/cardea-mcp-servers/tree/main/cardea-github ) and github pr review ( https://github.com/flows-network/github-pr-review ) as the code interpreter? For file search, are you referring to cardea-tidb ( https://github.com/cardea-mcp/cardea-mcp-servers/tree/main/cardea-tidb ), cardea-agentic-search ( https://github.com/cardea-mcp/cardea-mcp-servers/tree/main/cardea-agentic-search ), and other MCP servers used for searching a knowledge base? And for web search, are you referring to cardea-web-search ( https://github.com/cardea-mcp/cardea-mcp-servers/tree/main/cardea-web-search )? Could you also share some information about the browser-use item? Am I going in the right direction?
Hi @juntao
I have completed the pre-test and I'd like to share how I implemented the stateful /responses endpoint: https://drive.google.com/file/d/1b7VN97B4p9N0QpF4t-qhoAhDNvo0ka4K/view?usp=sharing
Overview
I built a stateful conversation management endpoint, /responses, in a forked llama-nexus repo. Conversation history is persisted in a database, so the client does not need to resend the entire conversation with every request as it does with the /chat/completions endpoint. I have also ensured compatibility with the downstream LLM backend servers.
Directory Structure
The implementation follows a clean modular architecture under the sub-directory src/responses/:
src/responses/
├── mod.rs # for module exports and public api
├── models.rs # this has data structures for requests/responses/sessions
├── db.rs # I used a SQLite database operation along with thread safety
└── handlers.rs # this has HTTP request handlers and backend integration
File Breakdown
src/responses/models.rs
- OpenAI-compatible data structures such as ResponseRequest, ResponseReply, OutputItem, ContentItem, etc.
- Session and SessionMessage structs for internal conversation management; I used a HashMap for message storage, and the module also takes care of reconstructing the conversation thread (a simplified sketch follows this list)
- primitive token estimation that assumes 4 characters ≈ 1 token
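In simplified form, the core shapes look roughly like this (the exact fields in my code differ slightly):

```rust
// Simplified sketch of the models.rs shapes; not the literal structs.
use std::collections::HashMap;

pub struct SessionMessage {
    pub role: String, // "system" | "user" | "assistant"
    pub content: String,
}

pub struct Session {
    pub response_id: String,
    pub previous_response_id: Option<String>,
    // Messages keyed by insertion index so the conversation thread can be
    // reconstructed in order.
    pub messages: HashMap<usize, SessionMessage>,
}

impl Session {
    // Rebuild the conversation thread in insertion order.
    pub fn ordered_messages(&self) -> Vec<&SessionMessage> {
        let mut indices: Vec<usize> = self.messages.keys().copied().collect();
        indices.sort_unstable();
        indices.iter().map(|i| &self.messages[i]).collect()
    }
}

// Primitive token estimation: roughly 4 characters per token.
pub fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}
```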
src/responses/db.rs
- Mutex<Connection> for thread-safe concurrent access
- JSON serialization of conversation sessions for persistence
- previous sessions can be looked up by response_id for conversation chaining
- complete CRUD-based lifecycle management (condensed sketch below)
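A condensed sketch of this approach (the real db.rs has more methods and proper error propagation):

```rust
// Condensed db.rs sketch: a Mutex-wrapped SQLite connection storing each
// session as a JSON blob keyed by response_id.
use rusqlite::{params, Connection, OptionalExtension};
use std::sync::Mutex;

struct SessionDb {
    conn: Mutex<Connection>,
}

impl SessionDb {
    fn open(path: &str) -> rusqlite::Result<Self> {
        let conn = Connection::open(path)?;
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions (
                 response_id TEXT PRIMARY KEY,
                 data        TEXT NOT NULL
             )",
            [],
        )?;
        Ok(Self { conn: Mutex::new(conn) })
    }

    // Create or update the JSON-serialized session.
    fn save(&self, response_id: &str, session_json: &str) -> rusqlite::Result<()> {
        self.conn.lock().unwrap().execute(
            "INSERT OR REPLACE INTO sessions (response_id, data) VALUES (?1, ?2)",
            params![response_id, session_json],
        )?;
        Ok(())
    }

    // Look up a previous session by response_id for conversation chaining.
    fn load(&self, response_id: &str) -> rusqlite::Result<Option<String>> {
        self.conn
            .lock()
            .unwrap()
            .query_row(
                "SELECT data FROM sessions WHERE response_id = ?1",
                params![response_id],
                |row| row.get(0),
            )
            .optional()
    }
}

fn main() -> rusqlite::Result<()> {
    let db = SessionDb::open("sessions.db")?;
    db.save("resp_demo", r#"{"messages":[]}"#)?;
    assert!(db.load("resp_demo")?.is_some());
    Ok(())
}
```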
src/responses/handlers.rs
- the main endpoint logic, responses_handler, which processes stateful conversation requests
- reuses the existing llama nexus server selection and routing to stay aligned with the existing structure and modules
- converts the conversation history to the OpenAI chat completion format
- error handling
src/responses/mod.rs
- exports only the public API items that are needed
- encapsulation that keeps internal details private
API Flow & Architecture
Request Processing Flow
- load the existing conversation or create a new session
- add the user's input to the session's conversation history
- convert the session history to chat completion messages
- route to a backend using the existing llama nexus server selection
- append the LLM response to the session and persist it in the database
- return the response in OpenAI's /responses format
Key Architectural Decisions
Hybrid Architecture
- the new /responses endpoint works alongside the existing /chat/completions endpoint
- no disruption to the existing stateless functionality
Backend Agnostic Design
- works with any registered llama nexus backend
- reuses the existing server selection, load balancing, and failover
Smart State Management
- schema-less storage via JSON serialization for flexibility
- thread safety using a mutex
- response chaining enabled by response_id
- an indexed HashMap preserves conversation order
LLM Backend Integration
Backend Compatibility: it works with any LLM server registered in llama nexus, for example:
- local models: Ollama, llamafile for privacy-focused use cases (I used a local Ollama model myself)
- cloud APIs: OpenAI, Anthropic, OpenRouter for high-performance requirements
- custom deployments
Request Routing Process
// convert conversation -> chat completion format -> existing routing
let chat_request = ChatCompletionRequest {
    model: Some(model.clone()),
    messages: converted_conversation,
    user: Some("responses-api".to_string()),
    stream: Some(false),
    ..Default::default()
};
// route through the existing llama nexus server selection
let chat_result = call_chat_backend(&state.main_state, chat_request).await?;
Compatibility with OpenAI
API Matching: this implementation produces a JSON structure matching OpenAI's Responses API:
{
  "id": "resp_67ccd2bed1ec8190b14f964abc0542670bb6a6b452d3795b",
  "object": "response",
  "created_at": 1741476542,
  "status": "completed",
  "model": "llama3.1-8b",
  "output": [...],
  "usage": {
    "input_tokens": 36,
    "output_tokens": 87,
    "total_tokens": 123
  },
  "previous_response_id": "resp_previous_response_id"
}
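For reference, the serde shape behind this JSON is roughly the following (the real models.rs has more fields; output items are simplified to raw JSON values here):

```rust
// Approximate serde model of the /responses reply shown above.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Usage {
    input_tokens: u32,
    output_tokens: u32,
    total_tokens: u32,
}

#[derive(Serialize, Deserialize)]
struct ResponseReply {
    id: String,
    object: String, // always "response"
    created_at: u64,
    status: String, // e.g. "completed"
    model: String,
    output: Vec<serde_json::Value>, // output items, simplified here
    usage: Usage,
    #[serde(skip_serializing_if = "Option::is_none")]
    previous_response_id: Option<String>,
}
```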
Technical Implementation Highlights
Database Schema
- flexible JSON blob storage for schema flexibility
- primary-key indexing on response_id for faster lookups
- conversation search across sessions by response_id
Error Handling
- covers database errors, backend failures, and parsing issues
- HTTP status codes with descriptive error messages (roughly sketched below)
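Roughly how these errors are surfaced over HTTP; the error variants below are illustrative, not the exact enum in my code:

```rust
// Illustrative error-to-response mapping for the /responses handlers.
use axum::{
    http::StatusCode,
    response::{IntoResponse, Response},
    Json,
};
use serde_json::json;

enum ResponsesError {
    Database(String),
    Backend(String),
    Parse(String),
}

impl IntoResponse for ResponsesError {
    fn into_response(self) -> Response {
        let (status, message) = match self {
            ResponsesError::Database(e) => (StatusCode::INTERNAL_SERVER_ERROR, e),
            ResponsesError::Backend(e) => (StatusCode::BAD_GATEWAY, e),
            ResponsesError::Parse(e) => (StatusCode::BAD_REQUEST, e),
        };
        (status, Json(json!({ "error": { "message": message } }))).into_response()
    }
}
```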
Integration & Deployment
Main Application Integration: the endpoint is registered in main.rs alongside the existing routes:
let responses_router = Router::new()
    .route("/v1/responses", post(responses::responses_handler))
    .route("/health", get(responses::health_handler))
    .with_state(responses_state);
Database Initialization: a SQLite database (sessions.db) is created automatically on startup with the proper schema.
State Management: a dedicated AppState for /responses that also shares the main application state for backend routing.
Future Enhancements
This code can be further refined in the following ways:
- MCP tool integration: code interpreter, web search, and file search capabilities (as mentioned in the project outcomes/expectations)
- performance optimization: conversation compression, LRU caching (a few ideas on this are discussed in my proposal)
- advanced features: conversation branching (enabled through previous_response_id), context window management, export/import
- scalability: PostgreSQL migration
Current progress
- core /v1/responses endpoint implemented
- SQLite conversation persistence tested and working
- OpenAI API compatibility
- integrated with the existing llama nexus infrastructure and tested on a local Ollama phi3:mini model
- thread-safe concurrent operations implemented
- error handling in place
This implementation bridges stateless chat completions with stateful conversation management while maintaining llama nexus's backend-agnostic flexibility.
Looking forward to your feedback!
I have added you as a collaborator, @juntao, to my private forked repo of llama-nexus. You can find the code there. Please let me know any feedback you might have for me to improve upon: https://github.com/ashish-dalal/llama-nexus-lfx
Hi @juntao
I have added code interpreter functionality as well. Currently it includes the following:
- a Docker-based sandbox to execute the code (currently only Python is supported, but it can be extended to other languages by including the appropriate Docker images)
- the LLM writes code in ```python code blocks, which is detected using a regex (sketched below) and executed in the Docker sandbox
- the entire conversation, along with tool use, is tracked within the conversation thread
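The detection step boils down to a regex pass over the LLM reply, roughly like this (simplified from the actual code_interpreter module):

```rust
// Extract ```python blocks from the LLM reply before sandbox execution.
use regex::Regex;

fn extract_python_blocks(llm_reply: &str) -> Vec<String> {
    // (?s) lets `.` match newlines so multi-line code bodies are captured.
    let re = Regex::new(r"(?s)```python\s*(.*?)```").unwrap();
    re.captures_iter(llm_reply)
        .map(|cap| cap[1].trim().to_string())
        .collect()
}

fn main() {
    let reply = "Sure:\n```python\nprint(sum(range(10)))\n```";
    assert_eq!(extract_python_blocks(reply), vec!["print(sum(range(10)))"]);
}
```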
Directory Structure
src/
├── responses/
│ ├── mod.rs
│ ├── models.rs
│ ├── db.rs
│ ├── handlers.rs
│ └── code_interpreter/
│ ├── mod.rs
│ ├── python_session.rs # Manages Docker containers per response
│ ├── executor.rs # Code execution + result capture
│ └── resource_manager.rs # Config-driven resource limits
├── config.toml # Resource limits configuration
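The execution step in executor.rs essentially shells out to the docker CLI along these lines; the image, resource limits, and timeout in the real code come from config.toml, so the values below are only examples:

```rust
// Run extracted Python in a short-lived, network-isolated container.
use std::process::Command;

fn run_python_in_sandbox(code: &str) -> std::io::Result<String> {
    let output = Command::new("docker")
        .args([
            "run", "--rm",
            "--network", "none", // no network access inside the sandbox
            "--memory", "256m",  // example limits; the real values come from config.toml
            "--cpus", "0.5",
            "python:3.11-slim",
            "python", "-c", code,
        ])
        .output()?;

    // Return stdout on success, stderr otherwise, so the tool result can be
    // appended to the conversation thread either way.
    if output.status.success() {
        Ok(String::from_utf8_lossy(&output.stdout).into_owned())
    } else {
        Ok(String::from_utf8_lossy(&output.stderr).into_owned())
    }
}

fn main() -> std::io::Result<()> {
    let result = run_python_in_sandbox("print(2 ** 10)")?;
    println!("{result}"); // expected: 1024
    Ok(())
}
```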
I'll share more comprehensive documentation and a video demonstration of it by this evening. Please share your feedback.
Here are the screenshots:
- API request:
- code that is identified for execution:
Hi @juntao
I have made a video demonstration of the code interpreter. It can detect Python code enclosed in ```python code blocks. Demo video: https://drive.google.com/file/d/1YhPrzxTN_W0Yp1CFt1BbrdBuJs_4Yok3/view?usp=sharing
Below I have attached some screenshots,
The LLM is given a task in a prompt and is instructed to use Python to determine the answer. It then invokes the code interpreter, whose output is fed back to the LLM. All of these exchanges are recorded in the conversation thread.
Prompt-1: calculate prime numbers between 1 and 50 using python.
Prompt-2: calculate the first 10 fibonacci numbers using python
Prompt-3: calculate exp(5) upto 5 decimal places using python
Execution Flow
Below is a flowchart showing the back-and-forth execution flow between the LLM and the code interpreter:
flowchart TD
A[User Request] --> B[LLM Response]
B --> C{Code Detected?}
C -->|No| D[Final Response]
C -->|\```python| E[Docker Execute]
E --> F[Tool Result]
F --> G[LLM + Tool Context]
G --> H{More Code Needed?}
H -->|Yes| E
H -->|No| D
D --> I[(Database)]
style A fill:#6200ea,color:#ffffff
style B fill:#00bcd4,color:#000000
style E fill:#ff5722,color:#ffffff
style F fill:#4caf50,color:#ffffff
style G fill:#ff9800,color:#000000
style D fill:#9c27b0,color:#ffffff
style I fill:#424242,color:#ffffff
Hi @juntao,
I have completed the pretest and expanded llama-nexus to support the /responses API. Here’s a video demonstration of it working: Video demonstration
Currently, the /responses endpoint supports persistent conversation history with full OpenAI-compatible response structure and integrates seamlessly with registered downstream servers.
I’m now working on implementing browser-use functionality to extend tool capabilities within the /responses flow. I’ll share updates and demos as I make progress.
Moved to #4374