Implement OpenAI-Compatible Client for vLLM and Ollama (#191)
Objective: Develop a Python client library within the factorio-learning-environment repository that provides an OpenAI-compatible interface for interacting with the vLLM and Ollama APIs, as outlined in the task list below. The client should abstract away the differences between vLLM and Ollama, offering a unified interface where possible while supporting platform-specific features and parameters.
Related Issue: #191
Task List
- [ ] Design Client Interface:
  - Define an `OpenAICompatibleClient` class that mimics the `openai.OpenAI` interface.
  - Identify common endpoints (Completions, Chat Completions, Embeddings) and platform-specific endpoints (vLLM: Tokenizer, Classification; Ollama: Model Management).
  - Design handling of platform-specific parameters (vLLM: `top_k`; Ollama: `num_ctx`).
- [ ] Create Configuration Schema:
  - Define a `pydantic`-based configuration model for client settings (`base_url`, `api_key`, `model`, `platform`); see the sketch after this task list.
  - Include platform-specific options (vLLM: `chat_template`; Ollama: `keep_alive`).
- [ ] Implement Base Client Class:
  - Create `OpenAICompatibleClient` in `agents/utils/openai_compatible_client.py`.
  - Implement initialization with configuration (`base_url`, `api_key`, `platform`).
  - Add a factory method that instantiates vLLM or Ollama handlers based on `platform`.
- [ ] Implement Common Endpoints:
  - Completions API (`/v1/completions` for vLLM, `/api/generate` for Ollama):
    - [ ] Implement `completions.create` method.
    - [ ] Handle parameters (`model`, `prompt`, `stream`, platform-specific options).
    - [ ] Normalize responses to the OpenAI schema (`choices`, `usage`).
  - Chat Completions API (`/v1/chat/completions` for vLLM, `/api/chat` for Ollama):
    - [ ] Implement `chat.completions.create` method.
    - [ ] Support `messages`, `tools`, `stream`, and platform-specific parameters.
    - [ ] Handle multi-modal inputs (images for Ollama's `llava`, vLLM's VLM2Vec).
    - [ ] Normalize streaming and non-streaming responses.
  - Embeddings API (`/v1/embeddings` for vLLM, `/api/embed` for Ollama):
    - [ ] Implement `embeddings.create` method.
    - [ ] Support `input` (text/messages) and `model`.
    - [ ] Handle platform-specific parameters (vLLM: `chat_template`; Ollama: `truncate`).
- [ ] Implement vLLM-Specific Endpoints:
  - [ ] Tokenizer API (`/tokenize`, `/detokenize`): Implement `tokenizer.encode` and `tokenizer.decode`.
  - [ ] Pooling API (`/pooling`): Implement `pooling.create` for encoding prompts.
  - [ ] Classification API (`/classify`): Implement `classification.create` for text classification.
  - [ ] Score API (`/score`): Implement `score.create` for sentence pair scoring.
  - [ ] Re-rank API (`/rerank`, `/v1/rerank`, `/v2/rerank`): Implement `rerank.create` for relevance scoring.
  - [ ] Transcriptions API (`/v1/audio/transcriptions`): Implement `audio.transcriptions.create` for ASR models.
- [ ] Implement Ollama-Specific Endpoints:
  - [ ] Create Model (`/api/create`): Implement `models.create`.
  - [ ] List Local Models (`/api/tags`): Implement `models.list`.
  - [ ] Show Model Information (`/api/show`): Implement `models.info`.
  - [ ] Copy Model (`/api/copy`): Implement `models.copy`.
  - [ ] Delete Model (`/api/delete`): Implement `models.delete`.
  - [ ] Pull Model (`/api/pull`): Implement `models.pull`.
  - [ ] Push Model (`/api/push`): Implement `models.push`.
  - [ ] Check Blob Exists (`/api/blobs/:digest`): Implement `blobs.check`.
  - [ ] Push Blob (`/api/blobs/:digest`): Implement `blobs.push`.
  - [ ] List Running Models (`/api/ps`): Implement `models.running`.
  - [ ] Version (`/api/version`): Implement `version`.
  - [ ] Legacy Embeddings (`/api/embeddings`): Support the deprecated endpoint.
- [ ] Handle Platform-Specific Parameters:
  - [ ] Support vLLM's `extra_body` (e.g., `top_k`, `guided_choice`) and `extra_headers`.
  - [ ] Support Ollama's `options` (e.g., `num_ctx`, `seed`) and `format`.
  - [ ] Map OpenAI parameters to platform-specific equivalents.
- [ ] Implement Streaming Support:
  - [ ] Handle streaming for Completions and Chat Completions using `requests` with `stream=True` (see the sketch after this task list).
  - [ ] Parse and yield JSON objects incrementally in an OpenAI-compatible format.
- [ ] Handle Multi-Modal Inputs:
  - [ ] Support image inputs (base64-encoded) for vLLM (VLM2Vec) and Ollama (`llava`).
  - [ ] Validate and encode image data in requests.
- [ ] Error Handling and Validation:
  - [ ] Implement HTTP error handling (400, 404, 500).
  - [ ] Use `pydantic` for input validation.
  - [ ] Handle platform-specific errors (e.g., vLLM's missing chat template, Ollama's model not found).
- [ ] Implement Response Normalization:
  - [ ] Normalize vLLM and Ollama responses to OpenAI schemas (`choices`, `usage`, `created`).
  - [ ] Map vLLM's `data` and Ollama's `response`/`message` to `choices`.
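To make the configuration and factory tasks above more concrete, here is a minimal sketch of one possible shape. All class, field, and handler names are illustrative assumptions, not a committed API:

```python
from typing import Any, Dict, Literal, Optional

from pydantic import BaseModel, Field


class ClientConfig(BaseModel):
    """Illustrative configuration model; field names are assumptions, not a fixed schema."""
    base_url: str
    api_key: Optional[str] = None
    model: Optional[str] = None
    platform: Literal["vllm", "ollama"]
    # Catch-all for platform-specific options, e.g. vLLM chat_template or Ollama keep_alive.
    extra: Dict[str, Any] = Field(default_factory=dict)


class _VLLMHandler:
    """Placeholder for the vLLM-specific request logic."""
    def __init__(self, config: ClientConfig) -> None:
        self.config = config


class _OllamaHandler:
    """Placeholder for the Ollama-specific request logic."""
    def __init__(self, config: ClientConfig) -> None:
        self.config = config


class OpenAICompatibleClient:
    """Sketch of the top-level client: validates config, then picks a handler at init time."""

    def __init__(self, base_url: str, platform: str,
                 api_key: Optional[str] = None, **extra: Any) -> None:
        self.config = ClientConfig(base_url=base_url, platform=platform,
                                   api_key=api_key, extra=extra)
        self._handler = self._make_handler()

    def _make_handler(self):
        # Factory step: route to the handler that knows the platform's endpoints.
        if self.config.platform == "vllm":
            return _VLLMHandler(self.config)
        return _OllamaHandler(self.config)
```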
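Similarly, a rough sketch of the streaming/normalization idea for Ollama's `/api/chat` (newline-delimited JSON mapped onto OpenAI-style chunks). The helper name and chunk layout are assumptions:

```python
import json
from typing import Any, Dict, Iterator

import requests


def stream_ollama_chat(base_url: str, payload: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield OpenAI-style chat chunks from Ollama's newline-delimited JSON stream (sketch).

    Ollama's /api/chat returns one JSON object per line, each with a `message`
    field and a final `done` flag; here each line is re-shaped into a `choices` delta.
    """
    with requests.post(f"{base_url}/api/chat",
                       json={**payload, "stream": True}, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            yield {
                "object": "chat.completion.chunk",
                "model": chunk.get("model"),
                "choices": [{
                    "index": 0,
                    "delta": {"content": chunk.get("message", {}).get("content", "")},
                    "finish_reason": "stop" if chunk.get("done") else None,
                }],
            }
```

vLLM's OpenAI-compatible server streams SSE `data:` lines terminated by `data: [DONE]` instead, so its handler would need a separate parser that feeds the same normalized chunk shape.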
Acceptance Criteria
- [ ] Client supports all documented vLLM and Ollama endpoints with OpenAI-compatible interfaces.
- [ ] Common endpoints (Completions, Chat Completions, Embeddings) work across providers.
- [ ] Platform-specific endpoints are accessible via intuitive methods.
- [ ] Multi-modal inputs (e.g., images) are supported where applicable.
- [ ] Client is compatible with the `openai.OpenAI` interface, requiring only minimal code changes.
Example Usage
```python
from agents.utils.openai_compatible_client import OpenAICompatibleClient

# vLLM client
vllm_client = OpenAICompatibleClient(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
    platform="vllm",
)
response = vllm_client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"top_k": 50},
)
print(response.choices[0].message.content)

# Ollama client
ollama_client = OpenAICompatibleClient(
    base_url="http://localhost:11434",
    platform="ollama",
)
response = ollama_client.completions.create(
    model="llama3.2",
    prompt="Why is the sky blue?",
    stream=False,
    options={"seed": 123},
)
print(response.choices[0].text)

# List Ollama models
models = ollama_client.models.list()
print([model["name"] for model in models["models"]])
```
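If streaming is exposed the same way the `openai` SDK exposes it, usage could look like the following; this is hypothetical and depends on the chunk normalization chosen above:

```python
# Hypothetical streaming usage; chunk layout mirrors the OpenAI SDK's delta format.
stream = vllm_client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about Factorio."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```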
Notes
- vLLM Limitations: Handle unsupported parameters (e.g., `suffix`, `parallel_tool_calls`) with warnings or errors (see the sketch after these notes).
- Ollama Limitations: Account for the deprecated `/api/embeddings` endpoint and `context` parameter.
- Performance: Optimize for high QPS, taking vLLM's `X-Request-Id` warning into account.
- Extensibility: Design for future platform additions.
- Integration with Existing Code: Ensure compatibility with `LLMFactory` in `agents/utils/llm_factory.py`, particularly for image support and message formatting.
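As a starting point for the parameter-mapping task and the limitation notes above, a small sketch. The unsupported-parameter set comes from the vLLM limitations note; the OpenAI-to-Ollama option mapping is my assumption from the Ollama options documentation and should be verified:

```python
import warnings
from typing import Any, Dict, Tuple

# Parameters OpenAI accepts but (per the notes above) vLLM may reject; verify against
# the deployed vLLM version.
VLLM_UNSUPPORTED = {"suffix", "parallel_tool_calls"}

# Illustrative mapping from OpenAI-style kwargs to Ollama's `options` block (assumed names).
OPENAI_TO_OLLAMA_OPTIONS = {
    "temperature": "temperature",
    "top_p": "top_p",
    "max_tokens": "num_predict",
    "seed": "seed",
    "stop": "stop",
}


def split_ollama_params(kwargs: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split OpenAI-style kwargs into top-level request fields and Ollama `options` (sketch)."""
    top_level, options = {}, {}
    for key, value in kwargs.items():
        if key in OPENAI_TO_OLLAMA_OPTIONS:
            options[OPENAI_TO_OLLAMA_OPTIONS[key]] = value
        else:
            top_level[key] = value
    return top_level, options


def warn_unsupported_vllm(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Drop parameters vLLM is expected to reject, emitting a warning instead of failing."""
    for key in sorted(VLLM_UNSUPPORTED & kwargs.keys()):
        warnings.warn(f"Parameter '{key}' is not supported by vLLM and will be ignored.")
    return {k: v for k, v in kwargs.items() if k not in VLLM_UNSUPPORTED}
```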
https://ollama.com/blog/openai-compatibility
Have you seen this? It seems that we get some compatibility for free.
Yes, of course. I've made a bunch of these already and it's really great. We'll also basically get Hugging Face inference client compatibility too (but I didn't want to put that in), and that's the one I'm interested in ;-)
Hugging Face compatibility would be really cool!