
Implement OpenAI-Compatible Client for vLLM and Ollama (#191)

Open • Josephrp opened this issue 7 months ago • 3 comments

Objective: Develop a Python client library within the factorio-learning-environment repository that provides an OpenAI-compatible interface for interacting with the vLLM and Ollama APIs, as outlined in the task list from issue #191. The client should abstract the differences between vLLM and Ollama, offering a unified interface where possible while still supporting platform-specific features and parameters.

Related Issue: #191


Task List

  • [ ] Design Client Interface:
    • Define OpenAICompatibleClient class to mimic openai.OpenAI interface.
    • Identify common endpoints (Completions, Chat Completions, Embeddings) and platform-specific endpoints (vLLM: Tokenizer, Classification; Ollama: Model Management).
    • Design handling of platform-specific parameters (vLLM: top_k; Ollama: num_ctx).
  • [ ] Create Configuration Schema:
    • Define a pydantic-based configuration model for client settings (base_url, api_key, model, platform).
    • Include platform-specific options (vLLM: chat_template; Ollama: keep_alive).
  • [ ] Implement Base Client Class:
    • Create OpenAICompatibleClient in agents/utils/openai_compatible_client.py.
    • Implement initialization with configuration (base_url, api_key, platform).
    • Add factory method to instantiate vLLM or Ollama handlers based on platform (a rough sketch appears after this task list).
  • [ ] Implement Common Endpoints:
    • Completions API (/v1/completions for vLLM, /api/generate for Ollama):
      • [ ] Implement completions.create method.
      • [ ] Handle parameters (model, prompt, stream, platform-specific options).
      • [ ] Normalize responses to OpenAI schema (choices, usage).
    • Chat Completions API (/v1/chat/completions for vLLM, /api/chat for Ollama):
      • [ ] Implement chat.completions.create method.
      • [ ] Support messages, tools, stream, and platform-specific parameters.
      • [ ] Handle multi-modal inputs (images for Ollama’s llava, vLLM’s VLM2Vec).
      • [ ] Normalize streaming and non-streaming responses.
    • Embeddings API (/v1/embeddings for vLLM, /api/embed for Ollama):
      • [ ] Implement embeddings.create method.
      • [ ] Support input (text/messages) and model.
      • [ ] Handle platform-specific parameters (vLLM: chat_template; Ollama: truncate).
  • [ ] Implement vLLM-Specific Endpoints:
    • [ ] Tokenizer API (/tokenize, /detokenize): Implement tokenizer.encode and tokenizer.decode.
    • [ ] Pooling API (/pooling): Implement pooling.create for encoding prompts.
    • [ ] Classification API (/classify): Implement classification.create for text classification.
    • [ ] Score API (/score): Implement score.create for sentence pair scoring.
    • [ ] Re-rank API (/rerank, /v1/rerank, /v2/rerank): Implement rerank.create for relevance scoring.
    • [ ] Transcriptions API (/v1/audio/transcriptions): Implement audio.transcriptions.create for ASR models.
  • [ ] Implement Ollama-Specific Endpoints:
    • [ ] Create Model (/api/create): Implement models.create.
    • [ ] List Local Models (/api/tags): Implement models.list.
    • [ ] Show Model Information (/api/show): Implement models.info.
    • [ ] Copy Model (/api/copy): Implement models.copy.
    • [ ] Delete Model (/api/delete): Implement models.delete.
    • [ ] Pull Model (/api/pull): Implement models.pull.
    • [ ] Push Model (/api/push): Implement models.push.
    • [ ] Check Blob Exists (/api/blobs/:digest): Implement blobs.check.
    • [ ] Push Blob (/api/blobs/:digest): Implement blobs.push.
    • [ ] List Running Models (/api/ps): Implement models.running.
    • [ ] Version (/api/version): Implement version.
    • [ ] Legacy Embeddings (/api/embeddings): Support deprecated endpoint.
  • [ ] Handle Platform-Specific Parameters:
    • [ ] Support vLLM’s extra_body (e.g., top_k, guided_choice) and extra_headers.
    • [ ] Support Ollama’s options (e.g., num_ctx, seed) and format.
    • [ ] Map OpenAI parameters to platform-specific equivalents.
  • [ ] Implement Streaming Support:
    • [ ] Handle streaming for Completions and Chat Completions using requests with stream=True.
    • [ ] Parse and yield JSON objects incrementally in OpenAI-compatible format.
  • [ ] Handle Multi-Modal Inputs:
    • [ ] Support image inputs (base64-encoded) for vLLM (VLM2Vec) and Ollama (llava).
    • [ ] Validate and encode image data in requests.
  • [ ] Error Handling and Validation:
    • [ ] Implement HTTP error handling (400, 404, 500).
    • [ ] Use pydantic for input validation.
    • [ ] Handle platform-specific errors (e.g., vLLM’s missing chat template, Ollama’s model not found).
  • [ ] Implement Response Normalization (see the sketch just after this list):
    • [ ] Normalize vLLM and Ollama responses to OpenAI schemas (choices, usage, created).
    • [ ] Map vLLM’s data and Ollama’s response/message to choices.
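
As referenced in the normalization item above, a helper along these lines could map a non-streaming Ollama /api/chat response onto the OpenAI chat-completion shape. This is a minimal sketch, assuming a hypothetical helper name; the Ollama field names are the ones in its documented response:

import time
import uuid

def normalize_ollama_chat_response(raw: dict, model: str) -> dict:
    # Hypothetical helper: reshape Ollama's /api/chat fields into the
    # OpenAI chat-completion schema (choices, usage, created).
    prompt_tokens = raw.get("prompt_eval_count", 0)
    completion_tokens = raw.get("eval_count", 0)
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": raw.get("message", {}),
                "finish_reason": "stop" if raw.get("done") else None,
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }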

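Similarly, the pydantic configuration model and factory-style initialization described in the earlier tasks might look roughly like this; class and field names are illustrative, not a final design:

from typing import Literal, Optional

from pydantic import BaseModel

class ClientConfig(BaseModel):
    # Settings common to both platforms.
    base_url: str
    api_key: Optional[str] = None
    model: Optional[str] = None
    platform: Literal["vllm", "ollama"]
    # Platform-specific extras (e.g., vLLM chat_template, Ollama keep_alive).
    chat_template: Optional[str] = None
    keep_alive: Optional[str] = None

class _VLLMHandler:
    # Placeholder: would own vLLM's /v1/... routes and extra_body handling.
    def __init__(self, config: ClientConfig):
        self.config = config

class _OllamaHandler:
    # Placeholder: would own Ollama's /api/... routes and options handling.
    def __init__(self, config: ClientConfig):
        self.config = config

class OpenAICompatibleClient:
    def __init__(self, base_url, platform, api_key=None, **kwargs):
        self.config = ClientConfig(base_url=base_url, api_key=api_key, platform=platform, **kwargs)
        # Factory step: select the handler for the chosen platform.
        handlers = {"vllm": _VLLMHandler, "ollama": _OllamaHandler}
        self._handler = handlers[self.config.platform](self.config)
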
Acceptance Criteria

  • [ ] Client supports all documented vLLM and Ollama endpoints with OpenAI-compatible interfaces.
  • [ ] Common endpoints (Completions, Chat Completions, Embeddings) work across providers.
  • [ ] Platform-specific endpoints are accessible via intuitive methods.
  • [ ] Multi-modal inputs (e.g., images) are supported where applicable.
  • [ ] Client can be used in place of openai.OpenAI with minimal code changes.

Example Usage

from agents.utils.openai_compatible_client import OpenAICompatibleClient

# vLLM client
vllm_client = OpenAICompatibleClient(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
    platform="vllm"
)
response = vllm_client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"top_k": 50}
)
print(response.choices[0].message.content)

# Ollama client
ollama_client = OpenAICompatibleClient(
    base_url="http://localhost:11434",
    platform="ollama"
)
response = ollama_client.completions.create(
    model="llama3.2",
    prompt="Why is the sky blue?",
    stream=False,
    options={"seed": 123}
)
print(response.choices[0].text)

# List Ollama models
models = ollama_client.models.list()
print([model["name"] for model in models["models"]])
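
# Streaming chat completion (illustrative extension of this example; assumes
# the proposed client yields OpenAI-style chunks with a delta field when
# stream=True, as described in the streaming task)
stream = vllm_client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about Factorio."}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)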

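An image-input example could follow the same OpenAI-style message format; the model name, file path, and content layout below are illustrative assumptions for the multi-modal task:

import base64

# Base64-encode a local image and pass it using the OpenAI image_url format
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = ollama_client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        ]
    }]
)
print(response.choices[0].message.content)
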
Notes

  • vLLM Limitations: Handle unsupported parameters (e.g., suffix, parallel_tool_calls) with warnings or errors (a rough sketch follows this list).
  • Ollama Limitations: Account for the deprecated /api/embeddings endpoint and the deprecated context parameter.
  • Performance: Optimize for high QPS; vLLM warns that adding the X-Request-Id header can hurt performance at high request rates.
  • Extensibility: Design for future platform additions.
  • Integration with Existing Code: Ensure compatibility with LLMFactory in agents/utils/llm_factory.py, particularly for image support and message formatting.
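
For the vLLM limitations note above, one approach is to strip unsupported OpenAI parameters and emit a warning instead of forwarding them. A minimal sketch, assuming an illustrative helper name and an example (unverified) parameter set:

import warnings

# Example set of OpenAI parameters the vLLM backend would not forward
# (assumption for illustration; confirm the exact set against the vLLM docs).
_VLLM_UNSUPPORTED = {"suffix", "parallel_tool_calls"}

def filter_vllm_params(params: dict) -> dict:
    # Warn about unsupported keys and drop them rather than failing the request.
    for key in set(params) & _VLLM_UNSUPPORTED:
        warnings.warn(f"Parameter '{key}' is not supported by vLLM and will be ignored.")
    return {k: v for k, v in params.items() if k not in _VLLM_UNSUPPORTED}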

Josephrp · May 21 '25 14:05

https://ollama.com/blog/openai-compatibility

Have you seen this? It seems that we get some compatibility for free.

JackHopkins · May 21 '25 14:05

Yes, of course. I've made a bunch of these already, and it's really great. We'll also basically get Hugging Face inference client compatibility too (but I didn't want to put that in), and that's the one I'm interested in ;-)

Josephrp · May 21 '25 14:05

Huggingface compatibility would be really cool!

JackHopkins · May 21 '25 14:05