Is it possible to support multiple endpoints for one server?

Open arkohut opened this issue 1 year ago • 21 comments

🚀 Feature

Multiple endpoints like /embedding or /vlm/predict or /ocr/predict.

Motivation

I would like to host multiple models on a single GPU for different purposes. It would be ideal to support numerous (small) models while maintaining high performance, such as through batching.

Additionally, I believe starting multiple litserve instances with different ports may introduce unnecessary complexity, compared to starting a single server with different endpoints.

Pitch

Alternatives

Additional context

arkohut avatar Sep 06 '24 07:09 arkohut

Hi @arkohut,

You can add an additional endpoint by implementing a LitSpec API, similar to the OpenAISpec. Currently, it only takes a single spec.

bhimrazy avatar Sep 06 '24 08:09 bhimrazy

Hi @aniketmaurya,

It seems litserve already handles multiple specs, but the worker setup currently accepts only a single one. Do we have plans to support multiple specs / an array of specs?

for spec in self._specs:
    spec: LitSpec
    # TODO: check for path conflicts
    for path, endpoint, methods in spec.endpoints:
        self.app.add_api_route(
            path, endpoint=endpoint, methods=methods, dependencies=[Depends(self.setup_auth())]
        )

bhimrazy avatar Sep 06 '24 08:09 bhimrazy

Hi @arkohut,

You can add an additional endpoint by implementing a LitSpec API, similar to the OpenAISpec. Currently, it only takes a single spec.

Oh, sorry! It looks like we can currently only use one endpoint, either the default one or the one added from the spec.

Also, there are some discussions about multiple endpoints in issue #90. Feel free to check it out!

bhimrazy avatar Sep 06 '24 08:09 bhimrazy

Thanks for the reply. Issue #90 is just about customizing the endpoint path, which I think is quite necessary. For example, I need an OpenAI-compatible embedding endpoint, which is not supported by litserve (it only supports the chat API).

But it is not about multiple endpoints. The only way right now is to expose multiple servers on different ports.

arkohut avatar Sep 09 '24 06:09 arkohut

Hi @arkohut, agreed on the multi-endpoints feature, but not sure if it's in the plan. I did a quick hack for this, though it’s not perfect since the extra endpoints are isolated from the main litserve engine. Hope it helps!

# server.py

import litserve as ls
import numpy as np
from fastapi import Depends
from openai.types.embedding_create_params import EmbeddingCreateParams
from openai.types.create_embedding_response import (
    CreateEmbeddingResponse,
    Embedding,
    Usage,
)
from typing import Generator


class ChatAPI(ls.LitAPI):
    def setup(self, device: str) -> None:
        """Initialize the model and other required resources."""
        self.model = None  # Placeholder: Initialize or load your model here.

    def predict(self, prompt: str) -> Generator[str, None, None]:
        """Generator function to yield the model output step by step."""
        yield "This is a sample generated output"

    def encode_response(self, output: Generator[str, None, None]) -> Generator[dict, None, None]:
        """Format the response to fit the assistant's message structure."""
        for out in output:
            yield {"role": "assistant", "content": out}
        # Final token after finishing processing
        yield {"role": "assistant", "content": "This is the final msg."}


def embedding_fn(request: EmbeddingCreateParams) -> CreateEmbeddingResponse:
    """Generate a fake embedding for demonstration purposes."""
    # Placeholder: Cache the model here to avoid reloading for every request.
    embeddings = [
        Embedding(embedding=np.random.rand(512).tolist(), index=0, object="embedding")
    ]
    
    # Token usage calculation
    prompt_tokens = 20
    input_len = len(request["input"].split())
    total_tokens = input_len + prompt_tokens

    usage = Usage(prompt_tokens=prompt_tokens, total_tokens=total_tokens)
    
    # Return the response formatted as per OpenAI API structure
    return CreateEmbeddingResponse(
        data=embeddings,
        model=request["model"],
        object="list",
        usage=usage
    )


if __name__ == "__main__":
    # Initialize the API and server
    api = ChatAPI()
    server = ls.LitServer(api, spec=ls.OpenAISpec())

    # Add the embedding API route
    server.app.add_api_route(
        "/v1/embeddings",
        embedding_fn,
        methods=["POST"],
        tags=["embedding"],
        dependencies=[Depends(server.setup_auth())], 
    )

    # Run the server
    server.run(port=8000)
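
For reference, a quick client-side check of the extra route could look like this (a minimal sketch, assuming the server above is running on localhost:8000 and that requests is installed; the model name is just a placeholder):

# query_embeddings.py
import requests

response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"input": "hello world", "model": "fake-embedding-model"},
)
response.raise_for_status()
# 512 with the fake embedding generated above
print(len(response.json()["data"][0]["embedding"]))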

bhimrazy avatar Sep 09 '24 08:09 bhimrazy

Thanks for the great discussion.

Yes, that's definitely something we want to enable. Spec is meant to make an API conform to a given API specification, so I wouldn't abuse it.

What I would rather do is create something to launch a collection of LitServers in the same server.

Initially we thought it would be simpler to pass a list or dict of LitAPIs to LitServer, but then all the arguments to LitServer would have to be specified per-API and things would get very murky.

The simpler thing to do is to have a function or class that takes a collection of LitServers, which you then run.

Could be something like

embed_server = ls.LitServer(embed_api, ...)
llm_server = ls.LitServer(llm_api, ...)

run_servers(embed_server, llm_server)

# or

run_servers({"/embed-prefix": embed_server, "/predict-prefix": llm_server})

or we could introduce another server collection class, but the concept doesn't change.

This is good because it would give you the ability to specify worker settings, batching settings, etc per endpoint, which you absolutely need to do.

lantiga avatar Sep 09 '24 16:09 lantiga

Thanks so much, @lantiga, for the great idea! I’m excited about the direction and look forward to doing some research and making a contribution to it.

bhimrazy avatar Sep 09 '24 17:09 bhimrazy

Hi @bhimrazy! If you're interested in contributing to this issue, you can try the following:

  • We define a run_all function that accepts a list of LitServer objects.
  • run_all will create the socket, as shown here, and then perform the rest of the LitServer.run method's operations in a combined way.

Please let me know if you have any questions.

aniketmaurya avatar Sep 10 '24 12:09 aniketmaurya

Sure, @aniketmaurya! I'll start drafting a PR.

Also, I have a few points of confusion about it, but I'll first review the details to get a clearer understanding and then get back to you. Thank you 🙂

bhimrazy avatar Sep 10 '24 15:09 bhimrazy

Hi @arkohut, would you be available to chat more about this issue? We are doing some research on how to enable this feature for users in the best way.

aniketmaurya avatar Sep 19 '24 11:09 aniketmaurya

Hi @arkohut, would you be available to chat more about this issue? We are doing some research on how to enable this feature for users in the best way.

OK, I will tell more about my use case.

I am working on a project that requires multiple models to run on a local PC. The reason for this is to ensure personal privacy is not compromised.

Specifically, this project is much like a current project called Rewind. I need to extract text from screenshots, use a multimodal model to describe the screenshots, and ultimately use an embedding model to store the extracted data into a vector database. Then I can use text to search the indexed data.

In this process, multiple models are involved:

  1. OCR model
  2. VLM model
  3. Embedding model

I hope these models can be loaded on a local GPU, and preferably use a solution like litserve to ensure the operational efficiency of the models.

Currently, ollama seems to be a very good solution for running models locally, but it has the following issues:

  1. Ollama's support for newer models is often slow or even non-existent, making it far less flexible than litserve.
  2. Ollama mainly supports running LLMs, with relatively limited support for other kinds of models. Here too, litserve is more flexible and can support a richer variety of models.

I would like to emphasize that the models have a significant impact on the effectiveness of the project, so I am very keen on running the best possible models locally, even with limited computational power. At present, many excellent VLM models, such as Qwen-VL, Florence 2, and InternVL2, are still not supported by ollama.

Even if running the models locally ultimately isn't feasible due to model performance limitations, it is still crucial to have a solution that allows multiple models to run on a single GPU with fast inference speed (rather than using multiple GPUs, even if it's an A100 or H100).

arkohut avatar Sep 20 '24 01:09 arkohut

good

aceliuchanghong avatar Sep 25 '24 14:09 aceliuchanghong

Hi, I want to understand how to manage GPU memory in the case of multi-model serving.

akansal1 avatar Sep 26 '24 08:09 akansal1

Hi, I want to understand how to manage GPU memory in the case of multi-model serving.

Hi @akansal1, if you have multiple model instances, then each instance will take up GPU memory individually.

If you are using multiple workers, you can use set_per_process_memory_fraction to limit the caching allocator.
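
For reference, a minimal sketch of where such a call could go, assuming a PyTorch-backed LitAPI (the 0.5 fraction is purely illustrative):

import torch
import litserve as ls


class CappedMemoryAPI(ls.LitAPI):
    def setup(self, device):
        # Limit this worker's CUDA caching allocator to ~50% of the device memory,
        # so another model sharing the GPU keeps room for its own allocations.
        if str(device).startswith("cuda"):
            torch.cuda.set_per_process_memory_fraction(0.5, device=torch.device(device))
        self.model = lambda x: x  # placeholder: load the real model here

    def decode_request(self, request):
        return request["input"]

    def predict(self, x):
        return self.model(x)

    def encode_response(self, output):
        return {"output": output}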

aniketmaurya avatar Sep 26 '24 11:09 aniketmaurya

For people looking to route multiple models in the same server, we have put together a new docs page here. Please let us know your feedback.
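
For anyone who wants a feel for the pattern before opening the docs, here is a minimal sketch of field-based routing inside a single LitAPI (the "model" field, the model names, and the placeholder models are illustrative, not the docs' exact code):

import litserve as ls


class MultiModelAPI(ls.LitAPI):
    def setup(self, device):
        # Load each (small) model once per worker; lambdas stand in for real models.
        self.models = {
            "ocr": lambda x: f"ocr result for {x!r}",
            "embedding": lambda x: [0.0, 1.0, 2.0],
        }

    def decode_request(self, request):
        # The client selects the model through a field in the JSON body.
        return request["model"], request["input"]

    def predict(self, args):
        model_name, x = args
        return self.models[model_name](x)

    def encode_response(self, output):
        return {"output": output}


if __name__ == "__main__":
    ls.LitServer(MultiModelAPI(), accelerator="auto").run(port=8000)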

aniketmaurya avatar Oct 24 '24 21:10 aniketmaurya

@aniketmaurya am I correct in understanding this means there are no plans to support multiple routes?

The suggestion in the linked docs of pseudo-routing with a field in a JSON object feels pretty janky to me, and it seems like it ought to be core functionality for a server application to be able to serve more than one thing natively?

b-d-e avatar Nov 18 '24 20:11 b-d-e

Hi @b-d-e, the core focus here is serving a model at scale. We haven’t yet identified a truly convincing, production-based (non-workaround) scenario where this approach would be used. However, this issue will remain open for discussion and to explore any compelling real-world use cases.

aniketmaurya avatar Nov 18 '24 20:11 aniketmaurya

In my application scenario, in addition to the /predict endpoint, I need to implement some custom endpoints to meet the requirements of the upper-layer application's calls. This upper-layer application is mainly responsible for the large-scale deployment and management of services. I believe that supporting multiple endpoints is necessary. These endpoints do not necessarily have to be used for model inference; they can also be used for tasks such as querying model information (as mentioned in #366), invocation counts, etc. I personally look forward to this new feature.

gjgjh avatar Jan 02 '25 09:01 gjgjh

Hi @aniketmaurya, in the MultipleModelAPI, when using max_batch_size > 1, it seems like requests for different models will be batched together, is that correct?

Is there a way to ensure that requests are only batched together for the same model and not across different models (i.e. my batch should only contain samples for model1, not half model1 and half model2)? Or do you plan on supporting this in the future?

pauldoucet avatar Apr 04 '25 10:04 pauldoucet

Hi! Our team is facing the same challenge. We need multiple endpoints to handle tasks like feedback collection, activating different components, managing various inputs, and serving different documentation to Swagger. Additionally, we use middleware, which we're not sure how to integrate into the LitServe structure.

Having to use a single endpoint or duplicate servers seems quite limiting. How should we implement a simple application like the one I've attached using LitServe? I'd prefer to avoid adding multiple if-statements within internal functions, as it significantly reduces code readability.

Example

# Imports assumed by this snippet (not in the original attachment)
import time
from typing import Optional
from uuid import uuid4

from fastapi import BackgroundTasks, FastAPI, Request
from fastapi.openapi.utils import get_openapi
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware

app = FastAPI()
# DocumentRequest and FeedbackRequest are the application's Pydantic request models (not shown)


# PROCESSING ENDPOINTS
@app.post("/api/documents", tags=["Document Processing"])
async def submit_document(document: DocumentRequest, background_tasks: BackgroundTasks):
    """Submit a document for classification and field extraction"""
    # Implementation


# FEEDBACK AND TRAINING ENDPOINTS
@app.post("/api/feedback", tags=["Feedback & Training"])
async def submit_feedback(feedback: FeedbackRequest):
    """Collect feedback to improve AI models"""
    # Implementation


# FRONTEND INTERACTION ENDPOINTS
@app.get("/api/documents/recent", tags=["Frontend"])
async def get_recent_documents(limit: int = 10, status: Optional[str] = None):
    """Provide data for frontend dashboard"""
    # Implementation


# DEBUGGING AND MONITORING ENDPOINTS
@app.get("/api/system/metrics", tags=["Debug & Monitoring"])
async def get_system_metrics():
    """System metrics for monitoring"""
    # Implementation


class RequestLoggingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # Generate transaction ID for tracking
        transaction_id = str(uuid4())
        request.state.transaction_id = transaction_id

        # Request logging and timing
        start_time = time.time()
        try:
            response = await call_next(request)
            process_time = time.time() - start_time

            # Add informational headers
            response.headers["X-Process-Time"] = str(process_time)
            response.headers["X-Transaction-ID"] = transaction_id

            return response
        except Exception:
            # Centralized error handling
            return JSONResponse(
                status_code=500,
                content={"detail": "Internal server error", "transaction_id": transaction_id}
            )


# Simple middleware application to all endpoints
app.add_middleware(RequestLoggingMiddleware)


def custom_openapi():
    """Customize OpenAPI/Swagger documentation."""
    openapi_schema = get_openapi(
        title="AI Document Processor API",
        version="1.0.0",
        description="An API for document classification and field extraction using specialized AI models",
        routes=app.routes,
    )

    # Organization by logical categories
    openapi_schema["tags"] = [
        {
            "name": "Document Processing",
            "description": "Operations for document processing",
        },
        {
            "name": "Feedback & Training",
            "description": "Feedback management and AI model training",
        },
        # Other tags...
    ]

    # Examples for specific endpoints
    document_upload_path = "/api/documents"
    if document_upload_path in openapi_schema["paths"]:
        if "post" in openapi_schema["paths"][document_upload_path]:
            openapi_schema["paths"][document_upload_path]["post"]["examples"] = {
                "Invoice Upload": {
                    "summary": "Upload a PDF",
                    "value": {
                        "content_base64": "JVBERi0xLjMK...",
                        "filename": "doc.pdf",
                        "metadata": {"client_id": "ACME123"}
                    }
                }
            }

    return openapi_schema


app.openapi = custom_openapi

P.S. We've also encountered some issues with span propagation and try-catch handling within LitServe. If you could update the documentation on how to manage logging and error handling, it would be extremely helpful.

giovanniMen avatar May 05 '25 15:05 giovanniMen

Additionally, I believe starting multiple litserve instances with different ports may introduce unnecessary complexity, compared to starting a single server with different endpoints.

If you do this with containers, the port concern goes away: you can have multiple containers that all listen on the same container port, but with a different friendly hostname for each endpoint.

From there, have one more container as a reverse proxy to route the endpoint URI you'd prefer to the appropriate litserve container via its hostname (or, with some reverse proxies, you can annotate the container with a label that simplifies this).


Multiple endpoints like /embedding or /vlm/predict or /ocr/predict.

I haven't tried LitServe myself yet, but you can easily define URL subpaths that route to a different container, and you can preserve what you want from that URI before forwarding the request to the container:

Docker compose.yaml examples

Save either of the YAML snippets as compose.yaml

With a reverse proxy

Here's an example with just Caddy, using a config file instead of container labels as config:

services:
  reverse-proxy:
    image: caddy:2.9
    configs:
      - source: caddyfile
        target: /etc/caddy/Caddyfile
    # This is only for running locally to test the example;
    # it acts as if DNS requests for example.com are routed to this Caddy container
    networks:
      default:
        aliases:
          - example.com

  # Your LitServe containers; each service name gets a routable DNS name managed by Docker internally:
  embedding:
    image: traefik/whoami

  ocr:
    image: traefik/whoami

  # This container is using Caddy to be a little bit more useful for demo purposes:
  vlm:
    image: caddy:2.9
    configs:
      - source: example
        target: /etc/caddy/Caddyfile

configs:
  caddyfile:
    content: |
      example.com {
        # This is just for local testing, remove this `tls internal` line
        # in production to provision certs for `example.com` via LetsEncrypt:
        tls internal

        # This takes the first path segment of the URL request
        # and then forwards the request to a container with that same internal network name:
        # eg: https://example.com/vlm/predict would route to http://vlm/predict
        vars service-name {path.0}
        handle_path /{vars.service-name}* {
          reverse_proxy {vars.service-name}{uri}
        }
      }

  # Using Caddy again to act as a service with some different routes:
  example:
    content: |
      :80 {
        handle_path /predict/* {
          respond "predicting..."
        }

        handle_path /whatever/* {
          respond "whatever..."
        }

        # Fallback:
        handle {
          respond "Hello from /"
        }
      }

networks:
  default:
    name: example-net
# Start the containers and run a new alpine container connected to that network:
$ docker compose up -d --force-recreate
$ docker run --rm -it --network example-net alpine ash

# Add curl and make a request:
$ apk add curl
$ curl -kL http://example.com/vlm/predict/
predicting...

I've skimped on this a bit assuming general familiarity with containers and reverse proxies, just to focus on the routing aspect as a solution which is effectively these few lines in Caddy:

vars service-name {path.0}
handle_path /{vars.service-name}* {
  reverse_proxy {vars.service-name}{uri}
}

With container labels

With Caddy Docker Proxy (CDP), similar to Traefik you can route by labels as automated config for Caddy instead:

services:
  reverse-proxy:
    image: lucaslorentz/caddy-docker-proxy:2.9
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - "80:80"
      - "443:443"

  # Simple with subdomains instead:
  # https://vlm-predict.example.com
  example:
    image: traefik/whoami
    labels:
      caddy: vlm-predict.example.com
      caddy.reverse_proxy: "{{upstreams 80}}"

  # This strips off the path prefix such that:
  # https://example.com/vlm/predict/whatever internally routes the request to
  # http://<ip-of-container>:80/whatever
  example-b:
    image: traefik/whoami
    labels:
      caddy: example.com
      caddy.handle_path: /vlm/predict/*
      caddy.handle_path.0_reverse_proxy: "{{upstreams 80}}"

  # This strips off the path prefix and then rewrites the URL such that:
  # https://example.com/ocr/predict/something internally routes the request to
  # http://<ip-of-container>:80/example/whatever
  example-c:
    image: traefik/whoami
    labels:
      caddy: example.com
      caddy.handle_path: /ocr/*
      caddy.handle_path.0_rewrite: "* /example/whatever{uri}"
      caddy.handle_path.1_reverse_proxy: "{{upstreams}}"

Assuming that /vlm should represent the LitServe container that'd normally be called with /predict, which is what I demonstrated in the first example with a Caddyfile config, that'd look like this:

services:
  example:
    image: traefik/whoami
    labels:
      caddy: example.com
      caddy.handle_path: /vlm/*
      caddy.handle_path.0_reverse_proxy: "{{upstreams 80}}"

No need to manage a separate Caddyfile config; so long as you are comfortable using Docker Compose, the YAML snippet is pretty simple :)

polarathene avatar May 17 '25 22:05 polarathene

Thank you everyone for your patience on this. We just added multiple endpoint support in this PR. I am closing this issue now; if you face any issue while using multiple endpoints, please feel free to create a new issue.

PS: It's available in the main branch and version litserve==0.2.11a2.

Docs are available here.

aniketmaurya avatar May 27 '25 19:05 aniketmaurya