
feat: OpenAI Compatible Frontend

Open rmccorm4 opened this issue 1 year ago • 2 comments

Description

Adds an OpenAI Compatible Frontend for Triton Inference Server as a FastAPI application using the tritonserver in-process python bindings for the following endpoints:

  • /v1/models
  • /v1/completions
  • /v1/chat/completions

Additionally there are some other observability routes exposed for convenience:

  • /metrics - Prometheus-compatible metrics from Triton Core
  • /health/ready - General health check for inference readiness, similar to NIM schema.
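
As a rough illustration of the /health/ready semantics (the function and flag names here are hypothetical, not the actual implementation), readiness collapses the in-process server's liveness and readiness into a single HTTP status code:

```python
# Hypothetical sketch of the /health/ready decision logic: return 200
# only when the in-process tritonserver reports both live and ready.
# Names and fields are illustrative, not code from this PR.
def ready_status(server_live: bool, server_ready: bool) -> int:
    """Map server readiness flags to an HTTP status code."""
    return 200 if (server_live and server_ready) else 503
```

A route handler would query the in-process tritonserver for these flags and return the resulting status to the caller.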

This is a refactor and extension of the original example by @nnshah1 here: https://github.com/triton-inference-server/tutorials/pull/104 to include more thorough testing.

Refactors/Changes

  1. The file structure was refactored to make each logical component smaller, more discoverable, and more digestible. It was loosely based on the following resources:

    • https://fastapi.tiangolo.com/tutorial/bigger-applications/#an-example-file-structure
    • https://github.com/zhanymkanov/fastapi-best-practices?tab=readme-ov-file#project-structure
  2. The global variables were converted to use the FastAPI app object for storing state instead, such as app.server to access the in-process tritonserver across different routes handling requests.

  3. Encapsulated info related to model objects in a ModelMetadata dataclass to keep things closer together.

  4. Made the TOKENIZER an explicit setting provided by the user at startup, rather than something we try to infer from the Triton model name used in the model repository. This is more resilient to BYO/custom models and models not produced by the Triton CLI. The BACKEND or request conversion format should probably also be made an explicit user setting, to better handle edge cases or LLMs defined in other backends like python or onnxruntime.

  5. Updated various parts of the generated schemas/openai.py that appeared to have been generated for pydantic v1 (we currently use v2), resolving the deprecation warnings for features that still work today but will be fully removed in coming versions of pydantic: https://docs.pydantic.dev/latest/migration/
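
A minimal sketch of points 2-4 above, with all names hypothetical: model info grouped in a dataclass, and the tokenizer taken from an explicit user-provided setting instead of being inferred from the Triton model name:

```python
import os
from dataclasses import dataclass

# Hypothetical sketch of points 2-4: model info grouped in a
# ModelMetadata-style dataclass, with the tokenizer read from an
# explicit setting. Field names are illustrative, not from this PR.
@dataclass
class ModelMetadata:
    name: str        # Triton model name in the model repository
    backend: str     # e.g. "vllm", "tensorrtllm", "python"
    tokenizer: str   # explicit user-provided tokenizer identifier

def load_metadata(model_name: str, backend: str) -> ModelMetadata:
    # TOKENIZER is provided explicitly at startup, which is more
    # resilient for BYO/custom models than inferring it.
    tokenizer = os.environ.get("TOKENIZER", "")
    return ModelMetadata(name=model_name, backend=backend, tokenizer=tokenizer)
```

In the FastAPI app itself, such objects would live on the app object (e.g. app.server, per point 2) so every route handling requests can reach them without module-level globals.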

Testing Methodology

  1. The original set of tests was added using FastAPI's TestClient for simplicity in getting started. I later found out that this TestClient object, despite exercising the application and its logic as normal, does not actually expose the network host/ports for interaction from other clients. It would probably be good to move away from TestClient and use the OpenAIServer utility, described below, for all testing later on.
  2. OpenAI client tests were then added to show how this implementation can act as a drop-in replacement for current users of the openai python client library. This required an OpenAIServer utility class to run the application differently than the TestClient flow expects, similar to vLLM's utility here.
  3. Streaming tests were done through the OpenAI client, as the documentation around streaming with TestClient was sparse.
  4. Overall, the tests were meant to add broad support for various types of clients (raw http requests, openai client, genai-perf) and broad testing for the supported parameters (testing that changes in the openai-facing values produce different behavior in the backends). Deep dives into each and every parameter and exactly how it behaves for each value (ex: top_p, top_k) were not done; I would expect more of that to be tested by each backend implementation.
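
As an illustration of the kind of parameter testing described above (the conversion function and its names are hypothetical), converting openai-facing sampling values into explicitly typed backend parameters is exactly the spot where the type-coercion bugs mentioned in the Notes were caught:

```python
# Hypothetical sketch of converting OpenAI-facing sampling parameters
# into explicitly typed backend parameters. Getting the types right
# matters: some sampling parameters were being converted to incorrect
# types internally, which broad parameter tests are meant to catch.
def convert_sampling_params(temperature=1.0, top_p=1.0, max_tokens=16):
    return {
        "temperature": float(temperature),  # backends expect a float
        "top_p": float(top_p),
        "max_tokens": int(max_tokens),      # must be an int, not float/str
    }
```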

Open Questions

  • [ ] Should we expose --backend/BACKEND or some signal to explicitly choose a "triton request format" to handle cases where it may be ambiguous or unclear? Ex: TRT-LLM backend's python implementation is just backend: "python" and has no backend: "tensorrtllm" anywhere when used. I lean towards exposing it to allow for explicitly defined behavior, but just haven't yet.
  • [x] Should /v1/models list all models in the model repository, even if only some of those models are actually compatible with the completions / chat endpoints? It may be difficult to automatically infer which models are "compatible" and account for all scenarios, so it may need to be made explicit by the user if we want to limit which models are returned.
    • Let's move towards only returning the front-facing models that should be interacted with for now, this will be done in a separate PR. We can also expose a SERVED_MODEL_NAME for a cosmetic mapping from something like tensorrt_llm_bls -> meta-llama/Meta-Llama-3.1-8b-Instruct.
  • [x] What to do with the docker/Dockerfile* files and the general expected user-facing workflow. I added these as examples for myself to use during testing/development and included how they'd be used in the README, but ultimately we will probably be publishing this code within the respective containers, and these DIY containers are likely unnecessary.
    • These will be removed as we get closer to having something published in a container.
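
The SERVED_MODEL_NAME idea mentioned above could work as a simple cosmetic mapping; a sketch, where the variable and function names are assumptions:

```python
# Hypothetical sketch of a SERVED_MODEL_NAME cosmetic mapping: the
# /v1/models route would advertise the served name while routing
# requests to the underlying Triton model in the repository.
SERVED_MODEL_NAMES = {
    "tensorrt_llm_bls": "meta-llama/Meta-Llama-3.1-8b-Instruct",
}

def served_name(triton_model: str) -> str:
    # Fall back to the raw Triton model name when no mapping exists.
    return SERVED_MODEL_NAMES.get(triton_model, triton_model)
```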

Open Items for follow-up PRs

  • [ ] Benchmarking, especially under high concurrency and load
  • [ ] Add some tests that use genai-perf to maintain compatibility and catch regressions
  • [ ] README and Documentation improvements
  • [ ] KServe frontend bindings integration
  • [ ] Unsupported OpenAI schema parameters like logit_bias, logprobs, n > 1. These are mostly just unexplored, and haven't yet been scoped out to see the effort involved in supporting them.
  • [ ] Unsupported OpenAI schema response fields like usage
  • [ ] Migrating TestClient tests to use the OpenAIServer utility for all testing instead

Notes

  • I left in all the commits over time if you want to watch how the code evolved from the original, and see some of the edge cases that were caught by testing such as text/event-stream headers missing for streaming, temperature being ignored by TRT-LLM BLS, content=None for streaming messages causing genai-perf errors compared to content="", certain sampling parameters being converted to incorrect types internally, etc.
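
To illustrate two of the streaming edge cases mentioned above (the SSE wire format behind the text/event-stream header, and content=None vs content=""), here is a hedged sketch of building a chat-completion stream chunk; the field names follow the OpenAI schema, but the helper itself is hypothetical:

```python
import json
from typing import Optional

# Hypothetical sketch of two streaming edge cases caught by testing:
# chunks are sent as SSE ("data: ..." lines under a text/event-stream
# content type), and delta content should be "" rather than None,
# since content=None caused genai-perf errors.
def sse_chunk(model: str, text: Optional[str]) -> str:
    chunk = {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{
            "index": 0,
            "delta": {"role": "assistant",
                      "content": text if text is not None else ""},
        }],
    }
    return f"data: {json.dumps(chunk)}\n\n"
```

The streaming response itself would be served with a text/event-stream media type, the missing header being one of the bugs caught by testing.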

rmccorm4 avatar Aug 22 '24 01:08 rmccorm4

One general comment on the /v1/models open question - I think we should limit the list to the models that support the openai endpoints instead of listing all models. We can make it explicit via a command line option like backend and tokenizer.

Also agree that for the first pass let's be as explicit as possible - we can simplify afterwards if needed / there is an opportunity.

nnshah1 avatar Aug 22 '24 06:08 nnshah1

@KrishnanPrash FYI, use this python/... structure as the reference for KServe. This is what I meant when I commented on the file structure.

GuanLuo avatar Aug 23 '24 22:08 GuanLuo

Hello, I have a question about running Triton server from python code using import tritonserver.

I have 4 GPUs to serve with, so I have to set the world size to 4, but I cannot find any option or tutorials for this. If you can help I'd be so grateful.

dongs0104 avatar Sep 05 '24 15:09 dongs0104

Would it be possible to include /embeddings endpoint too? It would be the last missing piece for our company to try out Triton for our on-prem RAG/Agentic solutions.

faileon avatar Sep 19 '24 15:09 faileon

This is great!

robertgshaw2-redhat avatar Sep 20 '24 00:09 robertgshaw2-redhat

Would it be possible to include /embeddings endpoint too? It would be the last missing piece for our company to try out Triton for our on-prem RAG/Agentic solutions.

strong upvote for this feature to be enabled, would be great and, indeed - complete the whole functionality

chorus-over-flanger avatar Oct 10 '24 07:10 chorus-over-flanger

Hi @chorus-over-flanger @faileon, thanks for expressing interest in the /embeddings route! It's on our radar as another feature to add to the OpenAI frontend support (chat, completions, and models) added in this PR.

Since the route itself doesn't have too many parameters defined in the spec, it may be relatively straightforward to add.
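
For reference (not part of this PR), the /v1/embeddings request surface in the OpenAI spec is indeed small; a sketch of the main fields, where the dataclass itself is an assumption, not generated schema code:

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical sketch of the small /v1/embeddings request surface from
# the OpenAI spec: a model name, the input text(s), and an optional
# encoding format. Illustrative only, not code from this PR.
@dataclass
class EmbeddingsRequest:
    model: str
    input: Union[str, List[str]]     # single string or a batch of strings
    encoding_format: str = "float"   # "float" or "base64"
```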

If you have any particular models you'd like to see working as an example for /embeddings, do let us know.

If you have any feedback or interest in contributing to the project, let us know as well.

Thanks!

rmccorm4 avatar Oct 12 '24 00:10 rmccorm4

Hi @rmccorm4 ,

thanks a lot on this PR. Huge work very appreciated here.

Do you know if you're planning to add Function Calling / Tools support (DLIS-7168) soon?

copasseron avatar Nov 28 '24 21:11 copasseron

Hello,

is it already clear when the beta phase will be over?

Thanks for this very important PR!

DimadonDL avatar Nov 28 '24 22:11 DimadonDL

Hello @rmccorm4

I've been testing the OpenAI frontend of TIS with vLLM, and I'm impressed with the overall results - excellent work! I'd like to share some findings and observations:

  • #7845;

  • Performance: When load-testing, I noticed the OpenAI frontend (TIS/vLLM) shows lower TPS compared to the Generate extension of TIS with the vLLM backend. This testing was primarily conducted with quantized 7/8B models. I can provide more detailed test results if needed;

  • Regarding the embeddings endpoint: In my specific use case I found that large-model embeddings don't perform as well as current SOTA embedding models with million-scale parameters, which led me to switch to a different backend. Surprisingly, I noticed that vLLM recently added support for various embedding models (both billion and million parameter sizes), whereas previously it only supported larger models;

I would also like to request function calling/tools support in the OpenAI frontend, as well as structured output.

huge huge work!

chorus-over-flanger avatar Dec 02 '24 07:12 chorus-over-flanger

You cooked on this one @rmccorm4

ishandhanani avatar Jan 11 '25 00:01 ishandhanani

Hi, is there a timeline for this PR? This would be such a great feature 🤩

DimadonDL avatar Feb 28 '25 22:02 DimadonDL

hi @DimadonDL, feel free to try this functionality in our latest 25.02 vllm or trtllm containers. How-tos can be found here: https://github.com/triton-inference-server/server/tree/main/python/openai

oandreeva-nv avatar Mar 03 '25 20:03 oandreeva-nv

Hi @oandreeva-nv,

Thanks. I have tested this feature. Do you know when it will leave beta status?

DimadonDL avatar Mar 03 '25 20:03 DimadonDL