LLMPilot: Generate the best deployment configuration for model + GPU combination
🚀 Feature Description and Motivation
For the 33B model deployment, we have a few GPU options: A10, V100-32GiB, L20, and L40. Technically, we can launch an instance using any of the M * N combinations of GPU count and GPU type. However, we need to evaluate the optimal plan for the given latency/throughput/cost goals.
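For illustration, a minimal Python sketch of enumerating that M * N search space; the GPU counts and the plan structure here are assumptions for this example, not part of any existing tool:

```python
from itertools import product

# Hypothetical illustration only: enumerate candidate deployment plans
# as (GPU type, GPU count) pairs to be benchmarked.
GPU_TYPES = ["A10", "V100-32GiB", "L20", "L40"]  # options from this issue
GPU_COUNTS = [1, 2, 4, 8]                        # assumed counts

candidate_plans = [
    {"gpu_type": t, "gpu_count": n} for t, n in product(GPU_TYPES, GPU_COUNTS)
]
print(f"{len(candidate_plans)} candidate plans to evaluate")  # 16
```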
Selecting the optimal GPU deployment for a model is a complex task that requires careful evaluation of these key metrics. By running benchmarks, analyzing costs, and considering community input, we can make an informed decision that meets our project goals. This RFC serves as a starting point for the discussion and invites contributions from all stakeholders.
Use Case
As a user, I want to know the best GPU types for running a specific model.
Proposed Solution
- Benchmark tools + benchmark datasets (pluggable)
- Experiment plans
- Generate results (see the sketch after this list)
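To make the discussion concrete, here is a minimal sketch of how these pieces could fit together; all names (`ExperimentPlan`, `BenchmarkResult`, `run_plans`) are hypothetical and do not come from the VKE tools:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces for a pluggable benchmark pipeline.

@dataclass
class ExperimentPlan:
    model: str
    gpu_type: str
    gpu_count: int
    context_length: int
    tensor_parallel: int

@dataclass
class BenchmarkResult:
    plan: ExperimentPlan
    p99_latency_ms: float
    throughput_tokens_per_s: float
    hourly_cost_usd: float

def run_plans(
    plans: List[ExperimentPlan],
    benchmark: Callable[[ExperimentPlan], BenchmarkResult],
) -> List[BenchmarkResult]:
    """Run each experiment plan through a pluggable benchmark tool
    and collect the results for later comparison."""
    return [benchmark(plan) for plan in plans]
```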
The VKE team already has some tools; we should review and evaluate that work.
Model parameters such as context length and parallelism can differ across runs, which makes it harder to obtain an apples-to-apples comparison.
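One way to mitigate this, sketched below under assumed baseline values, is to pin the model-level parameters so that only the GPU configuration varies between runs:

```python
from typing import Dict, List

# Hypothetical sketch: fix the model-level parameters so that only the
# GPU configuration differs between runs; the baseline values below are
# assumptions for this example, not recommendations.
FIXED_PARAMS: Dict[str, int] = {"context_length": 4096, "tensor_parallel": 1}

def comparable(plans: List[dict]) -> List[dict]:
    """Keep only plans whose model parameters match the fixed baseline,
    so their results can be compared apples-to-apples."""
    return [
        plan for plan in plans
        if all(plan.get(key) == value for key, value in FIXED_PARAMS.items())
    ]
```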
We should leverage the deepseek-33b case to refine the solution here. @kr11 Let's have a short discussion tomorrow on the next steps. VKE will publish their tools, and we can probably leverage the parameter tuning in the long run.
In v0.1.0, we should focus on using, polishing, and improving the existing tools built by VKE.
Parameter tuning and profiling would be advanced features that we plan to work on in v0.2.0.
These are the auto-tuning and profiling related stories. We also came up with ideas like LLMPilot; v0.3.0 is too tight for this story, so it can be postponed to v0.4.0.