LLMPilot: Generate the best deployment configuration for model + GPU combination
🚀 Feature Description and Motivation
For the 33B model deployment, we have a few GPU options: A10, V100-32GiB, L20, and L40. Technically, we can launch an instance using any of the M * N combinations of GPU count and GPU type. However, we need to evaluate the optimal plan for the given latency/throughput/cost goals.
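For illustration, a minimal Python sketch of enumerating that M * N search space; the GPU counts and the plan structure here are assumptions for this example, not part of any existing tool:

```python
from itertools import product

# Hypothetical illustration only: enumerate candidate deployment plans
# as (GPU type, GPU count) pairs to be benchmarked.
GPU_TYPES = ["A10", "V100-32GiB", "L20", "L40"]  # options from this issue
GPU_COUNTS = [1, 2, 4, 8]                        # assumed counts

candidate_plans = [
    {"gpu_type": t, "gpu_count": n} for t, n in product(GPU_TYPES, GPU_COUNTS)
]
print(f"{len(candidate_plans)} candidate plans to evaluate")  # 16
```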
Selecting the optimal GPU deployment for a model is a complex task that requires careful evaluation of these key metrics. By running benchmarks, analyzing costs, and considering community input, we can make an informed decision that meets our project goals. This RFC serves as a starting point for the discussion and invites contributions from all stakeholders.
Use Case
As a user, I want to know the best GPU types for running a specific model.
Proposed Solution
- Benchmark tools + benchmark datasets (pluggable)
- Experiment plans
- Generate results (see the sketch after this list)
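To make the discussion concrete, here is a minimal sketch of how these pieces could fit together; all names (`ExperimentPlan`, `BenchmarkResult`, `run_plans`) are hypothetical and do not come from the VKE tools:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces for a pluggable benchmark pipeline.

@dataclass
class ExperimentPlan:
    model: str
    gpu_type: str
    gpu_count: int
    context_length: int
    tensor_parallel: int

@dataclass
class BenchmarkResult:
    plan: ExperimentPlan
    p99_latency_ms: float
    throughput_tokens_per_s: float
    hourly_cost_usd: float

def run_plans(
    plans: List[ExperimentPlan],
    benchmark: Callable[[ExperimentPlan], BenchmarkResult],
) -> List[BenchmarkResult]:
    """Run each experiment plan through a pluggable benchmark tool
    and collect the results for later comparison."""
    return [benchmark(plan) for plan in plans]
```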
The VKE team already has some tools; we should review and evaluate that work.
Model parameters such as context length and parallelism can differ across runs, which makes it harder to obtain an apples-to-apples comparison.
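One way to mitigate this, sketched below under assumed baseline values, is to pin the model-level parameters so that only the GPU configuration varies between runs:

```python
from typing import Dict, List

# Hypothetical sketch: fix the model-level parameters so that only the
# GPU configuration differs between runs; the baseline values below are
# assumptions for this example, not recommendations.
FIXED_PARAMS: Dict[str, int] = {"context_length": 4096, "tensor_parallel": 1}

def comparable(plans: List[dict]) -> List[dict]:
    """Keep only plans whose model parameters match the fixed baseline,
    so their results can be compared apples-to-apples."""
    return [
        plan for plan in plans
        if all(plan.get(key) == value for key, value in FIXED_PARAMS.items())
    ]
```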
We should leverage the deepseek-33b case to refine the solution here. @kr11 Let's have a short discussion tomorrow on the next steps. VKE will publish their tools, and we can probably leverage the parameter tuning in the long run.
In v0.1.0, we should focus on using, polishing, and improving the existing tools built by VKE.
Parameter tuning and profiling would be advanced features that we plan to work on in v0.2.0.
These are the auto-tuning and profiling related stories. We also came up with ideas like LLMPilot; v0.3.0 is too tight for this story, so it can be postponed to v0.4.0.