[Doc]: Latency vs Throughput Configurations
📚 The doc issue
Context: During the July 9, 2024 vLLM open office hours (FP8), several questions were raised about how to optimize model deployment and inference configurations for the two major regimes: latency and throughput (batch processing). A relevant article covering the same discussion is Efficiently Scaling Transformer Inference, which explores batch size, chip count, and context length. In addition, we should explore the full set of vLLM features (e.g. optimized kernels, quantization strategies, pipeline/tensor/sequence parallelism).
Suggest a potential alternative/fix
Targets: Create documentation that makes explicit which configurations are suitable for each regime, listing their constraints and tradeoffs. Creating this documentation should also add new benchmarking and experimental scripts so the results can be reproduced. At the same time, this issue will list the set of compatible flags, helping users recognize invalid deployment configurations.
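
To make the tradeoff concrete, below is a minimal, hypothetical sketch of how the two regimes might differ at the engine-argument level. The model name and the specific values (parallel size, batch limits, memory fraction) are assumptions for illustration only; the documentation proposed here should replace them with benchmarked recommendations per hardware and model size.

```python
"""Illustrative sketch (not official guidance): contrasting engine arguments
for a latency-oriented vs a throughput-oriented vLLM deployment."""
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model, assumption

# Latency regime: spread the model across more GPUs (tensor parallelism) and
# cap concurrent sequences so each request spends little time queued.
latency_args = dict(
    model=MODEL,
    tensor_parallel_size=2,
    max_num_seqs=8,
    gpu_memory_utilization=0.85,
)

# Throughput regime: admit many sequences per scheduling step and use FP8
# quantization (the office-hours topic) to leave more memory for KV cache.
throughput_args = dict(
    model=MODEL,
    quantization="fp8",
    max_num_seqs=256,
    gpu_memory_utilization=0.95,
)

if __name__ == "__main__":
    # Build one engine at a time; both in one process would contend for GPU memory.
    llm = LLM(**latency_args)
    params = SamplingParams(temperature=0.0, max_tokens=64)
    out = llm.generate(["Summarize the benefits of FP8 quantization."], params)
    print(out[0].outputs[0].text)
```

A benchmarking script in the docs could sweep these same arguments and report TTFT/ITL for the latency regime and tokens/s for the throughput regime, making the constraints and tradeoffs explicit.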