[Doc]: Latency vs Throughput Configurations
📚 The doc issue
Context: During the July 9, 2024 vLLM open office hours (FP8), several questions were raised about how to optimize model deployment and inference configurations for the two major regimes: latency and throughput (batch processing). A relevant article covering the same discussion is Efficiently Scaling Transformer Inference, which explores batch size, chip count, and context length. In addition, we should explore the full set of vLLM features (e.g. optimized kernels, quantization strategies, pipeline/tensor/sequence parallelism).
Suggest a potential alternative/fix
Targets: Create documentation that makes explicit which configurations are suitable for each regime, listing their constraints and tradeoffs. Creating this documentation should also add new benchmarking and experimental scripts so the results can be reproduced. At the same time, this issue will list the set of compatible flags, helping users recognize invalid deployment configurations.
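
To make the tradeoff concrete, below is a minimal, hypothetical sketch of how the two regimes might differ at the engine-argument level. The model name and the specific values (parallel size, batch limits, memory fraction) are assumptions for illustration only; the documentation proposed here should replace them with benchmarked recommendations per hardware and model size.

```python
"""Illustrative sketch (not official guidance): contrasting engine arguments
for a latency-oriented vs a throughput-oriented vLLM deployment."""
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model, assumption

# Latency regime: spread the model across more GPUs (tensor parallelism) and
# cap concurrent sequences so each request spends little time queued.
latency_args = dict(
    model=MODEL,
    tensor_parallel_size=2,
    max_num_seqs=8,
    gpu_memory_utilization=0.85,
)

# Throughput regime: admit many sequences per scheduling step and use FP8
# quantization (the office-hours topic) to leave more memory for KV cache.
throughput_args = dict(
    model=MODEL,
    quantization="fp8",
    max_num_seqs=256,
    gpu_memory_utilization=0.95,
)

if __name__ == "__main__":
    # Build one engine at a time; both in one process would contend for GPU memory.
    llm = LLM(**latency_args)
    params = SamplingParams(temperature=0.0, max_tokens=64)
    out = llm.generate(["Summarize the benefits of FP8 quantization."], params)
    print(out[0].outputs[0].text)
```

A benchmarking script in the docs could sweep these same arguments and report TTFT/ITL for the latency regime and tokens/s for the throughput regime, making the constraints and tradeoffs explicit.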