[REQUEST] Does DeepSpeed support dynamic batching during inference?
- Model size: 6B
- GPU memory in use: 24 GB
- GPU type: A100 40 GB
- Latency per request: 3 s
When I use DeepSpeed for single-GPU inference, QPS does not exceed 2 and GPU utilization is about 52%. When will DeepSpeed support a dynamic batch size so that GPU utilization can be improved?
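For context, my setup looks roughly like the following minimal sketch (the model name is a placeholder, and the exact `deepspeed.init_inference` arguments may differ across DeepSpeed versions):

```python
# Illustrative single-GPU DeepSpeed inference setup (placeholder model name;
# argument names may vary between DeepSpeed versions).
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-6b-model"  # placeholder for the 6B model described above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Wrap the model in the DeepSpeed inference engine (single GPU, fp16 kernels).
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# Requests are served one at a time, so there is no request-level batching and
# the GPU sits partly idle between requests.
inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```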
You can check NVIDIA Triton, which supports dynamic batching and other features to increase your GPU utilization.
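If it helps, Triton's dynamic batching is enabled per model in its `config.pbtxt`; a rough excerpt with illustrative values:

```
# Excerpt from a Triton model configuration (config.pbtxt); values are illustrative.
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 5000
}
```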
I'm not sure where/how DeepSpeed could support this without extending its scope considerably.
You can look at AWS's DeepJavaLibrary Serving - https://github.com/deepjavalibrary/djl-serving
It uses Netty/Java to dispatch requests to the inference backend and can be configured to batch requests dynamically based on a time window. DJL Serving is battle-tested with DeepSpeed / PyTorch in high-load inference environments.
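To illustrate the time-window idea (a simplified sketch only, not DJL Serving's actual implementation): incoming requests are queued and flushed as one batch when either the batch is full or the oldest request has waited past the configured window.

```python
# Simplified sketch of time-window dynamic batching (illustrative only).
import queue
import threading
import time


class DynamicBatcher:
    """Collect requests into batches bounded by a max size and a time window."""

    def __init__(self, infer_fn, max_batch_size=8, max_delay=0.01):
        self.infer_fn = infer_fn              # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size  # flush when the batch is full ...
        self.max_delay = max_delay            # ... or when this window (seconds) elapses
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        """Enqueue one request; returns an Event to wait on and a result holder."""
        done, holder = threading.Event(), {}
        self.requests.put((item, done, holder))
        return done, holder

    def _loop(self):
        while True:
            batch = [self.requests.get()]            # block until the first request arrives
            deadline = time.time() + self.max_delay
            while len(batch) < self.max_batch_size:  # fill up until full or window closes
                remaining = deadline - time.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            outputs = self.infer_fn([item for item, _, _ in batch])  # one batched call
            for (_, done, holder), out in zip(batch, outputs):
                holder["result"] = out
                done.set()


# Usage sketch:
#   batcher = DynamicBatcher(lambda xs: [x.upper() for x in xs])
#   done, holder = batcher.submit("hello"); done.wait(); print(holder["result"])
```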
https://djl.ai/ - umbrella project
These are some great tutorials on how to use DJL Serving / DeepSpeed for inference - https://github.com/aws/amazon-sagemaker-examples/tree/main/inference/generativeai/llm-workshop
thx.
@colynhn @trianxy @stan-kirdey if it's not too late: I am building dynamic batch sizes (and corresponding LR scaling) into DeepSpeed in PR 5237, as part of the data analysis module. Stay tuned.
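To illustrate what I mean by the corresponding LR scaling, here is a simplified sketch (not the PR's actual implementation), using the linear scaling rule against a reference batch size:

```python
# Simplified sketch of batch-size-dependent LR scaling (not the PR's actual code).
# Linear scaling is shown; square-root scaling (multiplying by the square root of
# the ratio) is a common alternative for adaptive optimizers.
def scaled_lr(base_lr: float, reference_batch_size: int, current_batch_size: int) -> float:
    """Scale the learning rate linearly with the current batch size."""
    return base_lr * (current_batch_size / reference_batch_size)


# Example: base LR 1e-4 tuned for batch size 32; the current step uses a batch of 48.
print(scaled_lr(1e-4, 32, 48))  # ~1.5e-4
```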