[REQUEST] Does DeepSpeed support dynamic batching during inference?
- Model size: 6B
- GPU memory in use: 24 GB
- GPU type: A100 40 GB
- Latency per request: 3 s
When I use DeepSpeed for single-GPU inference, QPS does not exceed 2 and GPU utilization is about 52%. When will DeepSpeed support a dynamic batch size so that GPU utilization can be improved?
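For context, my setup looks roughly like the following minimal sketch (the model name is a placeholder, and the exact `deepspeed.init_inference` arguments may differ across DeepSpeed versions):

```python
# Illustrative single-GPU DeepSpeed inference setup (placeholder model name;
# argument names may vary between DeepSpeed versions).
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-6b-model"  # placeholder for the 6B model described above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Wrap the model in the DeepSpeed inference engine (single GPU, fp16 kernels).
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# Requests are served one at a time, so there is no request-level batching and
# the GPU sits partly idle between requests.
inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```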
You can check NVIDIA Triton, which supports dynamic batching and other features to increase your GPU utilization.
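If it helps, Triton's dynamic batching is enabled per model in its `config.pbtxt`; a rough excerpt with illustrative values:

```
# Excerpt from a Triton model configuration (config.pbtxt); values are illustrative.
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 5000
}
```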
I'm not sure where/how DeepSpeed could support this without extending its scope considerably.
You can look at AWS's DeepJavaLibrary Serving - https://github.com/deepjavalibrary/djl-serving
It uses Netty/Java to dispatch requests to the inference backend and can be configured to batch requests dynamically based on a time window. DJL Serving is battle-tested with DeepSpeed / PyTorch in high-load inference environments.
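To illustrate the time-window idea (a simplified sketch only, not DJL Serving's actual implementation): incoming requests are queued and flushed as one batch when either the batch is full or the oldest request has waited past the configured window.

```python
# Simplified sketch of time-window dynamic batching (illustrative only).
import queue
import threading
import time


class DynamicBatcher:
    """Collect requests into batches bounded by a max size and a time window."""

    def __init__(self, infer_fn, max_batch_size=8, max_delay=0.01):
        self.infer_fn = infer_fn              # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size  # flush when the batch is full ...
        self.max_delay = max_delay            # ... or when this window (seconds) elapses
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        """Enqueue one request; returns an Event to wait on and a result holder."""
        done, holder = threading.Event(), {}
        self.requests.put((item, done, holder))
        return done, holder

    def _loop(self):
        while True:
            batch = [self.requests.get()]            # block until the first request arrives
            deadline = time.time() + self.max_delay
            while len(batch) < self.max_batch_size:  # fill up until full or window closes
                remaining = deadline - time.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            outputs = self.infer_fn([item for item, _, _ in batch])  # one batched call
            for (_, done, holder), out in zip(batch, outputs):
                holder["result"] = out
                done.set()


# Usage sketch:
#   batcher = DynamicBatcher(lambda xs: [x.upper() for x in xs])
#   done, holder = batcher.submit("hello"); done.wait(); print(holder["result"])
```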
https://djl.ai/ - umbrella project
These are some great tutorials on how to use DJL Serving / DeepSpeed for inference - https://github.com/aws/amazon-sagemaker-examples/tree/main/inference/generativeai/llm-workshop
thx.
@colynhn @trianxy @stan-kirdey if it's not too late: I am building dynamic batch sizes (and corresponding LR scaling) into DeepSpeed in PR 5237, as part of the data analysis module. Stay tuned.
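To illustrate what I mean by the corresponding LR scaling, here is a simplified sketch (not the PR's actual implementation), using the linear scaling rule against a reference batch size:

```python
# Simplified sketch of batch-size-dependent LR scaling (not the PR's actual code).
# Linear scaling is shown; square-root scaling (multiplying by the square root of
# the ratio) is a common alternative for adaptive optimizers.
def scaled_lr(base_lr: float, reference_batch_size: int, current_batch_size: int) -> float:
    """Scale the learning rate linearly with the current batch size."""
    return base_lr * (current_batch_size / reference_batch_size)


# Example: base LR 1e-4 tuned for batch size 32; the current step uses a batch of 48.
print(scaled_lr(1e-4, 32, 48))  # ~1.5e-4
```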