DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Results: 149 DeepSpeed-MII issues, sorted by recently updated

Hi, I am serving the llama-2-7b-hf model with MII, using tensor-parallel size 1. When the input is not very long, the output is generated properly. However, when the length...
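For reference, a minimal sketch of such a deployment using the persistent `mii.serve` API; the Hugging Face model id `meta-llama/Llama-2-7b-hf` is an assumption, as the exact id is not given above:

```python
# Minimal sketch: a persistent MII deployment on a single GPU.
# The model id is an assumption; substitute your local llama-2-7b-hf path.
import mii

client = mii.serve("meta-llama/Llama-2-7b-hf", tensor_parallel=1)
response = client.generate("DeepSpeed is", max_new_tokens=128)
print(response)
```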

Hi, I served this model from Hugging Face: 01-ai/Yi-6B-200K. When requesting with an input of length 100K, this error occurs:

When I run the examples in [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/blob/master/benchmarks/inference/mii/server.py) to start a server, it occupies all the GPU memory from the beginning. Is it possible to configure the maximum GPU memory that it...

My understanding is that we have to build a FastAPI wrapper: during the initialization phase we call `client = mii.client("mistralai/Mistral-7B-v0.1")`, and we implement a handler that calls `client.generate`.
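A minimal sketch of that wrapper, assuming a MII deployment for `mistralai/Mistral-7B-v0.1` has already been started with `mii.serve`; the route path and request schema are illustrative, not part of MII:

```python
# Sketch of a FastAPI wrapper around a running MII deployment.
# The /generate route and GenerateRequest schema are assumptions; recent
# MII clients return Response objects exposing a generated_text attribute.
import mii
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
client = mii.client("mistralai/Mistral-7B-v0.1")  # connect once at startup

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: GenerateRequest):
    # client.generate accepts a single prompt or a list of prompts
    response = client.generate(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"text": response[0].generated_text}
```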

Hi everyone, I am new to DeepSpeed MII, and I have just made several attempts following `pipeline.py` in the provided examples. Everything works fine initially with small models, such...
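For context, the non-persistent pipeline pattern from those examples looks roughly like this; the model id is an illustrative placeholder:

```python
# Minimal sketch of non-persistent mii.pipeline usage, following the
# pattern of pipeline.py in the MII examples.
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
for r in responses:
    print(r.generated_text)
```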

Thank you for your hard work. I am really excited about MII's performance. I have some questions: is token streaming supported now? If token streaming is supported, I would...
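For reference, a hedged sketch of what streaming looks like in recent MII client versions, assuming a `streaming_fn` callback parameter (verify against your installed release) and an already-running deployment:

```python
# Hedged sketch: token streaming via a callback. The streaming_fn
# parameter is an assumption based on recent MII releases.
import mii

client = mii.client("mistralai/Mistral-7B-v0.1")

def on_token(response):
    # each callback receives the newly generated text fragment(s)
    print(response[0].generated_text, end="", flush=True)

client.generate("DeepSpeed is", max_new_tokens=64, streaming_fn=on_token)
```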

@mrwyattii Using the latest main branch, with llamav2-7b as the test model: when I use tp=4 to test a single-sentence inference, it costs 267.98s, but with tp=1 it costs 7s to...

Please add support for **Mosaic MPT** models and **some other architectures with fewer than 1B parameters.** Also, it would be great if there could be some instructions on how someone can...