
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Results: 149 DeepSpeed-MII issues

Looking at the engine and I would like to run inference with the **Mixtral model** while doing _expert parallelism_. I see that **DeepSpeed** itself seems to have some support but...

I am running DeepSpeed-MII on a system with two NVIDIA A100X GPUs, using the following simple latency benchmark code for inference: ```python import math import time import mii...
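The MII-specific code in the preview is truncated, but the measurement side of such a benchmark can be sketched generically. The following harness times any callable (for example, the pipeline object returned by `mii.pipeline(...)`); the `generate` parameter is a stand-in, not part of the MII API.

```python
import time
import statistics

def benchmark_latency(generate, prompts, warmup=2, iters=10):
    """Measure per-call generation latency for any callable.

    `generate` is a stand-in for a deployed pipeline's call
    (e.g. a mii.pipeline(...) object); any callable works.
    """
    # warm-up calls are excluded so one-time setup cost is not measured
    for _ in range(warmup):
        generate(prompts)

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        generate(prompts)
        samples.append(time.perf_counter() - start)

    return {
        "mean_s": statistics.mean(samples),
        "p50_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }
```

Using `time.perf_counter` (a monotonic, high-resolution clock) rather than `time.time` avoids skew from wall-clock adjustments during the run.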

Is there any way to retrieve logprobs in DeepSpeed-MII, just as we can in vLLM by specifying options? My intention is to find out how the output from deepspeed-mii,...

Hi DeepSpeed team, Thank you for your great work! As the title suggests, the "01-ai/Yi-34B-Chat" model cannot run properly with DeepSpeed-MII version 0.2.3. The encountered error message is as follows:...

**Streaming:** Is there a way to apply streaming? I want to send a query to the server using curl and receive the results token by token. However, I couldn't find...
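Whatever the server-side answer, the core of token-by-token streaming is yielding each token as it is produced instead of returning the full completion at once. A generic sketch, where `generate_step` is a hypothetical callable standing in for whichever incremental-decoding API a serving stack exposes:

```python
def stream_tokens(generate_step, prompt, max_new_tokens=8):
    """Yield each new token as soon as it is produced.

    `generate_step(prompt, produced)` is a hypothetical callable that
    returns the next token given what has been generated so far, or
    None when generation is finished.
    """
    produced = []
    for _ in range(max_new_tokens):
        tok = generate_step(prompt, produced)
        if tok is None:
            break  # end-of-sequence
        produced.append(tok)
        yield tok
```

On the HTTP side this generator would typically back a chunked or server-sent-events response, so a curl client sees tokens arrive incrementally.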

MII invokes [`MIIAsyncPipeline`](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/batching/ragged_batching.py#L635) for persistent deployments. During runtime, requests are passed from `GeneratorReply` to the backend model [through](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/grpc_related/modelresponse_server.py#L73) the `MIIAsyncPipeline.put_request` function. In this function, it [requests](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/batching/ragged_batching.py#L675) a `uid` for each...
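The uid-per-request bookkeeping described above can be sketched generically. The class below is a toy illustration of the pattern (a counter handing out unique ids and a per-uid result queue), loosely modeled on the `put_request` flow; the names and structure are illustrative, not MII's actual code.

```python
import itertools
import queue
import threading

class AsyncRequestTracker:
    """Toy sketch of uid-per-request bookkeeping for an async pipeline."""

    def __init__(self):
        self._uids = itertools.count()   # monotonically increasing uids
        self._lock = threading.Lock()
        self._results = {}               # uid -> queue of generated outputs

    def put_request(self, prompt):
        # allocate a fresh uid and a result queue for this request
        with self._lock:
            uid = next(self._uids)
            self._results[uid] = queue.Queue()
        # a real engine would enqueue (uid, prompt) for the batcher here
        return uid

    def post_result(self, uid, text):
        # called by the backend when output for this uid is ready
        self._results[uid].put(text)

    def get_result(self, uid):
        # blocks until the backend posts output for this uid
        return self._results[uid].get()
```

Routing results through a per-uid queue is what lets many concurrent gRPC callers share one batched engine without mixing up responses.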

This PR implements reuse of the KV cache across multiple requests. You can set `enable_prefix_cache` to `True` in `RaggedInferenceEngineConfig` to enable this feature. ```python config = RaggedInferenceEngineConfig(enable_prefix_cache=True) ``` This feature keeps...
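The idea behind prefix caching can be illustrated with a small sketch: map token-id prefixes to their (placeholder) KV state, and on a new request reuse the longest cached prefix so only the remaining tokens need a fresh forward pass. This mirrors the concept behind `enable_prefix_cache`, not MII's actual implementation.

```python
class PrefixKVCache:
    """Illustrative sketch of KV-cache reuse keyed by token-id prefix."""

    def __init__(self):
        self._cache = {}  # tuple(token_ids) -> cached KV state (placeholder)

    def insert(self, token_ids, kv_state):
        # remember the KV state computed for this exact prefix
        self._cache[tuple(token_ids)] = kv_state

    def longest_prefix(self, token_ids):
        # walk from the full sequence down to find the longest cached hit;
        # tokens past the hit still need a fresh forward pass
        for end in range(len(token_ids), 0, -1):
            key = tuple(token_ids[:end])
            if key in self._cache:
                return key, self._cache[key]
        return (), None
```

Requests that share a long system prompt thus skip recomputing attention for the shared portion, which is where the latency savings come from.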