DeepSpeed-MII
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
I am looking at the engine and would like to run inference with the **Mixtral model** using _expert parallelism_. I see that **DeepSpeed** itself seems to have some support, but...
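For context on what expert parallelism means for a model like Mixtral, here is a toy sketch in pure Python (no DeepSpeed/MII APIs; the partitioning scheme and function names are illustrative assumptions): each rank owns a disjoint subset of the MoE experts, and tokens are routed to whichever rank holds the expert the gate selected.

```python
# Toy sketch of expert parallelism for a Mixture-of-Experts layer.
# Conceptual illustration only -- no DeepSpeed/MII APIs are used.

NUM_EXPERTS = 8   # Mixtral-8x7B has 8 experts per MoE layer
WORLD_SIZE = 2    # e.g. two GPUs

def expert_to_rank(expert_id: int) -> int:
    """Contiguous partition: experts 0-3 -> rank 0, experts 4-7 -> rank 1."""
    experts_per_rank = NUM_EXPERTS // WORLD_SIZE
    return expert_id // experts_per_rank

def route_tokens(gate_choices: list[int]) -> dict[int, list[int]]:
    """Group token indices by the rank that owns their selected expert
    (a real system does this with an all-to-all communication step)."""
    buckets: dict[int, list[int]] = {r: [] for r in range(WORLD_SIZE)}
    for token_idx, expert_id in enumerate(gate_choices):
        buckets[expert_to_rank(expert_id)].append(token_idx)
    return buckets

# Gate picks one expert per token (top-1 routing for simplicity;
# Mixtral actually uses top-2).
gate = [0, 5, 3, 7, 2]
print(route_tokens(gate))  # tokens 0, 2, 4 stay on rank 0; tokens 1, 3 go to rank 1
```

In a real engine the routing is followed by an all-to-all exchange of hidden states, but the ownership mapping above is the core idea.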
I am running DeepSpeed-MII on a system with two NVIDIA A100X GPUs. I am running the following simple latency benchmark code for inference:

```python
import math
import time
import mii
...
```
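Since the benchmark snippet above is truncated, here is a minimal sketch of the usual shape of such a latency harness. The `fake_generate` stub is a hypothetical stand-in for a real MII pipeline call; substitute your own pipeline object where noted.

```python
import statistics
import time

def benchmark_latency(generate, prompts, warmup=2, iters=5):
    """Time repeated calls to a generate() callable.
    `generate` stands in for an inference call such as a
    DeepSpeed-MII pipeline invocation (replace the stub below)."""
    for _ in range(warmup):              # warm-up passes: exclude kernel/JIT setup
        generate(prompts)
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        generate(prompts)
        times.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(times), "p50_s": statistics.median(times)}

# Stub so the harness runs anywhere; in practice you would build a real
# pipeline (e.g. `pipe = mii.pipeline(model_name)`) and time `pipe(prompts, ...)`.
def fake_generate(prompts):
    time.sleep(0.001)

stats = benchmark_latency(fake_generate, ["hello"] * 4)
print(stats)
```

Using `time.perf_counter()` and discarding warm-up iterations avoids the most common sources of skew in single-process latency numbers.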
Is there any way to retrieve logprobs in DeepSpeed-MII, just as we can retrieve logprobs by specifying options in vLLM? My intention is to find out how the output from DeepSpeed-MII,...
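For reference, the quantity being asked about is the per-token log-probability, which is a log-softmax over the model's logits at each step. Here is a small self-contained sketch of that computation (pure Python, illustrative only; it does not use the vLLM or MII APIs):

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable:
    subtract the max before exponentiating)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_logprob(logits, token_id):
    """Log-probability the model assigned to `token_id` at this step."""
    return log_softmax(logits)[token_id]

logits = [2.0, 1.0, 0.0]
lp = token_logprob(logits, 0)
print(round(math.exp(lp), 3))  # → 0.665, the probability of token 0
```

In vLLM this is exposed per sampled token via the `logprobs` option on `SamplingParams`; the question is whether MII exposes an equivalent.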
Hi DeepSpeed team, thank you for your great work! As the title suggests, the "01-ai/Yi-34B-Chat" model cannot run properly with DeepSpeed-MII version 0.2.3. The error message encountered is as follows:...
**Streaming:** Is there a way to apply streaming? I want to send a query to the server using curl and receive the results token by token. However, I couldn't find...
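Whatever the server-side answer turns out to be, the pattern being asked for is incremental token delivery rather than one final response. A minimal generator-based sketch of that pattern (pure Python, illustrative only; not the MII API):

```python
def stream_tokens(text, delimiter=" "):
    """Toy server-side generator: yield the reply token by token,
    the way a streaming endpoint emits chunks (e.g. SSE `data:` lines)."""
    for tok in text.split(delimiter):
        yield tok  # a real server would flush each chunk to the client here

# Client side: consume incrementally instead of waiting for the full reply.
received = []
for tok in stream_tokens("DeepSpeed MII streams tokens"):
    received.append(tok)   # e.g. print(tok, end=" ", flush=True)

print(received)
```

With curl against an SSE-style endpoint, the equivalent is reading the response body chunk by chunk as it arrives (`curl -N`) rather than buffering it.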
MII invokes [`MIIAsyncPipeline`](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/batching/ragged_batching.py#L635) for persistent deployments. At runtime, requests are passed from `GeneratorReply` to the backend model [through](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/grpc_related/modelresponse_server.py#L73) the `MIIAsyncPipeline.put_request` function. In this function, it [requests](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/batching/ragged_batching.py#L675) a `uid` for each...
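The uid-per-request pattern described above can be sketched as follows. This is an illustrative stand-in, not the actual `MIIAsyncPipeline` implementation (`AsyncRequestBroker` and its method names are invented for the example): each `put_request` is assigned a unique id, and the result is routed back to the matching caller through a per-uid queue.

```python
import itertools
import queue

class AsyncRequestBroker:
    """Minimal sketch of uid-based request routing (hypothetical names)."""

    def __init__(self):
        self._uids = itertools.count()
        self._results: dict[int, queue.Queue] = {}
        self.pending: queue.Queue = queue.Queue()

    def put_request(self, prompt: str) -> int:
        uid = next(self._uids)                # one unique id per request
        self._results[uid] = queue.Queue()
        self.pending.put((uid, prompt))       # backend model consumes this
        return uid

    def put_result(self, uid: int, text: str) -> None:
        self._results[uid].put(text)          # backend posts the generation

    def get_response(self, uid: int) -> str:
        return self._results[uid].get()       # caller blocks on its own uid

broker = AsyncRequestBroker()
uid = broker.put_request("hello")
# Pretend the backend drained `pending` and generated a reply:
req_uid, prompt = broker.pending.get()
broker.put_result(req_uid, prompt.upper())
print(broker.get_response(uid))  # → HELLO
```

The uid is what lets many concurrent callers share one batched backend without their responses getting crossed.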
support stream
support Qwen1.5
This PR implements reuse of the KV cache across multiple requests. You can set `enable_prefix_cache` to `True` in `RaggedInferenceEngineConfig` to enable this feature:

```python
config = RaggedInferenceEngineConfig(enable_prefix_cache=True)
```

This feature keeps...
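To illustrate what prefix caching buys, here is a toy model of the idea in pure Python (an illustrative sketch, not the actual DeepSpeed implementation): once a prompt prefix has been prefilled, a later request sharing that prefix only needs to compute the tail.

```python
class ToyPrefixCache:
    """Toy model of KV-cache reuse: remember every token prefix that has
    been prefilled, and skip recomputation for cached prefixes."""

    def __init__(self):
        self._cache: set[tuple[str, ...]] = set()
        self.token_steps = 0  # stands in for attention forward steps

    def prefill(self, tokens: list[str]) -> int:
        """Return how many token positions actually had to be computed."""
        key = tuple(tokens)
        new = 0
        for c in range(1, len(key) + 1):
            if key[:c] not in self._cache:
                self.token_steps += 1     # only uncached positions cost work
                new += 1
                self._cache.add(key[:c])
        return new

cache = ToyPrefixCache()
sys_prompt = ["You", "are", "a", "helpful", "assistant."]
print(cache.prefill(sys_prompt + ["Hi!"]))     # → 6: cold start, all computed
print(cache.prefill(sys_prompt + ["Hello!"]))  # → 1: shared prefix reused
```

This is why the feature pays off most for workloads where many requests share a long system prompt.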
coming soon...