DeepSpeed-MII
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
I am looking at the engine and would like to run inference with the **Mixtral model** using _expert parallelism_. I see that **DeepSpeed** itself seems to have some support, but...
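For context on what expert parallelism means for a model like Mixtral, here is a toy sketch in pure Python (no DeepSpeed/MII APIs; the partitioning scheme and function names are illustrative assumptions): each rank owns a disjoint subset of the MoE experts, and tokens are routed to whichever rank holds the expert the gate selected.

```python
# Toy sketch of expert parallelism for a Mixture-of-Experts layer.
# Conceptual illustration only -- no DeepSpeed/MII APIs are used.

NUM_EXPERTS = 8   # Mixtral-8x7B has 8 experts per MoE layer
WORLD_SIZE = 2    # e.g. two GPUs

def expert_to_rank(expert_id: int) -> int:
    """Contiguous partition: experts 0-3 -> rank 0, experts 4-7 -> rank 1."""
    experts_per_rank = NUM_EXPERTS // WORLD_SIZE
    return expert_id // experts_per_rank

def route_tokens(gate_choices: list[int]) -> dict[int, list[int]]:
    """Group token indices by the rank that owns their selected expert
    (a real system does this with an all-to-all communication step)."""
    buckets: dict[int, list[int]] = {r: [] for r in range(WORLD_SIZE)}
    for token_idx, expert_id in enumerate(gate_choices):
        buckets[expert_to_rank(expert_id)].append(token_idx)
    return buckets

# Gate picks one expert per token (top-1 routing for simplicity;
# Mixtral actually uses top-2).
gate = [0, 5, 3, 7, 2]
print(route_tokens(gate))  # tokens 0, 2, 4 stay on rank 0; tokens 1, 3 go to rank 1
```

In a real engine the routing is followed by an all-to-all exchange of hidden states, but the ownership mapping above is the core idea.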
I am running DeepSpeed-MII on a system with two NVIDIA A100X GPUs. I am running the following simple latency benchmark code for inference:

```python
import math
import time
import mii
...
```
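Since the benchmark snippet above is truncated, here is a minimal sketch of the usual shape of such a latency harness. The `fake_generate` stub is a hypothetical stand-in for a real MII pipeline call; substitute your own pipeline object where noted.

```python
import statistics
import time

def benchmark_latency(generate, prompts, warmup=2, iters=5):
    """Time repeated calls to a generate() callable.
    `generate` stands in for an inference call such as a
    DeepSpeed-MII pipeline invocation (replace the stub below)."""
    for _ in range(warmup):              # warm-up passes: exclude kernel/JIT setup
        generate(prompts)
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        generate(prompts)
        times.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(times), "p50_s": statistics.median(times)}

# Stub so the harness runs anywhere; in practice you would build a real
# pipeline (e.g. `pipe = mii.pipeline(model_name)`) and time `pipe(prompts, ...)`.
def fake_generate(prompts):
    time.sleep(0.001)

stats = benchmark_latency(fake_generate, ["hello"] * 4)
print(stats)
```

Using `time.perf_counter()` and discarding warm-up iterations avoids the most common sources of skew in single-process latency numbers.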
Is there any way to retrieve logprobs in DeepSpeed-MII, just as we can retrieve logprobs by specifying options in vLLM? My intention is to find out how the output from DeepSpeed-MII,...
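For reference, the quantity being asked about is the per-token log-probability, which is a log-softmax over the model's logits at each step. Here is a small self-contained sketch of that computation (pure Python, illustrative only; it does not use the vLLM or MII APIs):

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable:
    subtract the max before exponentiating)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_logprob(logits, token_id):
    """Log-probability the model assigned to `token_id` at this step."""
    return log_softmax(logits)[token_id]

logits = [2.0, 1.0, 0.0]
lp = token_logprob(logits, 0)
print(round(math.exp(lp), 3))  # → 0.665, the probability of token 0
```

In vLLM this is exposed per sampled token via the `logprobs` option on `SamplingParams`; the question is whether MII exposes an equivalent.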
Hi DeepSpeed team, thank you for your great work! As the title suggests, the "01-ai/Yi-34B-Chat" model cannot run properly with DeepSpeed-MII version 0.2.3. The error message encountered is as follows:...
**Streaming:** Is there a way to apply streaming? I want to send a query to the server using curl and receive the results token by token. However, I couldn't find...
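Whatever the server-side answer turns out to be, the pattern being asked for is incremental token delivery rather than one final response. A minimal generator-based sketch of that pattern (pure Python, illustrative only; not the MII API):

```python
def stream_tokens(text, delimiter=" "):
    """Toy server-side generator: yield the reply token by token,
    the way a streaming endpoint emits chunks (e.g. SSE `data:` lines)."""
    for tok in text.split(delimiter):
        yield tok  # a real server would flush each chunk to the client here

# Client side: consume incrementally instead of waiting for the full reply.
received = []
for tok in stream_tokens("DeepSpeed MII streams tokens"):
    received.append(tok)   # e.g. print(tok, end=" ", flush=True)

print(received)
```

With curl against an SSE-style endpoint, the equivalent is reading the response body chunk by chunk as it arrives (`curl -N`) rather than buffering it.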
MII invokes [`MIIAsyncPipeline`](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/batching/ragged_batching.py#L635) for persistent deployments. At runtime, requests are passed from `GeneratorReply` to the backend model [through](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/grpc_related/modelresponse_server.py#L73) the `MIIAsyncPipeline.put_request` function. In this function, it [requests](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/batching/ragged_batching.py#L675) a `uid` for each...
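The uid-per-request pattern described above can be sketched as follows. This is an illustrative stand-in, not the actual `MIIAsyncPipeline` implementation (`AsyncRequestBroker` and its method names are invented for the example): each `put_request` is assigned a unique id, and the result is routed back to the matching caller through a per-uid queue.

```python
import itertools
import queue

class AsyncRequestBroker:
    """Minimal sketch of uid-based request routing (hypothetical names)."""

    def __init__(self):
        self._uids = itertools.count()
        self._results: dict[int, queue.Queue] = {}
        self.pending: queue.Queue = queue.Queue()

    def put_request(self, prompt: str) -> int:
        uid = next(self._uids)                # one unique id per request
        self._results[uid] = queue.Queue()
        self.pending.put((uid, prompt))       # backend model consumes this
        return uid

    def put_result(self, uid: int, text: str) -> None:
        self._results[uid].put(text)          # backend posts the generation

    def get_response(self, uid: int) -> str:
        return self._results[uid].get()       # caller blocks on its own uid

broker = AsyncRequestBroker()
uid = broker.put_request("hello")
# Pretend the backend drained `pending` and generated a reply:
req_uid, prompt = broker.pending.get()
broker.put_result(req_uid, prompt.upper())
print(broker.get_response(uid))  # → HELLO
```

The uid is what lets many concurrent callers share one batched backend without their responses getting crossed.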
support stream
support Qwen1.5
This PR implements reuse of the KV cache across multiple requests. You can set `enable_prefix_cache` to `True` in `RaggedInferenceEngineConfig` to enable this feature:

```python
config = RaggedInferenceEngineConfig(enable_prefix_cache=True)
```

This feature keeps...
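To illustrate what prefix caching buys, here is a toy model of the idea in pure Python (an illustrative sketch, not the actual DeepSpeed implementation): once a prompt prefix has been prefilled, a later request sharing that prefix only needs to compute the tail.

```python
class ToyPrefixCache:
    """Toy model of KV-cache reuse: remember every token prefix that has
    been prefilled, and skip recomputation for cached prefixes."""

    def __init__(self):
        self._cache: set[tuple[str, ...]] = set()
        self.token_steps = 0  # stands in for attention forward steps

    def prefill(self, tokens: list[str]) -> int:
        """Return how many token positions actually had to be computed."""
        key = tuple(tokens)
        new = 0
        for c in range(1, len(key) + 1):
            if key[:c] not in self._cache:
                self.token_steps += 1     # only uncached positions cost work
                new += 1
                self._cache.add(key[:c])
        return new

cache = ToyPrefixCache()
sys_prompt = ["You", "are", "a", "helpful", "assistant."]
print(cache.prefill(sys_prompt + ["Hi!"]))     # → 6: cold start, all computed
print(cache.prefill(sys_prompt + ["Hello!"]))  # → 1: shared prefix reused
```

This is why the feature pays off most for workloads where many requests share a long system prompt.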
coming soon...