
[Feature Request]How to measure the generation throughput(token/s)?

Open wuooo339 opened this issue 9 months ago • 9 comments

Prerequisites

  • [x] I have searched existing issues and reviewed documentation.

Problem Description

I want to measure the throughput of DeepSeek-V2-Lite-Chat under MoE-Infinity on an RTX 4080 Super (16GB). The code I used is below. The average throughput is about 2.935 tokens/s, which is slower than llama.cpp (in my test its decode throughput is 3.99 tokens per second). Is something wrong with my test? I have read your paper, where MoE-Infinity is much faster, yet I got a slower result. Note that device_memory_ratio is 0.7 because I run into a CUDA error if I use a value greater than 70%.

Proposed Solution

Here is my inference code of MoE-infinity:

import torch
import time
import os
from transformers import AutoTokenizer, TextStreamer
from moe_infinity import MoE
os.environ['CUDA_VISIBLE_DEVICES'] = '2'  
user_home = os.path.expanduser('~')
checkpoint = "/share-data/wzk-1/model/deepseek-v2-lite"
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Lite-Chat")
tokenizer.pad_token = tokenizer.eos_token
config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
    "device_memory_ratio": 0.7,  
}
model = MoE(checkpoint, config)
streamer = TextStreamer(tokenizer)


input_texts = [
    "Tell me a story begin with: Once upon a time",
    "Give me an introduction of Bitcon",
    "Translate 'I love you' into at least 10 languages",
    "write a C++ program of QuickSort"
]

inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True)
input_ids = inputs["input_ids"].to("cuda:0")
attention_mask = inputs["attention_mask"].to("cuda:0")

total_time = 0
total_tokens = 0
for i in range(len(input_texts)):
    start_time = time.time()
    output_ids = model.generate(
        input_ids=input_ids[i].unsqueeze(0),
        attention_mask=attention_mask[i].unsqueeze(0),
        streamer=streamer,
        max_new_tokens=256
    )
    end_time = time.time()
    elapsed_time = end_time - start_time
    total_time += elapsed_time
    generated_tokens = len(output_ids[0]) - len(input_ids[i])  # count only newly generated tokens
    total_tokens += generated_tokens
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    decode_throughput = generated_tokens / elapsed_time
    # print(f"Output {i+1}: {output_text}")
    print(f"generated {generated_tokens} using {elapsed_time:.3f} seconds, decode throughput is {decode_throughput:.3f} token/s")
    print("-" * 60)
throughput = total_tokens / total_time
print(f"Total time: {total_time:.3f} seconds")
print(f"Total tokens generated: {total_tokens}")
print(f"Throughput: {throughput:.3f} tokens/second")

Alternatives Considered

No response

Additional Context

Here is one of my outputs using MoE-Infinity. I have tried many inputs, but the decode throughput stays around 2.746 tokens/s.

Translate 'I love you' into at least 10 languages
1. Spanish: Te amo
2. French: Je t'aime
3. German: Ich liebe dich
4. Italian: Ti amo
5. Portuguese: Eu te amo
6. Russian: Я тебя люблю (Ya tebya lyublyu)
7. Chinese (Simplified): 我爱你 (Wǒ ài nǐ)
8. Japanese: 愛してる (Aishiteru)
9. Hindi: मैं तुमसे प्यार करता हूँ (Main tumse pyar karta hoon)
10. Arabic: أحبك (Uhibbuka)<|end▁of▁sentence|>
generated 163 using 55.541 seconds, decode throughput is 2.935 token/s

Here is the output of llama.cpp, which uses '.gguf' files converted from the original model files.

llama_perf_sampler_print:    sampling time =      26.48 ms /   349 runs   (    0.08 ms per token, 13182.25 tokens per second)
llama_perf_context_print:        load time =    5984.14 ms
llama_perf_context_print: prompt eval time =    2251.78 ms /    39 tokens (   57.74 ms per token,    17.32 tokens per second)
llama_perf_context_print:        eval time =  122985.89 ms /   491 runs   (  250.48 ms per token,     3.99 tokens per second)
llama_perf_context_print:       total time =  144492.51 ms /   530 tokens

Importance

Nice to have

Usage Statistics (Optional)

No response

wuooo339 avatar Mar 25 '25 08:03 wuooo339

You have included the prefill time in the decoding throughput, which is not correct; TTFT (time to first token) needs to be excluded. See StopWatch for an example.
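
For reference, a StopWatch-style streamer can be sketched on top of transformers' TextStreamer roughly as below. This is only a minimal sketch for illustration; the StopWatch class shipped with the MoE-Infinity examples may differ in detail.

import time
from transformers import TextStreamer

class StopWatchStreamer(TextStreamer):
    """Split generation time into prefill (TTFT) and decoding."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_prefilling = None
        self.prefilling_time = None
        self.start_decoding = None
        self.decoding_time = None
        self.decoding_iterations = 0

    def put(self, value):
        if self.start_prefilling is None:
            # First call carries the prompt tokens: start the prefill timer.
            self.start_prefilling = time.time()
        elif self.prefilling_time is None:
            # First generated token has arrived: prefill (TTFT) is over,
            # decoding starts now.
            self.prefilling_time = time.time() - self.start_prefilling
            self.start_decoding = time.time()
            self.decoding_iterations += 1
        else:
            self.decoding_iterations += 1
        super().put(value)

    def end(self):
        if self.start_decoding is not None:
            self.decoding_time = time.time() - self.start_decoding
        super().end()

With such a streamer passed to model.generate(...), the decode throughput can be reported as streamer.decoding_iterations / streamer.decoding_time, which excludes TTFT from the measurement.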

drunkcoding avatar Mar 25 '25 10:03 drunkcoding

@drunkcoding I am sorry to disturb you again, but in my case llama.cpp is about 70% faster. The result below is just one case from "tasksource/bigbench"; other outputs are similar.

Here is the result of llama.cpp:

In what follows, we provide short narratives, each of which illustrates a common proverb. Narrative: Vincent was a leather jacket wearing greasy haired tough guy. Everyone at school was scared of Vincent. One day Samantha was stranded when her car broke down. Vincent rode by on his motorcycle and offered her a ride home. The next day at school Samantha told all her friends that despite how tough Vincent made himself out to be, he was actually a very nice guy.This narrative is a good illustration of the following proverb:
Ah, the proverb that comes to mind after reading this narrative is: "Never judge a book by its cover." Vincent's tough exterior and fearsome reputation are in stark contrast to his act of kindness towards Samantha. By offering her a ride home, he defies the stereotype others might have of him, revealing a softer, more compassionate side. This narrative serves as a powerful illustration of the proverb's message, suggesting that appearances can be deceiving and that it's important to look beyond first impressions to truly understand a person's character.


llama_perf_sampler_print:    sampling time =       8.41 ms /   223 runs   (    0.04 ms per token, 26519.21 tokens per second)
llama_perf_context_print:        load time =   61746.90 ms
llama_perf_context_print: prompt eval time =   17044.86 ms /   887 tokens (   19.22 ms per token,    52.04 tokens per second)
llama_perf_context_print:        eval time =  189485.49 ms /  1134 runs   (  167.09 ms per token,     5.98 tokens per second)
llama_perf_context_print:       total time =  420482.06 ms /  2021 tokens

Here is the result of MoE-Infinity. Counting the decode throughput as 1 / 0.2835 gives about 3.527 tokens per second:

"Appearances are often deceiving."

In this narrative, Vincent appears to be a tough guy, but he actually shows kindness by offering Samantha a ride home. This demonstrates that appearances can be misleading, as Vincent's tough exterior does not reflect his true nature. The proverb "Appearances are often deceiving" highlights the idea that one should not judge a book by its cover or a person by their outward appearance, as it may not accurately represent their character 
Prefilling time: 1.997255563735962 seconds
Decoding time: None seconds
Decoding iterations: 100
Decoding time per iteration: 0.2835213017463684 seconds
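
For reference, the decode throughput quoted above can be recomputed directly from these StopWatch numbers (prefill time excluded), e.g.:

# Values taken from the output above.
decoding_iterations = 100
time_per_iteration = 0.2835213017463684  # seconds per decoded token

decode_throughput = 1.0 / time_per_iteration
print(f"Decode throughput: {decode_throughput:.3f} tokens/s")  # ~3.527 tokens/s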

wuooo339 avatar Mar 27 '25 07:03 wuooo339

On both systems the results seem counter-intuitive to me. For MoE-Infinity, which commit are you building on? There have been some recent updates. Would you help me reproduce the llama.cpp result by providing the conversion and test scripts?

drunkcoding avatar Mar 27 '25 16:03 drunkcoding

@drunkcoding I used the latest commit to build MoE-Infinity on the RTX 4080 Super (16GB), and the script I used to test MoE-Infinity is examples/interface_example.py. To begin the test, I first downloaded DeepSeek-V2-Lite-Chat from ModelScope:

modelscope download --model deepseek-ai/DeepSeek-V2-Lite-Chat README.md --local_dir $HOME/model/deepseek-v2-lite

I didn't write a separate llama.cpp test script because llama.cpp automatically reports throughput data after execution. Here are my test steps for evaluating llama.cpp (https://github.com/ggml-org/llama.cpp.git). First, cd into llama.cpp and build:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Then I converted DeepSeek-V2-Lite to the '.gguf' format required by llama.cpp:

python convert_hf_to_gguf.py $HOME/model/deepseek-v2-lite --outfile $HOME/model/deepseek-v2-lite/deepseek-v2-lite.gguf

Finally, I ran llama.cpp with the following command. -ngl 13 loads 13 layers of deepseek-v2-lite.gguf onto the GPU and keeps the remaining layers on the CPU; I chose 13 layers to fully utilize the RTX 4080 Super's 16GB of VRAM:

./build/bin/llama-cli -m $HOME/model/deepseek-v2-lite/deepseek-v2-lite.gguf -ngl 13

The model runs in interactive chat mode. After entering all the inputs and pressing Ctrl+C, the conversation session terminates and the total throughput is printed:

llama_perf_sampler_print:    sampling time =       8.41 ms /   223 runs   (    0.04 ms per token, 26519.21 tokens per second)
llama_perf_context_print:        load time =   61746.90 ms
llama_perf_context_print: prompt eval time =   17044.86 ms /   887 tokens (   19.22 ms per token,    52.04 tokens per second)
llama_perf_context_print:        eval time =  189485.49 ms /  1134 runs   (  167.09 ms per token,     5.98 tokens per second)
llama_perf_context_print:       total time =  420482.06 ms /  2021 tokens

From a practical usage perspective, llama.cpp clearly loads faster and generates responses more quickly.

wuooo339 avatar Mar 28 '25 02:03 wuooo339

@drunkcoding It seems that ExpertPredictor and the other expert-cache functions are never called. I changed the code by adding print("find_most_similar", time.time() - start_time) here after building your project with pip install -e ., but nothing shows up in my output. Also, I cannot find anywhere in the source code where ExpertPredictor is called at all. Is that expected?

class ExpertPredictor:
    def __init__(self, config: PretrainedConfig) -> None:
        self.num_layers, self.num_experts, self.num_encoder_layers = parse_moe_param(config)
        self.layer_decay_func = lambda x, l, L: -1 / (L + 1) * (x - l) + 1

    def add_tracer(self, tracer: ExpertTracer):
        self.tracer = tracer

    def predict(self, seq_id, expert_list, layer_idx):
        print("predict function:")  # debug print added for this test
        self.tracer.update_entry(seq_id, expert_list, layer_idx)
        current_entry = self.tracer.get_entry(seq_id)
        start_time = time.time()
        expert_matrix = self.tracer.find_most_similar(current_entry.matrix, layer_idx)
        print("find_most_similar", time.time() - start_time)  # debug print added for this test
        # expert_matrix = copy.deepcopy(entry)
        expert_matrix[:layer_idx, :] = 0
        for l in range(layer_idx, self.num_layers):
            expert_matrix[l] = (expert_matrix[l] + 1e-8) * self.layer_decay_func(l, layer_idx, self.num_layers)
        return expert_matrix
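
One quick way to check whether predict() is ever reached is to wrap it at runtime, for example with a hypothetical helper like the one below (the commented import path is an assumption and needs to be adjusted to wherever ExpertPredictor lives in your checkout):

import functools
import time

def trace_calls(cls, method_name):
    # Monkey-patch cls.method_name so every call prints its wall-clock duration.
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.time()
        result = original(self, *args, **kwargs)
        print(f"{cls.__name__}.{method_name} called, took {time.time() - start:.4f}s")
        return result

    setattr(cls, method_name, wrapper)

# Example usage (import path is an assumption):
# from moe_infinity.memory import ExpertPredictor
# trace_calls(ExpertPredictor, "predict")
# model.generate(...)  # if nothing is printed, predict() is never called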

wuooo339 avatar Mar 29 '25 07:03 wuooo339

The predictor is not applied, since the current Python version has too much overhead; it is better to use the cache only. I am currently working on that.

Thanks for the reproduction instructions, I will try this in the following days.

drunkcoding avatar Mar 29 '25 16:03 drunkcoding

Has there been any progress on this issue? thank you

dnnyyq avatar Oct 11 '25 07:10 dnnyyq

> Has there been any progress on this issue? thank you

There is a big gap in the kernel implementation compared to SOTA systems such as vLLM, SGLang, and Ollama. We are looking to fill this gap under the feature/fastinfer branch.

drunkcoding avatar Oct 14 '25 12:10 drunkcoding

> There is a big gap in the kernel implementation compared to SOTA systems such as vLLM, SGLang, and Ollama. We are looking to fill this gap under the feature/fastinfer branch.

I’m looking forward to your work. Thank you!

dnnyyq avatar Oct 15 '25 02:10 dnnyyq