Prefill speed is approximately 4~6 tokens/s for Qwen1.5-1.8B
Hi, mllm-qnn runs on my device, an OPPO Find X7 Ultra (Snapdragon 8 Gen 3, 16 GB RAM). However, the prefill speed for Qwen1.5-1.8B is only about 4-6 tokens per second, which diverges significantly from the roughly 1000 tokens per second claimed in the paper. In our tests, npuExe.run takes approximately 15 seconds to process 64 tokens (64 tokens / 15 s ≈ 4.3 tokens/s, consistent with the observed rate):
auto startTime = currentMs();
do {
    // 1: Prefill stage using NPU chunk execute
    npuExe.run(npu_ctx, &npuNet, {input_tensor});
    auto result = npuExe.result();
    int duration = (int) (currentMs() - startTime);
    std::cout << "input_tensor.sequence()=" << input_tensor.sequence() << std::endl;
    std::cout << "prefill cost: " << duration << "ms, prefill speed: " << input_tensor.sequence() * 1000 / duration << " token/s" << std::endl;
} while (/* loop condition elided in this excerpt */);
Could you provide some suggestions?