
Prefill speed is approximately 4~6 tokens/s for Qwen1.5-1.8B

Open mengllm opened this issue 6 months ago • 5 comments

Hi, mllm-qnn works on my device, an OPPO Find X7 Ultra (Snapdragon 8 Gen 3, 16 GB RAM). However, the prefill speed for Qwen1.5-1.8B is only approximately 4-6 tokens per second, which diverges significantly from the 1000 tokens per second claimed in the paper. Based on our tests, npuExe.run takes approximately 15 seconds to process 64 tokens:

        auto startTime = currentMs();

        // Prefill stage using NPU chunk execute
        npuExe.run(npu_ctx, &npuNet, {input});
        auto result = npuExe.result();

        int duration = (int) (currentMs() - startTime);
        std::cout << "input_tensor.sequence()=" << input_tensor.sequence() << std::endl;
        std::cout << "prefill cost: " << duration << "ms, prefill speed: "
                  << input_tensor.sequence() * 1000 / duration << " token/s" << std::endl;

Could you provide some suggestions?

mengllm · Aug 14 '24 03:08