
What will SmallThinker do in the prefill stage?

Open hebangwen opened this issue 5 months ago • 0 comments

Prerequisites

Before submitting your question, please ensure the following:

  • [x] I am running the latest version of PowerInfer. Development is rapid, and as of now, there are no tagged versions.
  • [x] I have carefully read and followed the instructions in the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).

Question Details

In the PowerInfer paper, the authors state that the model switches to dense mode in the prefill stage, so PowerInfer can achieve prefill latency comparable to llama.cpp.

However, in SmallThinker, the MoE/FFN weights are offloaded to disk. I have some questions about the prefill stage:

  1. I think the "DP-Groups Global Load Balance Loss" can address this issue for shorter inputs, since it explicitly encourages neighboring tokens to use the same experts. But for long sequences, will all experts end up loaded into memory because different groups require different experts?
  2. The sparsity seems to work in the decoding stage. Will the whole FFN be used in the prefill stage, i.e., is there no sparsity during prefill?
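
To illustrate the concern in question 1, here is a small back-of-the-envelope simulation (not PowerInfer or SmallThinker code; the expert counts, top-k, and random routing are all assumptions for illustration): even if each token only activates a few experts, the union of experts touched across a prompt grows quickly with its length, so a long prefill can end up requiring (nearly) every expert in memory.

```python
# Hypothetical sketch: per-token top-k routing over a pool of experts.
# Each token is sparse, but the *union* of activated experts across
# many prefill tokens approaches the full expert set.
import random

def experts_touched(num_tokens, num_experts=64, top_k=2, seed=0):
    rng = random.Random(seed)
    touched = set()
    for _ in range(num_tokens):
        # stand-in for the learned router: each token picks top_k experts
        touched.update(rng.sample(range(num_experts), top_k))
    return len(touched)

for n in (8, 64, 512):
    print(f"prompt length {n:4d} -> {experts_touched(n)} experts touched")
```

A load-balance loss that groups neighboring tokens onto the same experts would slow this growth for short inputs, but the simulation suggests why, for long sequences, sparsity per token does not by itself bound the total working set.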

hebangwen avatar Aug 08 '25 11:08 hebangwen