Load all MoE experts during warmup
This PR is a somewhat crude hack that makes it possible to load all experts of MoE models during warmup.
The hacky part is the warmup detection - I explicitly examine the ubatch tokens to detect the warmup pass. I couldn't find a better way to do it; let me know if one exists.
If the model is warming up, n_expert_used is set to n_expert, which causes all experts to be loaded into memory during warmup.
Fixes #11163
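To make the mechanism concrete, here is a rough sketch of the idea with made-up types and field names (hypothetical stand-ins, not the actual llama.cpp internals): detect the tiny BOS+EOS warmup ubatch and, while it is being processed, route through n_expert experts instead of n_expert_used.

```cpp
// Illustrative sketch only; the structs and fields below are hypothetical
// stand-ins, not the real llama.cpp types.
#include <cstdint>

struct ubatch_view {
    const int32_t * token;    // token ids in the micro-batch
    int32_t         n_tokens; // number of tokens in the micro-batch
};

struct moe_hparams {
    int32_t n_expert;         // total number of experts in the model
    int32_t n_expert_used;    // experts normally selected per token
};

// The hacky warmup detection: the warmup batch is the tiny BOS+EOS batch
// submitted right after the model is loaded.
static bool is_warmup_ubatch(const ubatch_view & ub, int32_t tok_bos, int32_t tok_eos) {
    return ub.n_tokens == 2 && ub.token[0] == tok_bos && ub.token[1] == tok_eos;
}

// During graph build: if warming up, select all experts so every expert
// tensor is read from disk (and cached) up front.
static int32_t n_experts_to_use(const moe_hparams & hp, bool warming_up) {
    return warming_up ? hp.n_expert : hp.n_expert_used;
}
```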
A quick test with R1 on llama-server shows all experts loaded into memory during warmup. Inference started immediately once the web interface was available. I will try a test on a non-MoE large model as well to make sure there are no regressions in that case. Thanks for this fix!
I can confirm this is working for me, and it loads a couple of times faster than letting it warm up "naturally" (I can see it using ~2.5 cores instead of ~0.5, so possibly due to avoiding random access on the SSD?).
The hacky part is the warmup detection - I explicitly examine the ubatch tokens to detect the warmup pass. I couldn't find a better way to do it; let me know if one exists.
I'll consider adding proper support for this in #11213.
@ggerganov if you are going to work on warmup then take a look at this: #11733
TLDR: Using a 1-token sequence (instead of the current 2-token BOS+EOS batch) for warmup fixes a token generation performance bottleneck (+80% tg t/s with llama-3.1 70b f16) on dual Epyc systems.
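Roughly, the change suggested there would look something like this (sketch only, not the actual patch from #11733; assumes a llama_context * ctx and the model's BOS/EOS token ids are already available):

```cpp
// Warm up with a single-token batch instead of the two-token {BOS, EOS} batch.
std::vector<llama_token> warmup_tokens = { token_bos };  // was { token_bos, token_eos }

llama_batch warmup_batch = llama_batch_get_one(warmup_tokens.data(),
                                               (int32_t) warmup_tokens.size());
llama_decode(ctx, warmup_batch);  // warmup decode pass
```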
@fairydreaming Any chance you can resolve the conflicts for this PR?
I was just about to do the final tests on the MLA PR but need this and https://github.com/ggml-org/llama.cpp/pull/11397 to do it! :)
@jukofyork It's not a matter of resolving the conflicts. Since #12181 is now merged, the code on which I based this little hack is no longer there. It would basically have to be reimplemented from scratch on top of the current code.
I guess I will close it for now, as it's no longer a valid solution.
@fairydreaming yeah, I realised after asking just how extensive the changes have been! 😮
I've just resorted to capturing a copy of master from before all the changes and am going to wait until things settle down.
I reimplemented this on the current master. This time I added a proper API call for enabling warmup mode:
LLAMA_API void llama_set_warmup(struct llama_context * ctx, bool warmup);
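For reference, a minimal caller-side sketch of how this would be used (assuming a llama_context * ctx and a small warmup batch have already been prepared; error handling omitted):

```cpp
llama_set_warmup(ctx, true);      // next decode routes through all experts
llama_decode(ctx, warmup_batch);  // warmup pass touches every expert tensor
llama_set_warmup(ctx, false);     // restore normal top-k expert routing
```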