llama.cpp

Load all MoE experts during warmup

fairydreaming opened this issue 10 months ago

This PR is a somewhat crude hack that allows all experts in MoE models to be loaded during warmup.

The hacky part is the warmup detection - I explicitly examine the ubatch tokens to detect warmup. I couldn't find a better way to do it; let me know if one exists.

If the model is warming up, n_expert_used is set to n_expert, which causes all experts to be loaded into memory during warmup.
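Roughly, the idea looks like this (a simplified sketch, not the actual diff; the warmup check and the internal field/helper names are illustrative, and the real change lives in the graph-building code):

```cpp
// Simplified sketch, not the actual patch. Internal names (llama_ubatch fields,
// vocab helpers) are illustrative. Warmup batches contain only special tokens
// (BOS followed by EOS), which is what the heuristic keys on.
static bool is_warmup_ubatch(const llama_ubatch & ubatch, const llama_vocab & vocab) {
    for (uint32_t i = 0; i < ubatch.n_tokens; ++i) {
        if (ubatch.token[i] != vocab.token_bos() && ubatch.token[i] != vocab.token_eos()) {
            return false;
        }
    }
    return ubatch.n_tokens > 0;
}

// ... when building the MoE FFN for this ubatch:
int32_t n_expert_used = hparams.n_expert_used;
if (is_warmup_ubatch(ubatch, vocab)) {
    n_expert_used = hparams.n_expert;   // route through every expert so they all get paged in
}
```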

Fixes #11163

fairydreaming avatar Feb 01 '25 09:02 fairydreaming

A quick test with R1 on llama-server shows all experts loaded into memory during warmup. Inference started immediately once the web interface was available. I will try a test on a non-MoE large model as well to make sure there are no regressions in that case. Thanks for this fix!

cpumaxx avatar Feb 03 '25 17:02 cpumaxx

I can confirm this is working for me, and it loads a couple of times faster than letting it warm up "naturally". I can see it using ~2.5 cores instead of ~0.5, so the speedup is possibly due to avoiding random access on the SSD.

jukofyork avatar Feb 06 '25 21:02 jukofyork

The hacky part is the warmup detection - I explicitly examine the ubatch tokens to detect the warmup. I couldn't find a better way to do it, let me know if one exists.

I'll consider adding proper support for this in https://github.com/ggerganov/llama.cpp/pull/11213.

ggerganov avatar Feb 07 '25 08:02 ggerganov

The hacky part is the warmup detection - I explicitly examine the ubatch tokens to detect the warmup. I couldn't find a better way to do it, let me know if one exists.

I'll consider adding proper support for this in #11213.

@ggerganov if you are going to work on warmup then take a look at this: #11733

TLDR: Using a 1-token sequence (instead of the current 2 tokens, BOS and EOS) in the warmup batch fixes a token generation performance bottleneck (+80% tg t/s with llama-3.1 70B f16) on dual Epyc systems.
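For illustration, the difference is roughly the following (a sketch only, not the exact common.cpp code, which also handles models without BOS/EOS; API names may differ by version):

```cpp
// Illustrative sketch of the warmup-batch change described in #11733.
// The point is the batch size: one token instead of two.
std::vector<llama_token> tmp;
tmp.push_back(llama_vocab_bos(vocab));      // 1-token warmup batch
// tmp.push_back(llama_vocab_eos(vocab));   // stock warmup also adds EOS -> 2-token batch
llama_decode(ctx, llama_batch_get_one(tmp.data(), tmp.size()));
```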

fairydreaming avatar Feb 09 '25 11:02 fairydreaming

@fairydreaming Any chance you can resolve the conflicts for this PR?

I was just about to do the final tests on the MLA PR but need this and https://github.com/ggml-org/llama.cpp/pull/11397 to do it! :)

jukofyork avatar Mar 13 '25 19:03 jukofyork

@jukofyork It's not a matter of resolving the conflicts. Since #12181 is now merged, the code on which I based this little hack is no longer there. It would basically have to be reimplemented from scratch on top of the current code.

I guess I will close it for now, as it's no longer a valid solution.

fairydreaming avatar Mar 13 '25 20:03 fairydreaming

@fairydreaming yeah, I realised after asking just how extensive the changes have been! 😮

I've just resorted to keeping a copy of master from before all the changes and am going to wait until things settle down.

jukofyork avatar Mar 13 '25 23:03 jukofyork

I reimplemented this on the current master. This time I added a proper API call for enabling warmup mode:

LLAMA_API void llama_set_warmup(struct llama_context * ctx, bool warmup);
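For example, it could be used along these lines (a hedged usage sketch, not code from the PR; it assumes a llama.cpp build that has llama_set_warmup, and ctx/vocab are set up elsewhere; exact helper names may differ by version):

```cpp
#include "llama.h"

// Sketch: run one dummy decode in warmup mode so the MoE graph touches all experts,
// then switch back to normal inference.
static void warmup_all_experts(llama_context * ctx, const llama_vocab * vocab) {
    llama_set_warmup(ctx, true);                  // next decode runs in warmup mode

    llama_token bos = llama_vocab_bos(vocab);     // any valid token will do for warmup
    if (llama_decode(ctx, llama_batch_get_one(&bos, 1)) != 0) {
        // warmup failed; experts will still be paged in lazily during normal inference
    }

    llama_set_warmup(ctx, false);
    // remember to clear the KV cache before real inference (exact API name varies by version)
}
```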

fairydreaming avatar Mar 14 '25 09:03 fairydreaming