fairydreaming

Results: 7 issues and pull requests by fairydreaming

Fixes #6877. Contains the following changes:
- increases the maximum number of experts from 60 to 128
- adds a new tensor type, FFN_NORM_EXP (for a normalization block before MoE that runs...)

enhancement
review complexity: medium

I think there is a bug in the calculation of max_score in unigram_model.cc: https://github.com/google/sentencepiece/blob/6225e08edb2577757163b3f5dbba4c0b670ef445/src/unigram_model.cc#L657-L664 Since FLT_MIN is a very small positive number (on my system it is 1.17549435e-38) and token scores are...

bug
Will fix in next release

For example, in ggml.c the implementations of ops related to flash attention declare a variable D and use it as both the dimension of the value vector and the dimension of the key/query vector. This will...

enhancement

This PR introduces various optimizations for the DeepSeek V2/V3 implementation:
- caching latent representations instead of full key/value vectors
- replacing the "naive" attention implementation with one based on intermediate representations (https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py)...

python
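As a rough illustration of why caching latents pays off, here is a back-of-the-envelope comparison using DeepSeek-V2-like dimensions (numbers taken from its published config; treat them as illustrative, not as this PR's exact accounting):

```python
# Per-token, per-layer KV-cache size comparison (in float elements).
n_heads = 128          # attention heads
qk_head_dim = 192      # per-head key/query dimension (128 nope + 64 rope)
v_head_dim = 128       # per-head value dimension
kv_lora_rank = 512     # dimension of the compressed KV latent
rope_dim = 64          # decoupled rotary key part, shared across heads

# Naive cache: full keys and values for every head.
full_kv = n_heads * (qk_head_dim + v_head_dim)   # 40960 floats

# MLA cache: one latent vector plus the shared rotary key part.
latent_kv = kv_lora_rank + rope_dim              # 576 floats

print(full_kv, latent_kv, full_kv / latent_kv)   # roughly 71x smaller
```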

I have temporary access to a dual Epyc Turin system and found a little trick that restores normal token generation performance in llama.cpp on dual Epyc systems. The trick is...

This PR is a somewhat crude hack that allows loading all experts in MoE models during warmup. The hacky part is the warmup detection: I explicitly examine the...

This PR contains an experimental NUMA-aware KV cache buffer implementation so that people can try it and check if it improves performance on multi-CPU systems. IMPORTANT: this mechanism works only...

ggml