fairydreaming
Fixes #6877. Contains the following changes:
- increases the maximum number of experts from 60 to 128
- adds a new tensor type FFN_NORM_EXP (for a normalization block before MoE that runs...
I think there is a bug in the calculation of max_score in unigram_model.cc: https://github.com/google/sentencepiece/blob/6225e08edb2577757163b3f5dbba4c0b670ef445/src/unigram_model.cc#L657-L664 As FLT_MIN is a very small positive number (on my system it's 1.17549435e-38) and token scores are...
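Below is a minimal, self-contained C++ sketch of the pitfall (illustrative only, not the sentencepiece code): token scores are log-probabilities and therefore negative, so a running maximum seeded with FLT_MIN is never updated; seeding with -FLT_MAX behaves correctly.

```cpp
#include <cfloat>
#include <cstdio>
#include <vector>

int main() {
    // Unigram token scores are log-probabilities, i.e. negative numbers.
    std::vector<float> scores = {-10.5f, -3.2f, -7.8f};

    // Buggy pattern: FLT_MIN is the smallest *positive* normalized float
    // (~1.18e-38), not the most negative float, so no negative score ever
    // exceeds it and the "maximum" stays stuck at FLT_MIN.
    float buggy_max = FLT_MIN;
    for (float s : scores) if (s > buggy_max) buggy_max = s;

    // Correct pattern: seed the running maximum with the most negative
    // representable float.
    float max_score = -FLT_MAX;
    for (float s : scores) if (s > max_score) max_score = s;

    std::printf("buggy: %g, correct: %g\n", buggy_max, max_score);
    // prints: buggy: 1.17549e-38, correct: -3.2
}
```

In idiomatic C++ the seed would be std::numeric_limits<float>::lowest(), which is exactly -FLT_MAX.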
For example, in ggml.c the implementations of the ops related to flash attention declare a variable D and use it as both the dimension of the value vector and the dimension of the key/query vector. This will...
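To make the distinction concrete, here is a small stand-alone sketch (the function name attn_one_query and the names DK/DV are made up for illustration, this is not ggml code): the attention scores depend on the key/query dimension, while the length of the output depends on the value dimension, so a single shared D is only correct when the two happen to be equal.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Single-query attention with distinct dimensions:
// q: [DK], keys: n rows of [DK], values: n rows of [DV], out: [DV].
void attn_one_query(const float* q, const float* keys, const float* values,
                    std::size_t n, std::size_t DK, std::size_t DV, float* out) {
    std::vector<float> w(n);
    float max_s = -INFINITY;
    for (std::size_t i = 0; i < n; ++i) {          // scores use DK
        float s = 0.0f;
        for (std::size_t d = 0; d < DK; ++d) s += q[d] * keys[i*DK + d];
        w[i] = s / std::sqrt((float)DK);
        if (w[i] > max_s) max_s = w[i];
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {          // softmax over scores
        w[i] = std::exp(w[i] - max_s);
        sum += w[i];
    }
    for (std::size_t d = 0; d < DV; ++d) out[d] = 0.0f;
    for (std::size_t i = 0; i < n; ++i)            // output length uses DV
        for (std::size_t d = 0; d < DV; ++d) out[d] += (w[i]/sum) * values[i*DV + d];
}

int main() {
    const std::size_t n = 2, DK = 3, DV = 2;       // DK != DV on purpose
    float q[DK]      = {1, 0, 0};
    float keys[n*DK] = {1, 0, 0,   0, 1, 0};
    float vals[n*DV] = {1, 2,   3, 4};
    float out[DV];
    attn_one_query(q, keys, vals, n, DK, DV, out);
    std::printf("out = [%g, %g]\n", out[0], out[1]);
}
```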
This PR introduces various optimizations for the DeepSeek V2/V3 implementation:
- caching latent representations instead of full key/value vectors
- replacing the "naive" attention implementation with an implementation based on intermediate representations (https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py)...
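As a rough back-of-the-envelope on why caching latents helps, assuming the published DeepSeek V2 attention dimensions (n_head = 128, qk_nope_head_dim = 128, qk_rope_head_dim = 64, v_head_dim = 128, kv_lora_rank = 512; these numbers come from the model config, not from this PR):

```cpp
#include <cstdio>

int main() {
    // Assumed DeepSeek V2 attention dims (published config, not this PR).
    const int n_head = 128, qk_nope = 128, qk_rope = 64, v_dim = 128;
    const int kv_lora_rank = 512;

    // Naive MHA cache: full per-head K (nope + rope parts) and V,
    // per token, per layer.
    const int naive  = n_head * ((qk_nope + qk_rope) + v_dim);  // 40960 values
    // Latent (MLA) cache: one compressed KV latent plus the shared
    // RoPE part of K.
    const int latent = kv_lora_rank + qk_rope;                  // 576 values

    std::printf("naive: %d values/token/layer, latent: %d, ratio: %.1fx\n",
                naive, latent, (double)naive / latent);         // ~71.1x
}
```

Under these assumptions the latent cache holds roughly 71x fewer values per token per layer than caching full key/value vectors.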
I have temporary access to a dual Epyc Turin system and found a little trick that restores normal token generation performance in llama.cpp on dual Epyc systems. The trick is...
This PR is a somewhat crude hack that allows loading all experts in MoE models during warmup. The hacky part is the warmup detection - I explicitly examine the...
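Since the exact check is cut off above, the following is only a plausible shape of such a heuristic, not necessarily the PR's code: llama.cpp's default warmup decode feeds a tiny batch of special tokens (BOS/EOS), so a batch containing nothing else can be treated as warmup, and the number of active experts can then be overridden so that every expert's weights get touched once.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstdio>

using llama_token = std::int32_t;

// Hypothetical heuristic (illustrative, not llama.cpp API): treat a batch
// as warmup when it consists only of the special BOS/EOS tokens that the
// default warmup decode feeds to the model.
bool is_warmup_batch(const llama_token* tokens, std::size_t n,
                     llama_token bos, llama_token eos) {
    if (n == 0 || n > 2) return false;
    for (std::size_t i = 0; i < n; ++i)
        if (tokens[i] != bos && tokens[i] != eos) return false;
    return true;
}

int main() {
    const llama_token bos = 1, eos = 2;
    const llama_token warm[2] = {bos, eos};
    // When detected, the MoE routing would use all experts, e.g.
    // n_used = warmup ? n_expert : n_expert_used, so every expert's
    // weights get paged in during warmup.
    std::printf("warmup? %d\n", is_warmup_batch(warm, 2, bos, eos)); // 1
}
```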
This PR contains an experimental NUMA-aware KV cache buffer implementation so that people can try it and check if it improves performance on multi-CPU systems. IMPORTANT: this mechanism works only...
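The PR text is truncated, so here is only a generic Linux/libnuma sketch of the underlying idea, not the PR's implementation: place each slice of a large buffer on a specific NUMA node with numa_alloc_onnode, so that threads pinned to that node read and write local memory instead of paying remote-access costs.

```cpp
// Linux + libnuma only; build with: g++ numa_sketch.cpp -lnuma
#include <numa.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    const std::size_t slice_size = 64ull * 1024 * 1024;  // 64 MiB per node
    const int n_nodes = numa_max_node() + 1;

    // Bind one slice of the buffer to each node, so threads pinned to a
    // node operate on node-local pages.
    std::vector<void*> slices(n_nodes);
    for (int node = 0; node < n_nodes; ++node) {
        slices[node] = numa_alloc_onnode(slice_size, node);
        if (!slices[node]) {
            std::fprintf(stderr, "allocation on node %d failed\n", node);
            return 1;
        }
    }
    std::printf("allocated %d node-local slices of %zu bytes\n",
                n_nodes, slice_size);

    for (int node = 0; node < n_nodes; ++node)
        numa_free(slices[node], slice_size);
}
```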