fairydreaming
Fixes #6877. Contains the following changes:
- increases the maximum number of experts from 60 to 128
- adds a new tensor type FFN_NORM_EXP (for a normalization block before MoE that runs...
I think there is a bug in the calculation of max_score in unigram_model.cc: https://github.com/google/sentencepiece/blob/6225e08edb2577757163b3f5dbba4c0b670ef445/src/unigram_model.cc#L657-L664 As FLT_MIN is a very small positive number (on my system it's 1.17549435e-38) and token scores are...
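Below is a minimal, self-contained C++ sketch of the pitfall (illustrative only, not the sentencepiece code): token scores are log-probabilities and therefore negative, so a running maximum seeded with FLT_MIN is never updated; seeding with -FLT_MAX behaves correctly.

```cpp
#include <cfloat>
#include <cstdio>
#include <vector>

int main() {
    // Unigram token scores are log-probabilities, i.e. negative numbers.
    std::vector<float> scores = {-10.5f, -3.2f, -7.8f};

    // Buggy pattern: FLT_MIN is the smallest *positive* normalized float
    // (~1.18e-38), not the most negative float, so no negative score ever
    // exceeds it and the "maximum" stays stuck at FLT_MIN.
    float buggy_max = FLT_MIN;
    for (float s : scores) if (s > buggy_max) buggy_max = s;

    // Correct pattern: seed the running maximum with the most negative
    // representable float.
    float max_score = -FLT_MAX;
    for (float s : scores) if (s > max_score) max_score = s;

    std::printf("buggy: %g, correct: %g\n", buggy_max, max_score);
    // prints: buggy: 1.17549e-38, correct: -3.2
}
```

In idiomatic C++ the seed would be std::numeric_limits<float>::lowest(), which is exactly -FLT_MAX.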
For example, in ggml.c the implementations of the ops related to flash attention declare a variable D and use it as both the dimension of the value vector and the dimension of the key/query vector. This will...
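To make the distinction concrete, here is a small stand-alone sketch (the function name attn_one_query and the names DK/DV are made up for illustration, this is not ggml code): the attention scores depend on the key/query dimension, while the length of the output depends on the value dimension, so a single shared D is only correct when the two happen to be equal.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Single-query attention with distinct dimensions:
// q: [DK], keys: n rows of [DK], values: n rows of [DV], out: [DV].
void attn_one_query(const float* q, const float* keys, const float* values,
                    std::size_t n, std::size_t DK, std::size_t DV, float* out) {
    std::vector<float> w(n);
    float max_s = -INFINITY;
    for (std::size_t i = 0; i < n; ++i) {          // scores use DK
        float s = 0.0f;
        for (std::size_t d = 0; d < DK; ++d) s += q[d] * keys[i*DK + d];
        w[i] = s / std::sqrt((float)DK);
        if (w[i] > max_s) max_s = w[i];
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {          // softmax over scores
        w[i] = std::exp(w[i] - max_s);
        sum += w[i];
    }
    for (std::size_t d = 0; d < DV; ++d) out[d] = 0.0f;
    for (std::size_t i = 0; i < n; ++i)            // output length uses DV
        for (std::size_t d = 0; d < DV; ++d) out[d] += (w[i]/sum) * values[i*DV + d];
}

int main() {
    const std::size_t n = 2, DK = 3, DV = 2;       // DK != DV on purpose
    float q[DK]      = {1, 0, 0};
    float keys[n*DK] = {1, 0, 0,   0, 1, 0};
    float vals[n*DV] = {1, 2,   3, 4};
    float out[DV];
    attn_one_query(q, keys, vals, n, DK, DV, out);
    std::printf("out = [%g, %g]\n", out[0], out[1]);
}
```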
This PR introduces various optimizations for the DeepSeek V2/V3 implementation:
- caching latent representations instead of full key/value vectors
- replacing the "naive" attention implementation with an implementation based on intermediate representations (https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py)...
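As a rough back-of-the-envelope on why caching latents helps, assuming the published DeepSeek V2 attention dimensions (n_head = 128, qk_nope_head_dim = 128, qk_rope_head_dim = 64, v_head_dim = 128, kv_lora_rank = 512; these numbers come from the model config, not from this PR):

```cpp
#include <cstdio>

int main() {
    // Assumed DeepSeek V2 attention dims (published config, not this PR).
    const int n_head = 128, qk_nope = 128, qk_rope = 64, v_dim = 128;
    const int kv_lora_rank = 512;

    // Naive MHA cache: full per-head K (nope + rope parts) and V,
    // per token, per layer.
    const int naive  = n_head * ((qk_nope + qk_rope) + v_dim);  // 40960 values
    // Latent (MLA) cache: one compressed KV latent plus the shared
    // RoPE part of K.
    const int latent = kv_lora_rank + qk_rope;                  // 576 values

    std::printf("naive: %d values/token/layer, latent: %d, ratio: %.1fx\n",
                naive, latent, (double)naive / latent);         // ~71.1x
}
```

Under these assumptions the latent cache holds roughly 71x fewer values per token per layer than caching full key/value vectors.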
I have temporary access to a dual Epyc Turin system and found a little trick that restores normal token generation performance in llama.cpp on dual Epyc systems. The trick is...
This PR is a somewhat crude hack that allows loading all experts in MoE models during warmup. The hacky part is the warmup detection - I explicitly examine the...
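Since the exact check is cut off above, the following is only a plausible shape of such a heuristic, not necessarily the PR's code: llama.cpp's default warmup decode feeds a tiny batch of special tokens (BOS/EOS), so a batch containing nothing else can be treated as warmup, and the number of active experts can then be overridden so that every expert's weights get touched once.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstdio>

using llama_token = std::int32_t;

// Hypothetical heuristic (illustrative, not llama.cpp API): treat a batch
// as warmup when it consists only of the special BOS/EOS tokens that the
// default warmup decode feeds to the model.
bool is_warmup_batch(const llama_token* tokens, std::size_t n,
                     llama_token bos, llama_token eos) {
    if (n == 0 || n > 2) return false;
    for (std::size_t i = 0; i < n; ++i)
        if (tokens[i] != bos && tokens[i] != eos) return false;
    return true;
}

int main() {
    const llama_token bos = 1, eos = 2;
    const llama_token warm[2] = {bos, eos};
    // When detected, the MoE routing would use all experts, e.g.
    // n_used = warmup ? n_expert : n_expert_used, so every expert's
    // weights get paged in during warmup.
    std::printf("warmup? %d\n", is_warmup_batch(warm, 2, bos, eos)); // 1
}
```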
This PR contains an experimental NUMA-aware KV cache buffer implementation so that people can try it and check if it improves performance on multi-CPU systems. IMPORTANT: this mechanism works only...
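The PR text is truncated, so here is only a generic Linux/libnuma sketch of the underlying idea, not the PR's implementation: place each slice of a large buffer on a specific NUMA node with numa_alloc_onnode, so that threads pinned to that node read and write local memory instead of paying remote-access costs.

```cpp
// Linux + libnuma only; build with: g++ numa_sketch.cpp -lnuma
#include <numa.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    const std::size_t slice_size = 64ull * 1024 * 1024;  // 64 MiB per node
    const int n_nodes = numa_max_node() + 1;

    // Bind one slice of the buffer to each node, so threads pinned to a
    // node operate on node-local pages.
    std::vector<void*> slices(n_nodes);
    for (int node = 0; node < n_nodes; ++node) {
        slices[node] = numa_alloc_onnode(slice_size, node);
        if (!slices[node]) {
            std::fprintf(stderr, "allocation on node %d failed\n", node);
            return 1;
        }
    }
    std::printf("allocated %d node-local slices of %zu bytes\n",
                n_nodes, slice_size);

    for (int node = 0; node < n_nodes; ++node)
        numa_free(slices[node], slice_size);
}
```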