fairydreaming
> > This could be useful ... and for selection of attention implementation for DEEPSEEK2 architecture (naive vs MLA - now they directly map to two different llama context types...
I did some local tests of a Q8_0 8B model in llama.cpp with a 4096 context size, and with a low temperature setting (0.01) it often enters generation loops, repeating the same sentences...
I have something more or less working here: https://github.com/fairydreaming/llama.cpp/tree/minimax-text-01

Some major remaining problems:
- It currently doesn't support multiple token sequences. My current implementation of lightning attention simply ignores the...
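For context on why multiple token sequences are tricky here: lightning attention is a linear-attention variant that carries a recurrent state across tokens, so every independent sequence needs its own copy of that state. The snippet below is only a conceptual sketch of that recurrence (a decayed sum of k v^T outer products read out by the query), with made-up names and shapes; it is not the branch's actual kernel.

```cpp
// Conceptual sketch only (not the actual kernel in the branch): linear/lightning
// attention keeps a running state S of shape [d_k x d_v] per head, updated once
// per token as S = decay * S + k v^T and read out as o = q^T S. If tokens from
// different sequences share one state, they corrupt each other's history, which
// is why a multi-sequence batch needs one state per sequence id.
#include <cstddef>
#include <cstdio>
#include <vector>

struct LinearAttnState {
    size_t d_k, d_v;
    std::vector<float> S; // d_k * d_v accumulator of decayed k v^T outer products

    LinearAttnState(size_t dk, size_t dv) : d_k(dk), d_v(dv), S(dk * dv, 0.0f) {}

    // Process one token of one sequence: q, k have length d_k; v, out have length d_v.
    void step(const float * q, const float * k, const float * v, float decay, float * out) {
        for (size_t i = 0; i < d_k; ++i)
            for (size_t j = 0; j < d_v; ++j)
                S[i * d_v + j] = decay * S[i * d_v + j] + k[i] * v[j];

        for (size_t j = 0; j < d_v; ++j) {
            float acc = 0.0f;
            for (size_t i = 0; i < d_k; ++i)
                acc += q[i] * S[i * d_v + j];
            out[j] = acc;
        }
    }
};

int main() {
    const size_t d_k = 2, d_v = 2, n_seq = 2;
    // one recurrent state per sequence -- the part a single shared state gets wrong
    std::vector<LinearAttnState> states(n_seq, LinearAttnState(d_k, d_v));

    float q[d_k] = {1.0f, 0.0f}, k[d_k] = {0.5f, 0.5f}, v[d_v] = {1.0f, 2.0f}, out[d_v];
    for (size_t s = 0; s < n_seq; ++s) {
        states[s].step(q, k, v, 0.9f, out);
        printf("seq %zu: out = [%f, %f]\n", s, out[0], out[1]);
    }
    return 0;
}
```

The only point of the example is that the state has to be kept per sequence id; how the branch actually stores and updates it is a separate question.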
I noticed a problem with the model "eating" some words when asked to repeat text (Q5_K_M quant). Can someone with more RAM (like 512GB or 1TB) test this model with...
I found out about `llama_sbatch::split_equal`, so my branch now supports inference of multiple token sequences with llama-server. Prompt caching should be disabled for now, as it doesn't work correctly. Run the server...
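A note on what an "equal" split buys you here: a recurrent-state implementation advances one state per sequence, so it is convenient to feed it micro-batches in which every sequence that appears contributes the same number of tokens. The toy function below is only my loose illustration of that idea, with hypothetical names and no relation to the actual `llama_sbatch` code.

```cpp
// Toy illustration only -- not the real llama_sbatch::split_equal. Cut a batch of
// (seq_id, token) entries into micro-batches where every sequence that appears
// contributes the same number of tokens, so per-sequence recurrent states can be
// advanced in lockstep.
#include <cstdint>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

struct Ubatch {
    std::map<int, std::vector<int>> tokens_per_seq; // seq_id -> tokens in this micro-batch
};

static std::vector<Ubatch> split_equal_toy(const std::vector<std::pair<int, int>> & batch) {
    // group tokens by sequence id, preserving order
    std::map<int, std::vector<int>> per_seq;
    for (const auto & [seq, tok] : batch) per_seq[seq].push_back(tok);

    std::vector<Ubatch> out;
    std::map<int, size_t> off; // how many tokens of each sequence were consumed so far

    while (true) {
        // the shortest remaining sequence determines this micro-batch's per-sequence share
        size_t take = SIZE_MAX;
        for (const auto & [seq, toks] : per_seq) {
            size_t rem = toks.size() - off[seq];
            if (rem > 0 && rem < take) take = rem;
        }
        if (take == SIZE_MAX) break; // everything consumed

        Ubatch ub;
        for (const auto & [seq, toks] : per_seq) {
            size_t rem = toks.size() - off[seq];
            if (rem == 0) continue;
            ub.tokens_per_seq[seq].assign(toks.begin() + off[seq], toks.begin() + off[seq] + take);
            off[seq] += take;
        }
        out.push_back(std::move(ub));
    }
    return out;
}

int main() {
    // seq 0 has 3 tokens, seq 1 has 5: first ubatch takes 3 from each,
    // second ubatch takes the remaining 2 tokens of seq 1
    std::vector<std::pair<int, int>> batch = {{0,10},{0,11},{0,12},{1,20},{1,21},{1,22},{1,23},{1,24}};
    auto ubs = split_equal_toy(batch);
    for (size_t i = 0; i < ubs.size(); ++i) {
        printf("ubatch %zu:", i);
        for (const auto & [seq, toks] : ubs[i].tokens_per_seq)
            printf(" seq %d x %zu", seq, toks.size());
        printf("\n");
    }
    return 0;
}
```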
> [@fairydreaming](https://github.com/fairydreaming) I tested your branch with Q5_K_M. On my setup I see some missing "of". Tested on an AMD EPYC with 768 GB RAM. Can you share your full test command to...
@Nondzu OK, if it happens on Q8_0, then there's likely still some problem with my inference code, as I didn't observe this behavior via the OpenRouter API. Thanks for testing!
@ClarkChin08 I'm interested in your experiments with tensor parallelism on Xeon CPUs. Can you share more details about the hardware you used? What was the latency between cluster nodes?...
@kyteinsky Inference with T5-like models requires using some new API functions (e.g. `llama_encode()`). Without these, not only llama_cpp_python but all other software based on llama.cpp won't be able to...
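For readers unfamiliar with the encoder-decoder path: below is a rough sketch of the flow those functions enable, i.e. run `llama_encode()` once over the source tokens, then drive the decoder with `llama_decode()` starting from the decoder-start token. It assumes the llama.cpp C API of roughly that period; function names and signatures have changed across versions, so treat it as an illustration rather than a drop-in program.

```cpp
// Sketch of encoder-decoder (T5-style) inference with the llama.cpp C API.
// Exact signatures vary between llama.cpp versions; check llama.h for your build.
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 3) { fprintf(stderr, "usage: %s model.gguf \"prompt\"\n", argv[0]); return 1; }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(argv[1], mparams);

    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // tokenize the source text for the encoder
    std::string prompt = argv[2];
    std::vector<llama_token> inp(prompt.size() + 8);
    int n = llama_tokenize(model, prompt.c_str(), (int) prompt.size(),
                           inp.data(), (int) inp.size(),
                           /*add_special*/ true, /*parse_special*/ false);
    if (n < 0) { fprintf(stderr, "tokenization failed\n"); return 1; }
    inp.resize(n);

    // 1) run the encoder once over the whole source sequence
    llama_batch enc = llama_batch_get_one(inp.data(), (int) inp.size());
    if (llama_encode(ctx, enc) != 0) { fprintf(stderr, "llama_encode failed\n"); return 1; }

    // 2) start the decoder from the model's decoder-start token (fall back to BOS)
    llama_token dec_start = llama_model_decoder_start_token(model);
    if (dec_start == -1) dec_start = llama_token_bos(model);

    llama_batch dec = llama_batch_get_one(&dec_start, 1);
    if (llama_decode(ctx, dec) != 0) { fprintf(stderr, "llama_decode failed\n"); return 1; }

    // ...sampling loop feeding generated tokens back through llama_decode() goes here...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

The key difference from decoder-only models is the extra `llama_encode()` pass and the cross-attention state it fills in, which is what wrappers such as llama_cpp_python would need to expose before they can run T5-style models.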
~~@kyteinsky Sure, it is the python wrapper to use, but it's a completely separate project from llama.cpp. It's not part of the llama.cpp project, so I think feature requests like...