fairydreaming
> > This could be useful ... and for selection of attention implementation for DEEPSEEK2 architecture (naive vs MLA - now they directly map to two different llama context types...
I did some local tests of a Q8_0 8B model in llama.cpp with a 4096 context size, and with a low temperature setting (0.01) it often enters generation loops, repeating the same sentences...
I have something more or less working here: https://github.com/fairydreaming/llama.cpp/tree/minimax-text-01

Some major remaining problems:
- It currently doesn't support multiple token sequences. My current implementation of lightning attention simply ignores the...
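For context on why multiple token sequences are tricky here: lightning attention is a linear-attention variant that carries a recurrent state across tokens, so every independent sequence needs its own copy of that state. The snippet below is only a conceptual sketch of that recurrence (a decayed sum of k v^T outer products read out by the query), with made-up names and shapes; it is not the branch's actual kernel.

```cpp
// Conceptual sketch only (not the actual kernel in the branch): linear/lightning
// attention keeps a running state S of shape [d_k x d_v] per head, updated once
// per token as S = decay * S + k v^T and read out as o = q^T S. If tokens from
// different sequences share one state, they corrupt each other's history, which
// is why a multi-sequence batch needs one state per sequence id.
#include <cstddef>
#include <cstdio>
#include <vector>

struct LinearAttnState {
    size_t d_k, d_v;
    std::vector<float> S; // d_k * d_v accumulator of decayed k v^T outer products

    LinearAttnState(size_t dk, size_t dv) : d_k(dk), d_v(dv), S(dk * dv, 0.0f) {}

    // Process one token of one sequence: q, k have length d_k; v, out have length d_v.
    void step(const float * q, const float * k, const float * v, float decay, float * out) {
        for (size_t i = 0; i < d_k; ++i)
            for (size_t j = 0; j < d_v; ++j)
                S[i * d_v + j] = decay * S[i * d_v + j] + k[i] * v[j];

        for (size_t j = 0; j < d_v; ++j) {
            float acc = 0.0f;
            for (size_t i = 0; i < d_k; ++i)
                acc += q[i] * S[i * d_v + j];
            out[j] = acc;
        }
    }
};

int main() {
    const size_t d_k = 2, d_v = 2, n_seq = 2;
    // one recurrent state per sequence -- the part a single shared state gets wrong
    std::vector<LinearAttnState> states(n_seq, LinearAttnState(d_k, d_v));

    float q[d_k] = {1.0f, 0.0f}, k[d_k] = {0.5f, 0.5f}, v[d_v] = {1.0f, 2.0f}, out[d_v];
    for (size_t s = 0; s < n_seq; ++s) {
        states[s].step(q, k, v, 0.9f, out);
        printf("seq %zu: out = [%f, %f]\n", s, out[0], out[1]);
    }
    return 0;
}
```

The only point of the example is that the state has to be kept per sequence id; how the branch actually stores and updates it is a separate question.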
I noticed a problem with the model "eating" some words when asked to repeat text (Q5_K_M quant). Can someone with more RAM (like 512GB or 1TB) test this model with...
I found out about `llama_sbatch::split_equal`, so my branch now supports inference of multiple token sequences with llama-server. Prompt caching should be disabled for now, as it doesn't work correctly. Run the server...
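A note on what an "equal" split buys you here: a recurrent-state implementation advances one state per sequence, so it is convenient to feed it micro-batches in which every sequence that appears contributes the same number of tokens. The toy function below is only my loose illustration of that idea, with hypothetical names and no relation to the actual `llama_sbatch` code.

```cpp
// Toy illustration only -- not the real llama_sbatch::split_equal. Cut a batch of
// (seq_id, token) entries into micro-batches where every sequence that appears
// contributes the same number of tokens, so per-sequence recurrent states can be
// advanced in lockstep.
#include <cstdint>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

struct Ubatch {
    std::map<int, std::vector<int>> tokens_per_seq; // seq_id -> tokens in this micro-batch
};

static std::vector<Ubatch> split_equal_toy(const std::vector<std::pair<int, int>> & batch) {
    // group tokens by sequence id, preserving order
    std::map<int, std::vector<int>> per_seq;
    for (const auto & [seq, tok] : batch) per_seq[seq].push_back(tok);

    std::vector<Ubatch> out;
    std::map<int, size_t> off; // how many tokens of each sequence were consumed so far

    while (true) {
        // the shortest remaining sequence determines this micro-batch's per-sequence share
        size_t take = SIZE_MAX;
        for (const auto & [seq, toks] : per_seq) {
            size_t rem = toks.size() - off[seq];
            if (rem > 0 && rem < take) take = rem;
        }
        if (take == SIZE_MAX) break; // everything consumed

        Ubatch ub;
        for (const auto & [seq, toks] : per_seq) {
            size_t rem = toks.size() - off[seq];
            if (rem == 0) continue;
            ub.tokens_per_seq[seq].assign(toks.begin() + off[seq], toks.begin() + off[seq] + take);
            off[seq] += take;
        }
        out.push_back(std::move(ub));
    }
    return out;
}

int main() {
    // seq 0 has 3 tokens, seq 1 has 5: first ubatch takes 3 from each,
    // second ubatch takes the remaining 2 tokens of seq 1
    std::vector<std::pair<int, int>> batch = {{0,10},{0,11},{0,12},{1,20},{1,21},{1,22},{1,23},{1,24}};
    auto ubs = split_equal_toy(batch);
    for (size_t i = 0; i < ubs.size(); ++i) {
        printf("ubatch %zu:", i);
        for (const auto & [seq, toks] : ubs[i].tokens_per_seq)
            printf(" seq %d x %zu", seq, toks.size());
        printf("\n");
    }
    return 0;
}
```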
> [@fairydreaming](https://github.com/fairydreaming) I tested your branch with Q5_K_M. On my setup I see some missing "of". Tested on an AMD EPYC with 768 GB RAM. Can you share your full test command to...
@Nondzu OK, if it happens on Q8_0, then there's likely still some problem with my inference code, as I didn't observe this behavior via the OpenRouter API. Thanks for testing!
@ClarkChin08 I'm interested in your experiments with tensor parallelism on Xeon CPUs. Can you share more details about the hardware you used? What was the latency between cluster nodes?...
@kyteinsky Inference with T5-like models requires using some new API functions (e.g. `llama_encode()`). Without these, not only llama_cpp_python but all other software based on llama.cpp won't be able to...
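For readers unfamiliar with the encoder-decoder path: below is a rough sketch of the flow those functions enable, i.e. run `llama_encode()` once over the source tokens, then drive the decoder with `llama_decode()` starting from the decoder-start token. It assumes the llama.cpp C API of roughly that period; function names and signatures have changed across versions, so treat it as an illustration rather than a drop-in program.

```cpp
// Sketch of encoder-decoder (T5-style) inference with the llama.cpp C API.
// Exact signatures vary between llama.cpp versions; check llama.h for your build.
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 3) { fprintf(stderr, "usage: %s model.gguf \"prompt\"\n", argv[0]); return 1; }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(argv[1], mparams);

    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // tokenize the source text for the encoder
    std::string prompt = argv[2];
    std::vector<llama_token> inp(prompt.size() + 8);
    int n = llama_tokenize(model, prompt.c_str(), (int) prompt.size(),
                           inp.data(), (int) inp.size(),
                           /*add_special*/ true, /*parse_special*/ false);
    if (n < 0) { fprintf(stderr, "tokenization failed\n"); return 1; }
    inp.resize(n);

    // 1) run the encoder once over the whole source sequence
    llama_batch enc = llama_batch_get_one(inp.data(), (int) inp.size());
    if (llama_encode(ctx, enc) != 0) { fprintf(stderr, "llama_encode failed\n"); return 1; }

    // 2) start the decoder from the model's decoder-start token (fall back to BOS)
    llama_token dec_start = llama_model_decoder_start_token(model);
    if (dec_start == -1) dec_start = llama_token_bos(model);

    llama_batch dec = llama_batch_get_one(&dec_start, 1);
    if (llama_decode(ctx, dec) != 0) { fprintf(stderr, "llama_decode failed\n"); return 1; }

    // ...sampling loop feeding generated tokens back through llama_decode() goes here...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

The key difference from decoder-only models is the extra `llama_encode()` pass and the cross-attention state it fills in, which is what wrappers such as llama_cpp_python would need to expose before they can run T5-style models.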
~~@kyteinsky Sure, it is the python wrapper to use, but it's a completely separate project from llama.cpp. It's not part of the llama.cpp project, so I think feature requests like...