fairydreaming

Results: 85 comments of fairydreaming

I reimplemented this on the current master. This time I added a proper API call for enabling warmup mode:

```
LLAMA_API void llama_set_warmup(struct llama_context * ctx, bool warmup);
```
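For context, a minimal usage sketch of how such a call would typically be used; everything around the `llama_set_warmup` call is assumed application code, not part of the PR:

```
// Hypothetical warmup sequence (ctx is an already-initialized llama_context).
llama_set_warmup(ctx, true);   // enable warmup mode for the next decode
// ... perform one throw-away llama_decode() call here to page the weights in ...
llama_set_warmup(ctx, false);  // back to normal inference before serving real requests
```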

@nekiee13 Sorry, I've never used llama.cpp on Windows (why do you torture yourself with this abomination?) so I can't help with that. Maybe model loading simply takes a long time and you...

@nekiee13 Great that it worked for you! Regarding the MoE models, my initial tests on DeepSeek R1 also found limited improvement when using 2 CPUs, so now I'm going to...

I started working on this a few days ago and so far it's going well. I will post the code in a GitHub branch after cleaning it up a bit. https://www.youtube.com/watch?v=TX0eppc88TU

My code for brave souls:

1. Model conversion script based on @FailSpy's earlier work: https://github.com/fairydreaming/export-nemo-to-safetensors
2. llama.cpp Nemotron 4 branch: https://github.com/fairydreaming/llama.cpp/tree/nemotron

There is a new tokenizer in the code (it's...

@leafspark I have no idea what's wrong; maybe try installing the exact versions of the packages that I used: [convert-nemo-conda-pkgs.txt](https://github.com/user-attachments/files/16228179/convert-nemo-conda-pkgs.txt). I installed the latest versions of the packages from conda-forge.

@leafspark But this 847249408 number looks worrying (it's the length of the tensor data buffer); make sure that your model is fully downloaded. This tensor should have a buffer size of...
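A quick way to sanity-check a number like that is to compare the buffer length against the element count implied by the tensor's shape. A minimal sketch with made-up shape and element size (the real values depend on the tensor in question):

```
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Hypothetical tensor shape and element size, for illustration only.
    const int64_t dims[2]   = {4096, 256000}; // e.g. an embedding matrix
    const int64_t elem_size = 2;              // f16
    int64_t expected = elem_size;
    for (int i = 0; i < 2; i++) expected *= dims[i];
    printf("expected buffer size: %lld bytes\n", (long long) expected);
    // If the file's tensor data buffer is shorter than this, the download is truncated.
    return 0;
}
```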

> How much overlap is there between Nemotron and Mistral NeMo?
>
> https://mistral.ai/news/mistral-nemo/
>
> The Mistral blog post says that the model was developed in conjunction with NVidia,...

@ThomasBaruzier What is the context size that you use? I ask because the larger the context size, the more memory the KV buffer uses; I'm not sure if you take...
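As a rough back-of-the-envelope check, KV buffer memory grows linearly with context size; all model parameters below are hypothetical, just to show the shape of the calculation:

```
#include <stdio.h>

int main(void) {
    // KV cache memory estimate:
    // 2 (K and V) * n_layer * n_ctx * n_head_kv * head_dim * bytes per element.
    // All values below are made-up example numbers, not any specific model's.
    const long long n_layer = 32, n_ctx = 32768, n_head_kv = 8, head_dim = 128;
    const long long elem_size = 2; // f16
    long long kv_bytes = 2LL * n_layer * n_ctx * n_head_kv * head_dim * elem_size;
    printf("KV cache: %.2f GiB\n", kv_bytes / (1024.0 * 1024.0 * 1024.0)); // 4.00 GiB here
    return 0;
}
```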

> Sad to say I also found exactly the same performance on both HEAD as well as this PR branch. I tried a half-dozen times with llama-bench and re-verified I...