fairydreaming

Results: 85 comments of fairydreaming

I reimplemented this on the current master. This time I added a proper API call for enabling warmup mode:

```
LLAMA_API void llama_set_warmup(struct llama_context * ctx, bool warmup);
```
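For context, a minimal usage sketch of how such a call would typically be used; everything around the `llama_set_warmup` call is assumed application code, not part of the PR:

```
// Hypothetical warmup sequence (ctx is an already-initialized llama_context).
llama_set_warmup(ctx, true);   // enable warmup mode for the next decode
// ... perform one throw-away llama_decode() call here to page the weights in ...
llama_set_warmup(ctx, false);  // back to normal inference before serving real requests
```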

@nekiee13 Sorry, I've never used llama.cpp on Windows (why do you torture yourself with this abomination?) so I can't help with that. Maybe model loading simply takes a long time and you...

@nekiee13 Great that it worked for you! Regarding the MoE models, my initial tests on DeepSeek R1 also found limited improvement when using 2 CPUs, so now I'm going to...

I started working on this a few days ago and so far it's going well. I will post the code in a GitHub branch after cleaning it up a bit. https://www.youtube.com/watch?v=TX0eppc88TU

My code for brave souls:

1. Model conversion script based on @FailSpy's earlier work: https://github.com/fairydreaming/export-nemo-to-safetensors
2. llama.cpp Nemotron 4 branch: https://github.com/fairydreaming/llama.cpp/tree/nemotron

There is a new tokenizer in the code (it's...

@leafspark I have no idea what's wrong; maybe try installing the exact versions of the packages that I used: [convert-nemo-conda-pkgs.txt](https://github.com/user-attachments/files/16228179/convert-nemo-conda-pkgs.txt). I installed the latest versions of the packages from conda-forge.

@leafspark But this 847249408 number looks worrying (it's the length of the tensor data buffer); make sure that your model is fully downloaded. This tensor should have a buffer size of...
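A quick way to sanity-check a number like that is to compare the buffer length against the element count implied by the tensor's shape. A minimal sketch with made-up shape and element size (the real values depend on the tensor in question):

```
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Hypothetical tensor shape and element size, for illustration only.
    const int64_t dims[2]   = {4096, 256000}; // e.g. an embedding matrix
    const int64_t elem_size = 2;              // f16
    int64_t expected = elem_size;
    for (int i = 0; i < 2; i++) expected *= dims[i];
    printf("expected buffer size: %lld bytes\n", (long long) expected);
    // If the file's tensor data buffer is shorter than this, the download is truncated.
    return 0;
}
```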

> How much overlap is there between Nemotron and Mistral NeMo?
>
> https://mistral.ai/news/mistral-nemo/
>
> The Mistral blog post says that the model was developed in conjunction with NVidia,...

@ThomasBaruzier What is the context size that you use? I ask because the larger the context size, the more memory the KV buffer uses; I'm not sure if you take...
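As a rough back-of-the-envelope check, KV buffer memory grows linearly with context size; all model parameters below are hypothetical, just to show the shape of the calculation:

```
#include <stdio.h>

int main(void) {
    // KV cache memory estimate:
    // 2 (K and V) * n_layer * n_ctx * n_head_kv * head_dim * bytes per element.
    // All values below are made-up example numbers, not any specific model's.
    const long long n_layer = 32, n_ctx = 32768, n_head_kv = 8, head_dim = 128;
    const long long elem_size = 2; // f16
    long long kv_bytes = 2LL * n_layer * n_ctx * n_head_kv * head_dim * elem_size;
    printf("KV cache: %.2f GiB\n", kv_bytes / (1024.0 * 1024.0 * 1024.0)); // 4.00 GiB here
    return 0;
}
```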

> Sad to say I also found exactly the same performance on both HEAD as well as this PR branch. I tried a half-dozen times with llama-bench and re-verified I...