Martin Evans
What type of memory does your server have? Language models are usually limited mostly by memory bandwidth.
That's actually intentional; it's an approximation copied from llama.cpp. CPU utilisation isn't the right thing to measure, you need to look at tokens per second; if you're memory bound adding...
Sorry, by "memory bound" I didn't mean quantity; it would have been more correct to say memory **bandwidth** bound. That's usually the limiting factor for LLMs.
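As a rough back-of-envelope example (numbers assumed, not measured): a ~4 GB Q4-quantised 7B model on a machine with ~50 GB/s of usable memory bandwidth tops out around 50 / 4 ≈ 12 tokens/s, because every generated token has to stream essentially the whole set of weights through memory. Extra cores can't push past that ceiling once the bandwidth is saturated.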
The problem is that this is extremely hardware dependent. For example, on my own PC (16 physical cores with hyperthreading, so 32 logical cores):

| threads | time |
|---------|------|
|...
For reference (if anyone wants to modify it), the default is implemented here: https://github.com/SciSharp/LLamaSharp/blob/master/LLama/Extensions/IContextParamsExtensions.cs#L53
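To sketch roughly what that default does and how to bypass it from application code (the property name `ModelParams.Threads` and the exact fallback expression here are assumptions for illustration, not copied from that file):

```csharp
using System;
using LLama.Common;

// Illustrative only: the real default lives in IContextParamsExtensions.cs (link above)
// and may use a slightly different expression.
static int DefaultThreadCount(int? requested)
{
    // Fall back to roughly half the logical cores, mirroring the llama.cpp-style
    // approximation mentioned earlier; otherwise honour whatever the caller set.
    return requested is > 0
        ? requested.Value
        : Math.Max(Environment.ProcessorCount / 2, 1);
}

// Overriding the heuristic explicitly (assumed LLamaSharp property name):
var parameters = new ModelParams("model.gguf")
{
    Threads = 8  // pick whatever value benchmarks best on your hardware
};
```

In practice the only reliable way to pick that number is to sweep a few thread counts and compare tokens per second, as in the table above.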
To be clear, I have 32 **logical** cores (i.e. `Environment.ProcessorCount == 32`), so that's why I tested all the way to 32 (I'm using a [Ryzen 7950X](https://en.wikipedia.org/wiki/List_of_AMD_Ryzen_processors#Ryzen_7000_series)). > For optimal...
That's interesting! Definitely looks like it could be close to what we want. Do you know how this behaves on Linux/macOS (i.e. does it run but return no results, or...
Unfortunately it looks like GitHub Actions doesn't have Windows+ARM available ([docs](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories)) :( Edit: Note that's not a total blocker for this. I think we could cross-compile the DLLs from...
The current LLamaSharp version (0.15.0) is compatible with llama.cpp [b3479](https://github.com/ggerganov/llama.cpp/releases/tag/b3479). You need to make sure you're using that version if you're loading custom binaries.
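If you do need to point LLamaSharp at your own llama.cpp build (from tag b3479 in this case), something along these lines is the usual approach; the exact configuration API has moved around between releases, so treat the call below as an assumption and check the docs for the version you're on:

```csharp
using LLama.Native;

// Assumed API sketch: tell LLamaSharp to load a custom native llama.cpp binary
// instead of the bundled backend package. This must run before the first model
// is loaded, otherwise the default library will already be in memory.
NativeLibraryConfig.Instance.WithLibrary("/path/to/custom/libllama.so");
```

Mixing a custom binary built from a different llama.cpp tag than the one the LLamaSharp release targets is the most common cause of crashes or garbage output, so match the tags first before debugging anything else.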
I believe #565 should support this model, but I haven't tested it. @SidAtBluB0X if you could pull that branch and test that model out it'd be very helpful :)