llama.cpp
Windows 64-bit, Microsoft Visual Studio - it works like a charm after those fixes!
First of all, tremendous work Georgi! I managed to run your project with a few small adjustments on:
- Intel(R) Core(TM) i7-10700T CPU @ 2.00GHz / 16 GB RAM, as a 64-bit app; it takes around 5 GB of RAM.


Here is the list of those small fixes:
- main.cpp: added ggml_time_init() at start of main (division by zero otherwise)
- quantize.cpp: same as above at start of main (division by zero otherwise)
- ggml.c: #define QK 32 moved to dedicated define.h (should not be in .c)
- ggml.c: replaced fopen with fopen_s (VS raises a security error otherwise)
- ggml.c: the following cast changes, due to 'expression must be a pointer or complete object type' errors:
  - 2x: (uint8_t*)(y changed to ((uint8_t*)y
  - 4x: (const uint8_t*)(x changed to ((const uint8_t*)x
  - 2x: (const uint8_t*)(y changed to ((const uint8_t*)y
- quantize.cpp: removed qk in ggml_quantize_q4_0 & ggml_quantize_q4_1 calls
- utils.cpp: use the QK constant instead of the runtime parameter value (MSVC does not support variable-length arrays, so it raises an error for uint8_t pp[qk / 2];) - see the sketch below this list
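For illustration, here is roughly what the ggml_time_init() fix and the fixed-size buffer fix look like (just a sketch; the surrounding code is paraphrased, not copied from the actual files):

// main.cpp and quantize.cpp: initialize the ggml timer first thing in main(),
// otherwise the Windows build divides by zero when computing timings.
int main(int argc, char ** argv) {
    ggml_time_init();
    // ... rest of main unchanged ...
}

// utils.cpp: MSVC does not support variable-length arrays, so the runtime-sized
//     uint8_t pp[qk / 2];   // error under MSVC
// becomes a buffer sized by the compile-time QK constant (QK == 32):
uint8_t pp[QK / 2];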
It would be really great if you could incorporate those small fixes.
Interesting: doing these changes (and a couple more hacks), I was able to run the 13B model on my HW (AMD Ryzen 7 3700X 8-Core Processor, 3593 MHz, 8 cores, 16 logical processors, 32 GB RAM) and get 268 ms per token, with around 8 GB of RAM usage!
I forced the usage of AVX2 and that gave a huge speed up.
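For context, forcing AVX2 comes down to a compiler flag: /arch:AVX2 with MSVC, or -mavx2 (plus -mfma) with gcc/clang. The invocation below is an assumption based on the repo's file names, not the exact commands used:

cl /O2 /arch:AVX2 main.cpp ggml.c utils.cpp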
@etra0 here are my 13B model tests, based on number of threads & AVX2 (thanks!):
| Threads | Default settings | With AVX2 |
|---------|------------------|-----------|
| 4       | 3809.57 ms per token | 495.08 ms per token |
| 8       | 3617.09 ms per token | 519.78 ms per token |
| 12      | 2967.79 ms per token | 490.53 ms per token |
Clearly AVX2 gives a huge boost. I see however that you are still way ahead with your 268 ms. What other optimizations do you have?
Yes, AVX2 flags are very important for high performance. Could you wrap these changes in a PR?
ggml.c: #define QK 32 moved to dedicated define.h (should not be in .c)
This is not very desirable - I don't want an extra file added. Although the QK constants everywhere are indeed problematic. Some other fix?
Could you wrap these changes in a PR?
I could do that, but I'm unsure whether to create a Solution, or move the project to CMake, because Windows doesn't support Make by default, sadly.
I always try to avoid Solutions because they're not multiplatform, but from looking at the makefile, rewriting it to CMake would take a bit more time. In the meantime I could do a PR to fix the things that won't compile.
CMake is better than Solutions. The https://github.com/ggerganov/whisper.cpp project has a CMake build system that is compatible with Windows, and the project is very similar. It should be easy to adapt.
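For reference, a minimal sketch of what such a CMakeLists.txt could look like; the target names and source lists here are assumptions based on the Makefile, not the actual port:

cmake_minimum_required(VERSION 3.12)
project(llama.cpp C CXX)

# language standards roughly matching the Makefile flags
set(CMAKE_C_STANDARD 11)
set(CMAKE_CXX_STANDARD 11)

# one executable per Makefile target (names assumed)
add_executable(llama main.cpp ggml.c utils.cpp)
add_executable(quantize quantize.cpp ggml.c utils.cpp)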
Great! These changes finally fixed compilation for me using the VS cl command (#2) and also CMake with @etra0's repo.
I get 140 ms per token on an i9900k and about 5 GB RAM usage with 7B.
Unfortunately, bigger prompts are kind of unusable. Don't know if it's a Windows issue or if this library just isn't optimized for this case yet. Making the hardcoded 512-token limit a parameter was an easy change, but it's too slow since it goes through all the prompt tokens.
@kamyker
Maybe the context size has to be increased - it's currently hardcoded to 512:
https://github.com/ggerganov/llama.cpp/blob/da1a4ff01f42d058cfa59806dd5679c0fe5a8604/main.cpp#L768
Haven't tested if it works with other values
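For anyone who wants to experiment, a hypothetical sketch of exposing that value as a flag; the field and option names follow the style of the existing gpt_params but are assumptions, not an actual patch:

// utils.h: add a context size field to gpt_params (name assumed)
int32_t n_ctx = 512; // context size, previously hardcoded in main.cpp

// utils.cpp, inside gpt_params_parse():
} else if (arg == "--ctx_size") {
    params.n_ctx = std::stoi(argv[++i]);

// main.cpp: pass params.n_ctx to llama_model_load() instead of the literal 512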
I didn't see your PR when I read the issue so went ahead and made one, very similar. I made the existing Makefile work on Unix and Microsoft nmake. https://github.com/ggerganov/llama.cpp/pull/36
Using the fix in #31, however, the results from 4-bit models are still repetitive nonsense. FP16 works, but the results are also very bad.
Relevant spec: Intel 13700K, 240 ms/token. Built make.exe with MinGW-w64.
@kamyker
Maybe the context size has to be increased - it's currently hardcoded to 512:
https://github.com/ggerganov/llama.cpp/blob/da1a4ff01f42d058cfa59806dd5679c0fe5a8604/main.cpp#L768
Haven't tested if it works with other values
As I said, I made a parameter out of it and it fixes longer prompts, but they are still slow. What I'm saying is that without some kind of quicker prompt loading/caching, this is very far from ChatGPT.
How does, let's say, a 300-token prompt perform for you?
Any chance we could publish binaries for windows?
@teknium1
Any chance we could publish binaries for windows?
Here https://github.com/jaykrell/llama.cpp/releases/tag/1 but perhaps that is kinda rude of me. I'll delete if there are objections.
Here is an updated fork based on the initial adjustments done by @etra0: Visual Studio 2022 - vsproj version.
@etra0, kindly asking you to merge my pull request and push it to the @ggerganov repo.
I don't think I'll merge this, sadly. I don't want to add solutions to the project, I'd rather go with the nmake solution or finish writing the CMake.
@jaykrell thank you for your work, I've tried it and it worked! However, the quantizer seemed to run but didn't produce any bin files (tried with 7B and 13B). I could still run the original model on an i5-9600K, about 10 times slower though. :D
@ggerganov would this support be merged to master?
Successfully compiled this on MSYS2 (UCRT).
I did the initial draft for CMake support which allows this to be built for Windows as well. You can check the PR at https://github.com/ggerganov/llama.cpp/pull/75.
If you pull my changes, you can build the project with the following instructions:
# Assuming you're using PowerShell
mkdir build
cd build
cmake ..
cmake --build . --config Release
That will build the two executables, quantize.exe and llama.exe; then you can use them from the root llama.cpp directory like:
./build/Release/llama.exe -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128
The PR is a draft because I need to also update the instructions I guess, but it's pretty much usable right now.
EDIT: You can also open the llama.cpp folder in Visual Studio 2019 or newer and it should detect the CMake settings automatically and then just build it.
cc @jinfagang.
That will build the two executables, quantize.exe and llama.exe; then you can use them from the root llama.cpp directory.
Small feedback: llama.exe should be renamed to main.exe somewhere to be consistent with the README commands.
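One way to do that in the CMake build would be to rename the target's output; this is just a suggestion, the PR may handle naming differently:

# CMakeLists.txt: emit main.exe instead of llama.exe to match the README
set_target_properties(llama PROPERTIES OUTPUT_NAME main)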
I was able to build with clang (from VS2022 prompt), without any changes:
clang -march=native -O3 -fuse-ld=lld-link -flto main.cpp ggml.c utils.cpp
clang -march=native -O3 -fuse-ld=lld-link -flto quantize.cpp ggml.c utils.cpp
Seems to be 10% faster (than timings in #39), ymmv.
I did the initial draft for CMake support which allows this to be built for Windows as well. You can check the PR at #75.
I installed the VS 2022 build tools, MSVC, and CMake, but I get this error:
C:\Users\quela\Downloads\LLaMA\llama.cpp\build>cmake ..
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version to target Windows 10.0.22621.
-- The C compiler identification is unknown
-- The CXX compiler identification is unknown
CMake Error at CMakeLists.txt:2 (project):
No CMAKE_C_COMPILER could be found.
CMake Error at CMakeLists.txt:2 (project):
No CMAKE_CXX_COMPILER could be found.
-- Configuring incomplete, errors occurred!
See also "C:/Users/quela/Downloads/LLaMA/llama.cpp/build/CMakeFiles/CMakeOutput.log".
See also "C:/Users/quela/Downloads/LLaMA/llama.cpp/build/CMakeFiles/CMakeError.log".
What am I doing wrong?
@Zerogoki00 From the looks of it, it seems that you have no C/C++ compiler. Did you make sure to select the C++ development workload when installing the build tools?
Builds fine for me.
Interactive mode doesn't work correctly; the program ends after the first generation.
main: seed = 1678814584
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: failed to open './models/7B/ggml-model-q4_0.bin'
main: failed to load model from './models/7B/ggml-model-q4_0.bin'
help me please
@1octopus1 those warnings are 'normal', as in they don't have to do with your errors. Did you do all the rest of the steps, quantize the model and all? The fixes mentioned here are just to build main (llama.exe) and quantize (quantize.exe); you still need to follow the rest of the README.
I know we still need to update the instructions for Windows, but I just haven't found the time yet.
Yes, I did everything according to the instructions. Okay, I'll wait for the updated instructions; I've already spent several hours trying to get it to start =) Please write them in detail, with each step =) Thank you very much.
Interactive mode is not working right. It returns to the Bash command prompt after the first message:
$ ./Release/llama.exe -m ../../../Users/ron/llama.cpp/models/7B/ggml-model-q4_0.bin -t 8 --repeat_penalty 1.2 --temp 0.9 --top_p 0.9 -n 256 --color -i -r "User:" -p "Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision."
Hello everyone) How do I install it, and how do I start and stop it on my PC? Can someone explain? I have an Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz, 12.0 GB RAM (available: 11.9 GB), Windows 11 Pro. I hope it will work fine.
Assuming you are at a VS2022 command prompt and you've installed git/cmake support through the VS Installer:
set PATH=%DevEnvDir%CommonExtensions\Microsoft\TeamFoundation\Team Explorer\Git\cmd;%PATH%
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release
If you installed another git then that first line might not be needed. Yeah, MS decided not to add git to the path, doh!
Building the repo gives you llama.exe and quantize.exe in the llama.cpp\build\Release directory. You'll need to convert and quantize the model by following the directions for that (roughly sketched below). I can't really help beyond that because I have a different build environment; I'm using clang from the terminal.
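For completeness, the conversion and quantization steps from the README look roughly like this (paths adjusted for the Windows build output; double-check against the current README):

# convert the 7B model to ggml FP16 format
python convert-pth-to-ggml.py models/7B/ 1

# quantize the model to 4 bits
.\build\Release\quantize.exe .\models\7B\ggml-model-f16.bin .\models\7B\ggml-model-q4_0.bin 2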
@eldash666 12GB might be tight.
@RedLeader721, interactive mode has several issues. First, #120 is needed for Windows support for the Ctrl-C handler. Second, it's possible for the reverse prompt to appear as different tokens and be ignored. Also, I'd try a better prompt (#199): give an example or two, lead the model toward what you want and it will follow.
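For the curious, a hypothetical sketch of what a Windows Ctrl-C handler could look like; the actual change is whatever lands in #120, and the is_interacting flag is assumed to exist as in the POSIX signal handler:

#include <windows.h>

// mirrors the POSIX sigint_handler used by interactive mode
static BOOL WINAPI console_ctrl_handler(DWORD ctrl_type) {
    if (ctrl_type == CTRL_C_EVENT) {
        is_interacting = true;  // hand control back to the user instead of exiting
        return TRUE;            // handled; don't terminate the process
    }
    return FALSE;
}

// in main(), before the generation loop:
// SetConsoleCtrlHandler(console_ctrl_handler, TRUE);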