llama.cpp
Windows 64-bit, Microsoft Visual Studio - it works like a charm after those fixes!
First of all, tremendous work Georgi! I managed to run your project with a few small adjustments on:
- Intel(R) Core(TM) i7-10700T CPU @ 2.00GHz / 16 GB RAM, as a 64-bit app; it takes around 5 GB of RAM.


Here is the list of those small fixes:
- main.cpp: added ggml_time_init() at start of main (division by zero otherwise)
- quantize.cpp: same as above at start of main (division by zero otherwise)
- ggml.c: #define QK 32 moved to dedicated define.h (should not be in .c)
- ggml.c: replaced fopen with fopen_s (VS raises a security error otherwise)
- ggml.c: the following cast changes, due to 'expression must be a pointer or complete object type' errors:
  - 2x: (uint8_t*)(y changed to ((uint8_t*)y
  - 4x: (const uint8_t*)(x changed to ((const uint8_t*)x
  - 2x: (const uint8_t*)(y changed to ((const uint8_t*)y
- quantize.cpp: removed qk in ggml_quantize_q4_0 & ggml_quantize_q4_1 calls
- utils.cpp: use the QK constant instead of the runtime parameter value (MSVC does not support variable-length arrays, so it raises an error for uint8_t pp[qk / 2];) - see the sketch below this list
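For illustration, here is roughly what the ggml_time_init() fix and the fixed-size buffer fix look like (just a sketch; the surrounding code is paraphrased, not copied from the actual files):

// main.cpp and quantize.cpp: initialize the ggml timer first thing in main(),
// otherwise the Windows build divides by zero when computing timings.
int main(int argc, char ** argv) {
    ggml_time_init();
    // ... rest of main unchanged ...
}

// utils.cpp: MSVC does not support variable-length arrays, so the runtime-sized
//     uint8_t pp[qk / 2];   // error under MSVC
// becomes a buffer sized by the compile-time QK constant (QK == 32):
uint8_t pp[QK / 2];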
It would be really great if you could incorporate those small fixes.
Interesting: doing these changes (and a couple more hacks), I was able to run the 13B model on my HW (AMD Ryzen 7 3700X 8-Core Processor, 3593 MHz, 8 cores, 16 logical processors, 32 GB RAM) and get 268 ms per token, with around 8 GB of RAM usage!
I forced the usage of AVX2 and that gave a huge speed up.
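For context, forcing AVX2 comes down to a compiler flag: /arch:AVX2 with MSVC, or -mavx2 (plus -mfma) with gcc/clang. The invocation below is an assumption based on the repo's file names, not the exact commands used:

cl /O2 /arch:AVX2 main.cpp ggml.c utils.cpp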
@etra0 here are my 13B model tests, based on number of threads & AVX2 (thanks!):
| Threads | Default settings | With AVX2 |
|---------|------------------|-----------|
| 4       | 3809.57 ms per token | 495.08 ms per token |
| 8       | 3617.09 ms per token | 519.78 ms per token |
| 12      | 2967.79 ms per token | 490.53 ms per token |
Clearly AVX2 gives a huge boost. I see however that you are still way ahead with your 268 ms. What other optimizations do you have?
Yes, AVX2 flags are very important for high performance. Could you wrap these changes in a PR?
ggml.c: #define QK 32 moved to dedicated define.h (should not be in .c)
This is not very desirable - I don't want an extra file added. Although the QK constants everywhere are indeed problematic. Some other fix?
Could you wrap these changes in a PR?
I could do that, but I'm unsure whether to create a Solution, or move the project to CMake, because Windows doesn't support Make by default, sadly.
I always try to avoid Solutions because they're not multiplatform, but from looking at the makefile, rewriting it to CMake would take a bit more time. In the meantime I could do a PR to fix the things that won't compile.
CMake is better than Solutions. The https://github.com/ggerganov/whisper.cpp project has a CMake build system that is compatible with Windows, and the project is very similar. It should be easy to adapt.
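For reference, a minimal sketch of what such a CMakeLists.txt could look like; the target names and source lists here are assumptions based on the Makefile, not the actual port:

cmake_minimum_required(VERSION 3.12)
project(llama.cpp C CXX)

# language standards roughly matching the Makefile flags
set(CMAKE_C_STANDARD 11)
set(CMAKE_CXX_STANDARD 11)

# one executable per Makefile target (names assumed)
add_executable(llama main.cpp ggml.c utils.cpp)
add_executable(quantize quantize.cpp ggml.c utils.cpp)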
Great! These changes finally fixed compilation for me using the VS cl command (#2) and also CMake with @etra0's repo.
I get 140 ms per token on an i9900k and about 5 GB RAM usage with 7B.
Unfortunately, bigger prompts are kind of unusable. Don't know if it's a Windows issue or if this library just isn't optimized for this case yet. Making the hardcoded 512-token limit a parameter was an easy change, but it's too slow since it goes through all the prompt tokens.
@kamyker
Maybe the context size has to be increased - it's currently hardcoded to 512:
https://github.com/ggerganov/llama.cpp/blob/da1a4ff01f42d058cfa59806dd5679c0fe5a8604/main.cpp#L768
Haven't tested if it works with other values
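For anyone who wants to experiment, a hypothetical sketch of exposing that value as a flag; the field and option names follow the style of the existing gpt_params but are assumptions, not an actual patch:

// utils.h: add a context size field to gpt_params (name assumed)
int32_t n_ctx = 512; // context size, previously hardcoded in main.cpp

// utils.cpp, inside gpt_params_parse():
} else if (arg == "--ctx_size") {
    params.n_ctx = std::stoi(argv[++i]);

// main.cpp: pass params.n_ctx to llama_model_load() instead of the literal 512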
I didn't see your PR when I read the issue so went ahead and made one, very similar. I made the existing Makefile work on Unix and Microsoft nmake. https://github.com/ggerganov/llama.cpp/pull/36
Using the fix in #31, however, the results from 4-bit models are still repetitive nonsense. FP16 works, but the results are also very bad.
Relevant spec: Intel 13700K, 240 ms/token. Built make.exe with MinGW-w64.
@kamyker
Maybe the context size has to be increased - it's currently hardcoded to 512:
https://github.com/ggerganov/llama.cpp/blob/da1a4ff01f42d058cfa59806dd5679c0fe5a8604/main.cpp#L768
Haven't tested if it works with other values
As I said, I made a parameter out of it and it fixes longer prompts, but they are still slow. What I'm saying is that without some kind of quicker prompt loading/caching, this is very far from ChatGPT.
How does, let's say, a 300-token prompt perform for you?
Any chance we could publish binaries for windows?
@teknium1
Any chance we could publish binaries for windows?
Here https://github.com/jaykrell/llama.cpp/releases/tag/1 but perhaps that is kinda rude of me. I'll delete if there are objections.
Here is an updated fork based on the initial adjustments done by @etra0: Visual Studio 2022 - vsproj version.
@etra0, kindly asking you to merge my pull request and push it to the @ggerganov repo.
I don't think I'll merge this, sadly. I don't want to add solutions to the project, I'd rather go with the nmake solution or finish writing the CMake.
@jaykrell thank you for your work, I've tried it and it worked! However, the quantizer seemed to run but didn't produce any bin files (tried with 7B and 13B). I could still run the original model on an i5-9600K, about 10 times slower though. :D
@ggerganov would this support be merged to master?
Successfully compiled this on MSYS2 (UCRT).
I did the initial draft for CMake support which allows this to be built for Windows as well. You can check the PR at https://github.com/ggerganov/llama.cpp/pull/75.
If you pull my changes, you can build the project with the following instructions:
# Assuming you're using PowerShell
mkdir build
cd build
cmake ..
cmake --build . --config Release
That will build the two executables, quantize.exe and llama.exe; then you can use them from the root llama.cpp directory like:
./build/Release/llama.exe -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128
The PR is a draft because I need to also update the instructions I guess, but it's pretty much usable right now.
EDIT: You can also open the llama.cpp folder in Visual Studio 2019 or newer and it should detect the CMake settings automatically and then just build it.
cc @jinfagang.
That will build the two executables, quantize.exe and llama.exe; then you can use them from the root llama.cpp directory.
Small feedback: llama.exe should be renamed to main.exe somewhere to be consistent with the README commands.
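One way to do that in the CMake build would be to rename the target's output; this is just a suggestion, the PR may handle naming differently:

# CMakeLists.txt: emit main.exe instead of llama.exe to match the README
set_target_properties(llama PROPERTIES OUTPUT_NAME main)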
I was able to build with clang (from VS2022 prompt), without any changes:
clang -march=native -O3 -fuse-ld=lld-link -flto main.cpp ggml.c utils.cpp
clang -march=native -O3 -fuse-ld=lld-link -flto quantize.cpp ggml.c utils.cpp
Seems to be 10% faster (than timings in #39), ymmv.
I did the initial draft for CMake support which allows this to be built for Windows as well. You can check the PR at #75.
I installed the VS 2022 build tools, MSVC, and CMake, but I get this error:
C:\Users\quela\Downloads\LLaMA\llama.cpp\build>cmake ..
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version to target Windows 10.0.22621.
-- The C compiler identification is unknown
-- The CXX compiler identification is unknown
CMake Error at CMakeLists.txt:2 (project):
No CMAKE_C_COMPILER could be found.
CMake Error at CMakeLists.txt:2 (project):
No CMAKE_CXX_COMPILER could be found.
-- Configuring incomplete, errors occurred!
See also "C:/Users/quela/Downloads/LLaMA/llama.cpp/build/CMakeFiles/CMakeOutput.log".
See also "C:/Users/quela/Downloads/LLaMA/llama.cpp/build/CMakeFiles/CMakeError.log".
What am I doing wrong?
@Zerogoki00 From the looks of it, it seems that you have no C/C++ compiler. Did you make sure to select the C++ development workload when installing the build tools?
Builds fine for me.
Interactive mode doesn't work correctly; the program ends after the first generation.
main: seed = 1678814584
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: failed to open './models/7B/ggml-model-q4_0.bin'
main: failed to load model from './models/7B/ggml-model-q4_0.bin'
help me please
@1octopus1 those warnings are 'normal', as in they don't have to do with your errors. Did you do all the rest of the steps, quantize the model and all? The fixes mentioned here are just to build main (llama.exe) and quantize (quantize.exe); you still need to follow the rest of the README.
I know we still need to update the instructions for Windows, but I just haven't found the time yet.
Yes, I did everything according to the instructions. Okay, I'll wait for the updated instructions; I've already spent several hours trying to get it to start =) Please write them in detail, with each step =) Thank you very much.
Interactive mode is not working right. It returns to the Bash command prompt after the first message:
$ ./Release/llama.exe -m ../../../Users/ron/llama.cpp/models/7B/ggml-model-q4_0.bin -t 8 --repeat_penalty 1.2 --temp 0.9 --top_p 0.9 -n 256 --color -i -r "User:" -p "Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision."
Hello everyone) How do I install it, and how do I start and stop it on my PC? Can someone explain? I have an Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz, 12.0 GB RAM (available: 11.9 GB), Windows 11 Pro. I hope it will work fine.
Assuming you are at a VS2022 command prompt and you've installed git/cmake support through the VS Installer:
set PATH=%DevEnvDir%CommonExtensions\Microsoft\TeamFoundation\Team Explorer\Git\cmd;%PATH%
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release
If you installed another git then that first line might not be needed. Yeah, MS decided not to add git to the path, doh!
Building the repo gives you llama.exe and quantize.exe in the llama.cpp\build\Release directory. You'll need to convert and quantize the model by following the directions for that (roughly sketched below). I can't really help beyond that because I have a different build environment; I'm using clang from the terminal.
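For completeness, the conversion and quantization steps from the README look roughly like this (paths adjusted for the Windows build output; double-check against the current README):

# convert the 7B model to ggml FP16 format
python convert-pth-to-ggml.py models/7B/ 1

# quantize the model to 4 bits
.\build\Release\quantize.exe .\models\7B\ggml-model-f16.bin .\models\7B\ggml-model-q4_0.bin 2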
@eldash666 12GB might be tight.
@RedLeader721, interactive mode has several issues. First, #120 is needed for Windows support for the Ctrl-C handler. Second, it's possible for the reverse prompt to appear as different tokens and be ignored. Also, I'd try a better prompt (#199): give an example or two, lead the model toward what you want and it will follow.
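For the curious, a hypothetical sketch of what a Windows Ctrl-C handler could look like; the actual change is whatever lands in #120, and the is_interacting flag is assumed to exist as in the POSIX signal handler:

#include <windows.h>

// mirrors the POSIX sigint_handler used by interactive mode
static BOOL WINAPI console_ctrl_handler(DWORD ctrl_type) {
    if (ctrl_type == CTRL_C_EVENT) {
        is_interacting = true;  // hand control back to the user instead of exiting
        return TRUE;            // handled; don't terminate the process
    }
    return FALSE;
}

// in main(), before the generation loop:
// SetConsoleCtrlHandler(console_ctrl_handler, TRUE);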