PowerInfer
Add Windows CPU/GPU CMake support
Environment:
- w64devkit-1.21.0: provides the make toolchain; see details in https://github.com/ggerganov/llama.cpp/tree/master-ff966e7?tab=readme-ov-file#build

Usage:
- Run `w64devkit.exe`
- Use the `cd` command to reach the PowerInfer folder
- Run `make` (a consolidated example follows this list)
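For reference, a minimal end-to-end CPU build inside the w64devkit shell might look like the following; the clone location C:/src/PowerInfer is only an example path, not something the repository prescribes.

```sh
# Start w64devkit.exe, then inside its shell:
cd C:/src/PowerInfer   # wherever you cloned the PowerInfer repository
make                   # CPU-only build using the GNU make/GCC bundled with w64devkit
```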
Thanks @bobozi-cmd for your contribution! Would you mind deleting the unused code instead of commenting it out? That would give us a cleaner codebase.
That aside, LGTM! It will be tested and merged soon.
Windows GPU make method:
- Run `make LLAMA_CUBLAS=1` with w64devkit.exe to build the GPU version of PowerInfer

Note:
- If you encounter an error like this, check whether your CUDA_PATH contains a space; once that is fixed, run `make clean` and rebuild (a sketch follows this note).
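As one illustration of that note: the space-free location C:/CUDA/v12.1 below is purely hypothetical, so point CUDA_PATH at wherever your toolkit is actually reachable without spaces.

```sh
# Inside the w64devkit shell: a CUDA_PATH containing spaces
# (e.g. "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1") can break the Makefile rules.
echo "$CUDA_PATH"

# Hypothetical workaround: expose the toolkit at a space-free path, then rebuild from scratch.
export CUDA_PATH="C:/CUDA/v12.1"
make clean
make LLAMA_CUBLAS=1
```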
CUDA 12.1 on Windows won't compile:
cc -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -march=native -mtune=native -Xassembler -muse-unaligned-vector-move -c ggml.c -o ggml.o
ggml.c:48:23: error: conflicting type qualifiers for 'atomic_int'
48 | typedef volatile LONG atomic_int;
| ^~~~~~~~~~
In file included from ggml.h:216,
from ggml-impl.h:3,
from ggml.c:4:
E:/Langchain-Chatchat/w64devkit/lib/gcc/x86_64-w64-mingw32/13.2.0/include/stdatomic.h:46:21: note: previous declaration of 'atomic_int' with type 'atomic_int' {aka '_Atomic int'}
46 | typedef _Atomic int atomic_int;
| ^~~~~~~~~~
ggml.c:49:20: error: conflicting type qualifiers for 'atomic_bool'
49 | typedef atomic_int atomic_bool;
| ^~~~~~~~~~~
E:/Langchain-Chatchat/w64devkit/lib/gcc/x86_64-w64-mingw32/13.2.0/include/stdatomic.h:40:23: note: previous declaration of 'atomic_bool' with type 'atomic_bool' {aka '_Atomic _Bool'}
40 | typedef _Atomic _Bool atomic_bool;
| ^~~~~~~~~~~
ggml.c:51:13: error: expected identifier or '(' before '__extension__'
51 | static void atomic_store(atomic_int * ptr, LONG val) {
| ^~~~~~~~~~~~
ggml.c:51:13: error: expected identifier or '(' before ')' token
51 | static void atomic_store(atomic_int * ptr, LONG val) {
| ^~~~~~~~~~~~
ggml.c:54:13: error: expected identifier or '(' before '__extension__'
54 | static LONG atomic_load(atomic_int * ptr) {
| ^~~~~~~~~~~
ggml.c:54:13: error: expected identifier or '(' before ')' token
54 | static LONG atomic_load(atomic_int * ptr) {
| ^~~~~~~~~~~
ggml.c:57:13: error: expected declaration specifiers or '...' before '(' token
57 | static LONG atomic_fetch_add(atomic_int * ptr, LONG inc) {
| ^~~~~~~~~~~~~~~~
ggml.c:57:13: error: expected declaration specifiers or '...' before '(' token
57 | static LONG atomic_fetch_add(atomic_int * ptr, LONG inc) {
| ^~~~~~~~~~~~~~~~
ggml.c:57:13: error: expected declaration specifiers or '...' before numeric constant
57 | static LONG atomic_fetch_add(atomic_int * ptr, LONG inc) {
| ^~~~~~~~~~~~~~~~
ggml.c:60:13: error: expected declaration specifiers or '...' before '(' token
60 | static LONG atomic_fetch_sub(atomic_int * ptr, LONG dec) {
| ^~~~~~~~~~~~~~~~
ggml.c:60:13: error: expected declaration specifiers or '...' before '(' token
60 | static LONG atomic_fetch_sub(atomic_int * ptr, LONG dec) {
| ^~~~~~~~~~~~~~~~
ggml.c:60:13: error: expected declaration specifiers or '...' before numeric constant
60 | static LONG atomic_fetch_sub(atomic_int * ptr, LONG dec) {
| ^~~~~~~~~~~~~~~~
ggml.c: In function 'ggml_graph_compute_thread':
ggml.c:16993:30: warning: initialization of '_Atomic atomic_int *' {aka '_Atomic int *'} from incompatible pointer type 'volatile atomic_int *' {aka 'volatile long int *'} [-Wincompatible-pointer-types]
16993 | /*.aic =*/ &state->shared->aic,
| ^
ggml.c:16993:30: note: (near initialization for 'params.aic')
ggml.c:17077:26: warning: initialization of '_Atomic atomic_int *' {aka '_Atomic int *'} from incompatible pointer type 'volatile atomic_int *' {aka 'volatile long int *'} [-Wincompatible-pointer-types]
17077 | /*.aic =*/ &state->shared->aic,
| ^
ggml.c:17077:26: note: (near initialization for 'params.aic')
ggml.c: In function 'ggml_graph_compute_thread_hybrid':
ggml.c:17183:30: warning: initialization of '_Atomic atomic_int *' {aka '_Atomic int *'} from incompatible pointer type 'volatile atomic_int *' {aka 'volatile long int *'} [-Wincompatible-pointer-types]
17183 | /*.aic =*/ &state->shared->aic,
| ^
ggml.c:17183:30: note: (near initialization for 'params.aic')
ggml.c:17284:26: warning: initialization of '_Atomic atomic_int *' {aka '_Atomic int *'} from incompatible pointer type 'volatile atomic_int *' {aka 'volatile long int *'} [-Wincompatible-pointer-types]
17284 | /*.aic =*/ &state->shared->aic,
| ^
ggml.c:17284:26: note: (near initialization for 'params.aic')
make: *** [Makefile:533: ggml.o] Error 1
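For context, the errors above come from ggml.c's Windows fallback (`typedef volatile LONG atomic_int;`) colliding with the real C11 `<stdatomic.h>` that MinGW GCC 13 ships. A quick diagnostic sketch (not a fix) to confirm which compiler and header w64devkit is picking up:

```sh
# Show the GCC version bundled with w64devkit (13.2.0 in the log above)
cc --version
# Locate GCC's own <stdatomic.h> and show its conflicting atomic_int typedef
grep -n "atomic_int" "$(cc -print-file-name=include)/stdatomic.h"
```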
Intel 12100F, 16 GB 3200 MHz, PCIe 4.0 SSD, NVIDIA RTX 3060 12 GB, Windows 11, Python 3.11, Visual Studio 2022, CUDA 12.1, 546.33 driver with sysmem fallback. https://github.com/bobozi-cmd/PowerInfer/commit/9593fe605257a1e7362a25a1019c67b0a5194c5a
git clone https://github.com/bobozi-cmd/PowerInfer
cd PowerInfer
pip install -r requirements.txt
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
C:\Users\Windows\AI\powerinfer.exe --vram-budget 12 -t 7 -m C:\Users\Windows\AI\llama-7b-relu.powerinfer.gguf --repeat_penalty 1 --no-penalize-nl --color --temp 0 --top-k 50 --top-p 1 -c 2048 -n 256 --seed 1 -p "Once upon the time"
llama_print_timings: load time = 8458.75 ms
llama_print_timings: sample time = 14.00 ms / 179 runs ( 0.08 ms per token, 12789.37 tokens per second)
llama_print_timings: prompt eval time = 13564.37 ms / 5 tokens ( 2712.87 ms per token, 0.37 tokens per second)
llama_print_timings: eval time = 36187.15 ms / 178 runs ( 203.30 ms per token, 4.92 tokens per second)
llama_print_timings: total time = 49795.77 ms
It works but it's 10 times slower (4.92 tokens per second vs 49.04 tokens per second on llama.cpp).
There are still many open issues on Windows, and the current test results may not be reliable. We are working on fixes; please bear with us, thank you.
After discussion with @bobozi-cmd, we will continue the work on #114 and close this PR.
Hello, thanks for your efforts. I have been trying to run on Windows through w64devkit, and while it worked and offloaded to the GPU, it is very slow. I used the llama-7b model; it takes more than a few seconds to generate one token.
My device: RTX 2060.