PowerInfer
Add Windows CPU/GPU CMake support
Environment:
- w64devkit-1.21.0: provides the make toolchain; see details in https://github.com/ggerganov/llama.cpp/tree/master-ff966e7?tab=readme-ov-file#build

Usage:
- Run `w64devkit.exe`
- Use the `cd` command to reach the PowerInfer folder
- Run `make` (a consolidated example follows this list)
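For reference, a minimal end-to-end CPU build inside the w64devkit shell might look like the following; the clone location C:/src/PowerInfer is only an example path, not something the repository prescribes.

```sh
# Start w64devkit.exe, then inside its shell:
cd C:/src/PowerInfer   # wherever you cloned the PowerInfer repository
make                   # CPU-only build using the GNU make/GCC bundled with w64devkit
```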
Thanks @bobozi-cmd for your contribution! Would you mind deleting the unused code instead of commenting it out? That would give us a cleaner codebase.
That aside, LGTM! It will be tested and merged soon.
Windows GPU make method:
- Run `make LLAMA_CUBLAS=1` with w64devkit.exe to build the GPU version of PowerInfer

Note:
- If you encounter an error like this, check whether your CUDA_PATH contains a space; once that is fixed, run `make clean` and rebuild (a sketch follows this note).
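As one illustration of that note: the space-free location C:/CUDA/v12.1 below is purely hypothetical, so point CUDA_PATH at wherever your toolkit is actually reachable without spaces.

```sh
# Inside the w64devkit shell: a CUDA_PATH containing spaces
# (e.g. "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1") can break the Makefile rules.
echo "$CUDA_PATH"

# Hypothetical workaround: expose the toolkit at a space-free path, then rebuild from scratch.
export CUDA_PATH="C:/CUDA/v12.1"
make clean
make LLAMA_CUBLAS=1
```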
CUDA 12.1 on Windows won't compile:
cc -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -march=native -mtune=native -Xassembler -muse-unaligned-vector-move -c ggml.c -o ggml.o
ggml.c:48:23: error: conflicting type qualifiers for 'atomic_int'
48 | typedef volatile LONG atomic_int;
| ^~~~~~~~~~
In file included from ggml.h:216,
from ggml-impl.h:3,
from ggml.c:4:
E:/Langchain-Chatchat/w64devkit/lib/gcc/x86_64-w64-mingw32/13.2.0/include/stdatomic.h:46:21: note: previous declaration of 'atomic_int' with type 'atomic_int' {aka '_Atomic int'}
46 | typedef _Atomic int atomic_int;
| ^~~~~~~~~~
ggml.c:49:20: error: conflicting type qualifiers for 'atomic_bool'
49 | typedef atomic_int atomic_bool;
| ^~~~~~~~~~~
E:/Langchain-Chatchat/w64devkit/lib/gcc/x86_64-w64-mingw32/13.2.0/include/stdatomic.h:40:23: note: previous declaration of 'atomic_bool' with type 'atomic_bool' {aka '_Atomic _Bool'}
40 | typedef _Atomic _Bool atomic_bool;
| ^~~~~~~~~~~
ggml.c:51:13: error: expected identifier or '(' before '__extension__'
51 | static void atomic_store(atomic_int * ptr, LONG val) {
| ^~~~~~~~~~~~
ggml.c:51:13: error: expected identifier or '(' before ')' token
51 | static void atomic_store(atomic_int * ptr, LONG val) {
| ^~~~~~~~~~~~
ggml.c:54:13: error: expected identifier or '(' before '__extension__'
54 | static LONG atomic_load(atomic_int * ptr) {
| ^~~~~~~~~~~
ggml.c:54:13: error: expected identifier or '(' before ')' token
54 | static LONG atomic_load(atomic_int * ptr) {
| ^~~~~~~~~~~
ggml.c:57:13: error: expected declaration specifiers or '...' before '(' token
57 | static LONG atomic_fetch_add(atomic_int * ptr, LONG inc) {
| ^~~~~~~~~~~~~~~~
ggml.c:57:13: error: expected declaration specifiers or '...' before '(' token
57 | static LONG atomic_fetch_add(atomic_int * ptr, LONG inc) {
| ^~~~~~~~~~~~~~~~
ggml.c:57:13: error: expected declaration specifiers or '...' before numeric constant
57 | static LONG atomic_fetch_add(atomic_int * ptr, LONG inc) {
| ^~~~~~~~~~~~~~~~
ggml.c:60:13: error: expected declaration specifiers or '...' before '(' token
60 | static LONG atomic_fetch_sub(atomic_int * ptr, LONG dec) {
| ^~~~~~~~~~~~~~~~
ggml.c:60:13: error: expected declaration specifiers or '...' before '(' token
60 | static LONG atomic_fetch_sub(atomic_int * ptr, LONG dec) {
| ^~~~~~~~~~~~~~~~
ggml.c:60:13: error: expected declaration specifiers or '...' before numeric constant
60 | static LONG atomic_fetch_sub(atomic_int * ptr, LONG dec) {
| ^~~~~~~~~~~~~~~~
ggml.c: In function 'ggml_graph_compute_thread':
ggml.c:16993:30: warning: initialization of '_Atomic atomic_int *' {aka '_Atomic int *'} from incompatible pointer type 'volatile atomic_int *' {aka 'volatile long int *'} [-Wincompatible-pointer-types]
16993 | /*.aic =*/ &state->shared->aic,
| ^
ggml.c:16993:30: note: (near initialization for 'params.aic')
ggml.c:17077:26: warning: initialization of '_Atomic atomic_int *' {aka '_Atomic int *'} from incompatible pointer type 'volatile atomic_int *' {aka 'volatile long int *'} [-Wincompatible-pointer-types]
17077 | /*.aic =*/ &state->shared->aic,
| ^
ggml.c:17077:26: note: (near initialization for 'params.aic')
ggml.c: In function 'ggml_graph_compute_thread_hybrid':
ggml.c:17183:30: warning: initialization of '_Atomic atomic_int *' {aka '_Atomic int *'} from incompatible pointer type 'volatile atomic_int *' {aka 'volatile long int *'} [-Wincompatible-pointer-types]
17183 | /*.aic =*/ &state->shared->aic,
| ^
ggml.c:17183:30: note: (near initialization for 'params.aic')
ggml.c:17284:26: warning: initialization of '_Atomic atomic_int *' {aka '_Atomic int *'} from incompatible pointer type 'volatile atomic_int *' {aka 'volatile long int *'} [-Wincompatible-pointer-types]
17284 | /*.aic =*/ &state->shared->aic,
| ^
ggml.c:17284:26: note: (near initialization for 'params.aic')
make: *** [Makefile:533: ggml.o] Error 1
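For context, the errors above come from ggml.c's Windows fallback (`typedef volatile LONG atomic_int;`) colliding with the real C11 `<stdatomic.h>` that MinGW GCC 13 ships. A quick diagnostic sketch (not a fix) to confirm which compiler and header w64devkit is picking up:

```sh
# Show the GCC version bundled with w64devkit (13.2.0 in the log above)
cc --version
# Locate GCC's own <stdatomic.h> and show its conflicting atomic_int typedef
grep -n "atomic_int" "$(cc -print-file-name=include)/stdatomic.h"
```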
Intel 12100F, 16 GB 3200 MHz, PCIe 4.0 SSD, NVIDIA RTX 3060 12 GB, Windows 11, Python 3.11, Visual Studio 2022, CUDA 12.1, 546.33 driver with sysmem fallback. https://github.com/bobozi-cmd/PowerInfer/commit/9593fe605257a1e7362a25a1019c67b0a5194c5a
git clone https://github.com/bobozi-cmd/PowerInfer
cd PowerInfer
pip install -r requirements.txt
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
C:\Users\Windows\AI\powerinfer.exe --vram-budget 12 -t 7 -m C:\Users\Windows\AI\llama-7b-relu.powerinfer.gguf --repeat_penalty 1 --no-penalize-nl --color --temp 0 --top-k 50 --top-p 1 -c 2048 -n 256 --seed 1 -p "Once upon the time"
llama_print_timings: load time = 8458.75 ms
llama_print_timings: sample time = 14.00 ms / 179 runs ( 0.08 ms per token, 12789.37 tokens per second)
llama_print_timings: prompt eval time = 13564.37 ms / 5 tokens ( 2712.87 ms per token, 0.37 tokens per second)
llama_print_timings: eval time = 36187.15 ms / 178 runs ( 203.30 ms per token, 4.92 tokens per second)
llama_print_timings: total time = 49795.77 ms
It works but it's 10 times slower (4.92 tokens per second vs 49.04 tokens per second on llama.cpp).
There are still many open issues on Windows, and the current test results may not be reliable. We are working on fixes; please bear with us, thank you.
After discussion with @bobozi-cmd, we will continue the work on #114 and close this PR.
Hello, thanks for your efforts. I have been trying to run on Windows through w64devkit, and while it worked and offloaded to the GPU, it is very slow. I used the llama-7b model; it takes more than a few seconds to generate one token.
My device: RTX 2060.