qwen.cpp
Plan to Support CUDA device
Do you have plans to support Qwen inference on CUDA devices? It seems too slow on a Mac M1.
Working on it!
Awesome!
@simonJJJ Hi. For Nvidia GPU, does the CUDA backend support Nvidia GPUs with compute capability 6.0, e.g., P100?
I haven't tested on the P100. If ggml supports the P100, it will work.
@simonJJJ Thanks. llama.cpp with llama2-7B works on the P100. Does the server from llama.cpp work with qwen.cpp? https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
@simonJJJ Could you please give some advice for this issue?
cmake -B build_cublas -DGGML_CUBLAS=ON && cmake --build build_cublas -j
[100%] Building CXX object CMakeFiles/qwen.dir/qwen.cpp.o
In file included from /workspace/llm_serve/qwen.cpp/qwen.h:3,
from /workspace/llm_serve/qwen.cpp/qwen.cpp:1:
/workspace/llm_serve/qwen.cpp/tiktoken.h: In lambda function:
/workspace/llm_serve/qwen.cpp/tiktoken.h:33:42: warning: comparison of integer expressions of different signedness: 'int' and 'std::vector<std::pair<int, int> >::size_type' {aka 'long unsigned int'} [-Wsign-compare]
33 | if (start_idx + skip + 2 < parts.size()) {
| ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/workspace/llm_serve/qwen.cpp/qwen.cpp: In member function 'ggml_tensor* qwen::QwenAttention::forward(qwen::ModelContext*, ggml_tensor*, ggml_tensor*, int) const':
/workspace/llm_serve/qwen.cpp/qwen.cpp:368:76: error: invalid conversion from 'ggml_tensor*' to 'int' [-fpermissive]
368 | query_layer = tensor_assign_buffers(ggml_rope_inplace(gctx, query_layer, KQ_pos, rope_dim, 2, n_ctx));
| ^~~~~~
| |
| ggml_tensor*
In file included from /workspace/llm_serve/qwen.cpp/qwen.h:5,
from /workspace/llm_serve/qwen.cpp/qwen.cpp:1:
/workspace/llm_serve/qwen.cpp/third_party/ggml/include/ggml/ggml.h:1238:35: note: initializing argument 3 of 'ggml_tensor* ggml_rope_inplace(ggml_context*, ggml_tensor*, int, int, int, int)'
1238 | int n_past,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/workspace/llm_serve/qwen.cpp/qwen.cpp:379:72: error: invalid conversion from 'ggml_tensor*' to 'int' [-fpermissive]
379 | key_layer = tensor_assign_buffers(ggml_rope_inplace(gctx, key_layer, KQ_pos, rope_dim, 2, n_ctx));
| ^~~~~~
| |
| ggml_tensor*
In file included from /workspace/llm_serve/qwen.cpp/qwen.h:5,
from /workspace/llm_serve/qwen.cpp/qwen.cpp:1:
/workspace/llm_serve/qwen.cpp/third_party/ggml/include/ggml/ggml.h:1238:35: note: initializing argument 3 of 'ggml_tensor* ggml_rope_inplace(ggml_context*, ggml_tensor*, int, int, int, int)'
1238 | int n_past,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~
gmake[2]: *** [CMakeFiles/qwen.dir/build.make:76: CMakeFiles/qwen.dir/qwen.cpp.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:762: CMakeFiles/qwen.dir/all] Error 2
gmake: *** [Makefile:136: all] Error 2
Solved by updating submodules.
Yes!
I don't think llama.cpp's server will work with qwen.cpp as-is. I will adapt llama.cpp's server code for qwen.cpp.
@simonJJJ I'm wondering whether we can develop a Triton backend (https://github.com/triton-inference-server/backend) for qwen.cpp. Then qwen.cpp can work with a Triton Inference Server.
@simonJJJ Hi, could you please give some advice for this issue?
Input query length > 4000, gen_config with max_length = 8192 and max_context_length = 5000:
GGML_ASSERT: /workspace/qwen.cpp/third_party/ggml/src/ggml.c:5044: tensor != NULL
ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 1366188544, available 1342177280)
Solved by increasing the values of MEM_SIZE and SCRATCH_SIZE in qwen.h.
Updating GGML would be a better long-term solution for long-context inference.