qwen.cpp
Plan to Support CUDA device
Do you have plans to support Qwen inference on CUDA devices? It seems too slow on a Mac M1.
Working on it!
Awesome!
@simonJJJ Hi. For Nvidia GPU, does the CUDA backend support Nvidia GPUs with compute capability 6.0, e.g., P100?
I haven't tested on the P100. If ggml supports the P100, it will work.
@simonJJJ Thanks. llama.cpp with llama2-7B works on the P100. Does the server from llama.cpp work with qwen.cpp? https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
@simonJJJ Could you please give some advice for this issue?
cmake -B build_cublas -DGGML_CUBLAS=ON && cmake --build build_cublas -j
[100%] Building CXX object CMakeFiles/qwen.dir/qwen.cpp.o
In file included from /workspace/llm_serve/qwen.cpp/qwen.h:3,
from /workspace/llm_serve/qwen.cpp/qwen.cpp:1:
/workspace/llm_serve/qwen.cpp/tiktoken.h: In lambda function:
/workspace/llm_serve/qwen.cpp/tiktoken.h:33:42: warning: comparison of integer expressions of different signedness: 'int' and 'std::vector<std::pair<int, int> >::size_type' {aka 'long unsigned int'} [-Wsign-compare]
33 | if (start_idx + skip + 2 < parts.size()) {
| ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/workspace/llm_serve/qwen.cpp/qwen.cpp: In member function 'ggml_tensor* qwen::QwenAttention::forward(qwen::ModelContext*, ggml_tensor*, ggml_tensor*, int) const':
/workspace/llm_serve/qwen.cpp/qwen.cpp:368:76: error: invalid conversion from 'ggml_tensor*' to 'int' [-fpermissive]
368 | query_layer = tensor_assign_buffers(ggml_rope_inplace(gctx, query_layer, KQ_pos, rope_dim, 2, n_ctx));
| ^~~~~~
| |
| ggml_tensor*
In file included from /workspace/llm_serve/qwen.cpp/qwen.h:5,
from /workspace/llm_serve/qwen.cpp/qwen.cpp:1:
/workspace/llm_serve/qwen.cpp/third_party/ggml/include/ggml/ggml.h:1238:35: note: initializing argument 3 of 'ggml_tensor* ggml_rope_inplace(ggml_context*, ggml_tensor*, int, int, int, int)'
1238 | int n_past,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/workspace/llm_serve/qwen.cpp/qwen.cpp:379:72: error: invalid conversion from 'ggml_tensor*' to 'int' [-fpermissive]
379 | key_layer = tensor_assign_buffers(ggml_rope_inplace(gctx, key_layer, KQ_pos, rope_dim, 2, n_ctx));
| ^~~~~~
| |
| ggml_tensor*
In file included from /workspace/llm_serve/qwen.cpp/qwen.h:5,
from /workspace/llm_serve/qwen.cpp/qwen.cpp:1:
/workspace/llm_serve/qwen.cpp/third_party/ggml/include/ggml/ggml.h:1238:35: note: initializing argument 3 of 'ggml_tensor* ggml_rope_inplace(ggml_context*, ggml_tensor*, int, int, int, int)'
1238 | int n_past,
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~
gmake[2]: *** [CMakeFiles/qwen.dir/build.make:76: CMakeFiles/qwen.dir/qwen.cpp.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:762: CMakeFiles/qwen.dir/all] Error 2
gmake: *** [Makefile:136: all] Error 2
Solved by updating submodules.
Yes!
I don't think llama.cpp's server will work with qwen.cpp as-is. I will adapt llama.cpp's server code for qwen.cpp.
@simonJJJ I'm wondering whether we can develop a Triton backend (https://github.com/triton-inference-server/backend) for qwen.cpp. Then qwen.cpp can work with a Triton Inference Server.
@simonJJJ Hi, could you please give some advice for this issue?
Input query length > 4000, gen_config with max_length = 8192 and max_context_length = 5000:
GGML_ASSERT: /workspace/qwen.cpp/third_party/ggml/src/ggml.c:5044: tensor != NULL
ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 1366188544, available 1342177280)
Solved by increasing the values of MEM_SIZE and SCRATCH_SIZE in qwen.h.
Updating GGML would be a better long-term solution for long-context inference.