Can't run glm-4-9b-chat on CUDA 12
When I run glm-4-9b-chat-Q5_K_M.gguf on the CUDA 12 machine, the API server starts successfully. However, as soon as I send a question, the API server crashes.
The command I used to start the API server is as follows:
wasmedge --dir .:. --nn-preload default:GGML:AUTO:glm-4-9b-chat-Q5_K_M.gguf \
llama-api-server.wasm \
--prompt-template glm-4-chat \
--ctx-size 4096 \
--model-name glm-4-9b-chat
Here is the error message:
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [info] llama_core: llama_core::chat in llama-core/src/chat.rs:1315: Get the model metadata.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] llama_core: llama_core::chat in llama-core/src/chat.rs:1349: The model `internlm2_5-7b-chat` does not exist in the chat graphs.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] chat_completions_handler: llama_api_server::backend::ggml in llama-api-server/src/backend/ggml.rs:392: Failed to get chat completions. Reason: The model `internlm2_5-7b-chat` does not exist in the chat graphs.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] response: llama_api_server::error in llama-api-server/src/error.rs:25: 500 Internal Server Error: Failed to get chat completions. Reason: The model `internlm2_5-7b-chat` does not exist in the chat graphs.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [info] chat_completions_handler: llama_api_server::backend::ggml in llama-api-server/src/backend/ggml.rs:399: Send the chat completion response.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] response: llama_api_server in llama-api-server/src/main.rs:518: version: HTTP/1.1, body_size: 133, status: 500, is_informational: false, is_success: false, is_redirection: false, is_client_error: false, is_server_error: true
Versions
[2024-07-11 07:47:51.368] [wasi_logging_stdout] [info] server_config: llama_api_server in llama-api-server/src/main.rs:131: server version: 0.12.3
According to the log message below, the model name in the request was incorrectly set to `internlm2_5-7b-chat`. The correct name is `glm-4-9b-chat`, which was passed to the `--model-name` option in the start-up command.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] llama_core: llama_core::chat in llama-core/src/chat.rs:1349: The model `internlm2_5-7b-chat` does not exist in the chat graphs.
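One quick way to confirm which model id the server actually registered (assuming this version of llama-api-server exposes the OpenAI-compatible /v1/models endpoint, so treat this as a sketch) is to query it directly:
curl http://localhost:8080/v1/models
The id returned there must match the model field the client sends; if the chatbot UI is sending internlm2_5-7b-chat, every chat completion request will fail with the error above.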
To rule out a problem in the chatbot UI, I sent an API request to the model directly:
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "Plan me a two day trip in Paris"}], "model":"glm-4-9b-chat"}'
curl: (52) Empty reply from server
The log is as follows:
[2024-07-11 09:59:04.506] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: CUDA0 compute buffer size = 304.00 MiB
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: graph nodes = 1606
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: graph splits = 2
[2024-07-11 09:59:04.743] [error] [WASI-NN] llama.cpp: CUDA error: an illegal memory access was encountered
[2024-07-11 09:59:04.743] [error] [WASI-NN] llama.cpp: current device: 0, in function launch_mul_mat_q at /__w/wasi-nn-ggml-plugin/wasi-nn-ggml-plugin/build/_deps/llama-src/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2602
[2024-07-11 09:59:04.743] [error] [WASI-NN] llama.cpp: cudaFuncSetAttribute(mul_mat_q<type, mmq_x, 8, false>, cudaFuncAttributeMaxDynamicSharedMemorySize, shmem)
GGML_ASSERT: /__w/wasi-nn-ggml-plugin/wasi-nn-ggml-plugin/build/_deps/llama-src/ggml/src/ggml-cuda.cu:101: !"CUDA error"
Aborted (core dumped)
A CUDA error was triggered; it looks like an illegal memory access in the CUDA backend. @hydai Could you please help with this issue? Thanks!
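As a diagnostic only (not a fix), restarting the server with CUDA launch blocking enabled might pinpoint the failing kernel; this assumes the environment variable reaches the WASI-NN ggml plugin, which runs in the same process as wasmedge:
CUDA_LAUNCH_BLOCKING=1 wasmedge --dir .:. --nn-preload default:GGML:AUTO:glm-4-9b-chat-Q5_K_M.gguf \
llama-api-server.wasm \
--prompt-template glm-4-chat \
--ctx-size 4096 \
--model-name glm-4-9b-chat
With asynchronous kernel launches disabled, the illegal memory access should be reported at the kernel that actually faulted rather than at a later synchronization point.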
Is this model supported by llama.cpp?
Yes.
Could you please try running this model with llama.cpp built with CUDA enabled? Since this is an internal error in the llama.cpp CUDA backend, I would like to know whether it is an upstream issue.
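A minimal sketch of that upstream check, assuming a recent llama.cpp tree where the CMake flag is GGML_CUDA and the CLI binary is llama-cli (both may differ by version), would be:
# build llama.cpp with the CUDA backend
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# run the same GGUF with all layers offloaded to the GPU and a similar prompt
./build/bin/llama-cli -m glm-4-9b-chat-Q5_K_M.gguf -ngl 99 -c 4096 -p "Plan me a two day trip in Paris"
If the same GGML_ASSERT in ggml-cuda fires there, the crash is an upstream llama.cpp issue rather than something specific to the WasmEdge plugin.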