Can't run glm-4-9b-chat on CUDA 12
When I run glm-4-9b-chat-Q5_K_M.gguf on the CUDA 12 machine, the API server starts successfully. However, as soon as I send a question, the API server crashes.
The command I used to start the API server is as follows:
wasmedge --dir .:. --nn-preload default:GGML:AUTO:glm-4-9b-chat-Q5_K_M.gguf \
llama-api-server.wasm \
--prompt-template glm-4-chat \
--ctx-size 4096 \
--model-name glm-4-9b-chat
Here is the error message:
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [info] llama_core: llama_core::chat in llama-core/src/chat.rs:1315: Get the model metadata.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] llama_core: llama_core::chat in llama-core/src/chat.rs:1349: The model `internlm2_5-7b-chat` does not exist in the chat graphs.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] chat_completions_handler: llama_api_server::backend::ggml in llama-api-server/src/backend/ggml.rs:392: Failed to get chat completions. Reason: The model `internlm2_5-7b-chat` does not exist in the chat graphs.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] response: llama_api_server::error in llama-api-server/src/error.rs:25: 500 Internal Server Error: Failed to get chat completions. Reason: The model `internlm2_5-7b-chat` does not exist in the chat graphs.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [info] chat_completions_handler: llama_api_server::backend::ggml in llama-api-server/src/backend/ggml.rs:399: Send the chat completion response.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] response: llama_api_server in llama-api-server/src/main.rs:518: version: HTTP/1.1, body_size: 133, status: 500, is_informational: false, is_success: false, is_redirection: false, is_client_error: false, is_server_error: true
Versions
[2024-07-11 07:47:51.368] [wasi_logging_stdout] [info] server_config: llama_api_server in llama-api-server/src/main.rs:131: server version: 0.12.3
According to the log message below, the model name in the request was incorrectly set to `internlm2_5-7b-chat`. The correct name is `glm-4-9b-chat`, which was passed to the `--model-name` option in the start-up command.
[2024-07-11 07:50:12.036] [wasi_logging_stdout] [error] llama_core: llama_core::chat in llama-core/src/chat.rs:1349: The model `internlm2_5-7b-chat` does not exist in the chat graphs.
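One quick way to confirm which model id the server actually registered (assuming this version of llama-api-server exposes the OpenAI-compatible /v1/models endpoint, so treat this as a sketch) is to query it directly:
curl http://localhost:8080/v1/models
The id returned there must match the model field the client sends; if the chatbot UI is sending internlm2_5-7b-chat, every chat completion request will fail with the error above.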
To rule out a problem in the chatbot UI, I sent an API request to the model directly:
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "Plan me a two day trip in Paris"}], "model":"glm-4-9b-chat"}'
curl: (52) Empty reply from server
The log is as follows:
[2024-07-11 09:59:04.506] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: CUDA0 compute buffer size = 304.00 MiB
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: graph nodes = 1606
[2024-07-11 09:59:04.518] [info] [WASI-NN] llama.cpp: llama_new_context_with_model: graph splits = 2
[2024-07-11 09:59:04.743] [error] [WASI-NN] llama.cpp: CUDA error: an illegal memory access was encountered
[2024-07-11 09:59:04.743] [error] [WASI-NN] llama.cpp: current device: 0, in function launch_mul_mat_q at /__w/wasi-nn-ggml-plugin/wasi-nn-ggml-plugin/build/_deps/llama-src/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2602
[2024-07-11 09:59:04.743] [error] [WASI-NN] llama.cpp: cudaFuncSetAttribute(mul_mat_q<type, mmq_x, 8, false>, cudaFuncAttributeMaxDynamicSharedMemorySize, shmem)
GGML_ASSERT: /__w/wasi-nn-ggml-plugin/wasi-nn-ggml-plugin/build/_deps/llama-src/ggml/src/ggml-cuda.cu:101: !"CUDA error"
Aborted (core dumped)
A CUDA error was triggered; it looks like an illegal memory access in the CUDA backend. @hydai Could you please help with this issue? Thanks!
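As a diagnostic only (not a fix), restarting the server with CUDA launch blocking enabled might pinpoint the failing kernel; this assumes the environment variable reaches the WASI-NN ggml plugin, which runs in the same process as wasmedge:
CUDA_LAUNCH_BLOCKING=1 wasmedge --dir .:. --nn-preload default:GGML:AUTO:glm-4-9b-chat-Q5_K_M.gguf \
llama-api-server.wasm \
--prompt-template glm-4-chat \
--ctx-size 4096 \
--model-name glm-4-9b-chat
With asynchronous kernel launches disabled, the illegal memory access should be reported at the kernel that actually faulted rather than at a later synchronization point.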
Is this model supported by llama.cpp?
Yes.
Could you please try running this model with llama.cpp built with CUDA enabled? Since this is an internal error in the llama.cpp CUDA backend, I would like to know whether it is an upstream issue.
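A minimal sketch of that upstream check, assuming a recent llama.cpp tree where the CMake flag is GGML_CUDA and the CLI binary is llama-cli (both may differ by version), would be:
# build llama.cpp with the CUDA backend
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# run the same GGUF with all layers offloaded to the GPU and a similar prompt
./build/bin/llama-cli -m glm-4-9b-chat-Q5_K_M.gguf -ngl 99 -c 4096 -p "Plan me a two day trip in Paris"
If the same GGML_ASSERT in ggml-cuda fires there, the crash is an upstream llama.cpp issue rather than something specific to the WasmEdge plugin.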