
higher memory usage than llama-cpp

Open mahmoodsh36 opened this issue 8 months ago • 0 comments

Describe the bug

I can run QwQ-32B with a 2^14 context length using llama-cpp on my 4090 just fine, with memory usage nearly capped out at 23.5/24 GB. But an equivalent command does not work when I try it with mistral-rs on the same GGUF. The command I'm using for llama-server:

llama-server --host 0.0.0.0 --port 5000 -m final-Qwen--QwQ-32B.gguf --n-gpu-layers 100 --flash-attn -c $((2 ** 14)) --jinja

This is part of the output I get when running llama-server:

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 64 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors:        CUDA0 model buffer size = 18508.35 MiB
load_tensors:   CPU_Mapped model buffer size =   417.66 MiB
................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
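
For reference, here is my rough back-of-the-envelope estimate (just my own arithmetic, not tool output) of where the 24 GB goes in this llama-server run, using the attention layout from the GGUF metadata shown further down (64 layers, 8 KV heads, head dim 5120/40 = 128) and an f16 KV cache:

# rough memory estimate for the llama-server run above (my own arithmetic)
layers, kv_heads, head_dim = 64, 8, 5120 // 40       # from the GGUF metadata printed below
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K + V, f16
ctx = 2 ** 14
kv_gib = ctx * kv_bytes_per_token / 2 ** 30
weights_gib = 18508.35 / 1024                        # CUDA0 model buffer size from the log
print(kv_bytes_per_token // 1024, "KiB per token")   # 256 KiB/token
print(kv_gib, "GiB KV cache")                        # 4.0 GiB at 16384 tokens
print(round(weights_gib + kv_gib, 1), "GiB weights + KV")   # ~22.1 GiB

That puts the weights plus KV cache at roughly 22 GiB, leaving about 1.5 GiB for compute buffers and CUDA overhead, which lines up with the 23.5/24 GB usage I see.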

However, when I try the following command with mistral.rs:

mistralrs-server --num-device-layers 100 --port 5000 gguf -m . -f final-Qwen--QwQ-32B.gguf --dtype f16 --max-seq-len $((2 ** 13))

I get:

192.168.1.2 models λ mistralrs-server --num-device-layers 100 --port 5000 gguf -m . -f final-Qwen--QwQ-32B.gguf --dtype f16 --max-seq-len $((2 ** 13))
2025-05-03T17:35:40.296575Z  INFO mistralrs_server: avx: false, neon: false, simd128: false, f16c: false
2025-05-03T17:35:40.296594Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-05-03T17:35:40.296601Z  INFO mistralrs_server: Model kind is: gguf quantized from gguf (no adapters)
2025-05-03T17:35:40.296623Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/mahmooz/.cache/huggingface/token", using no HF token.
2025-05-03T17:35:40.296656Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/mahmooz/.cache/huggingface/token", using no HF token.
2025-05-03T17:35:40.296664Z  INFO mistralrs_core::pipeline::paths: Loading `final-Qwen--QwQ-32B.gguf` locally at `./final-Qwen--QwQ-32B.gguf`
2025-05-03T17:35:40.296778Z  INFO mistralrs_core::pipeline::gguf: Prompt chunk size is 1024.
2025-05-03T17:35:40.436643Z  INFO mistralrs_core::gguf::content: Model config:
general.architecture: qwen2
general.base_model.0.name: Qwen2.5 32B
general.base_model.0.organization: Qwen
general.base_model.0.repo_url: https://huggingface.co/Qwen/Qwen2.5-32B
general.base_model.count: 1
general.file_type: 15
general.finetune: 976055f8c83f394f35dbd3ab09a285a984907bd0
general.languages: en
general.license: apache-2.0
general.license.link: https://huggingface.co/Qwen/QWQ-32B/blob/main/LICENSE
general.name: 976055f8c83f394f35dbd3ab09a285a984907bd0
general.quantization_version: 2
general.size_label: 33B
general.tags: chat, text-generation
general.type: model
qwen2.attention.head_count: 40
qwen2.attention.head_count_kv: 8
qwen2.attention.layer_norm_rms_epsilon: 0.00001
qwen2.block_count: 64
qwen2.context_length: 40960
qwen2.embedding_length: 5120
qwen2.feed_forward_length: 27648
qwen2.rope.freq_base: 1000000
2025-05-03T17:35:40.436690Z  INFO mistralrs_core::utils::log: Model has 64 repeating layers.
2025-05-03T17:35:40.436699Z  INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
2025-05-03T17:35:40.436741Z  INFO mistralrs_core::utils::log: Layers 0-63: cuda[0] (24 GB)
2025-05-03T17:35:40.522283Z  INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 152064, num added tokens: 0, num merges: 151387, num scores: 0
2025-05-03T17:35:40.526442Z  INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{%- if tools %}\n    {{- '<|im_start|>system\n' }}\n    {%- if messages[0]['role'] == 'system' %}\n        {{- messages[0]['content'] }}\n    {%- else %}\n        {{- '' }}\n    {%- endif %}\n    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}\n    {%- for tool in tools %}\n        {{- "\n" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}\n{%- else %}\n    {%- if messages[0]['role'] == 'system' %}\n        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}\n  {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}\n        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}\n    {%- elif message.role == "assistant" and not message.tool_calls %}\n        {%- set content = message.content %}\n        {%- if not loop.last %}\n            {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}\n        {%- endif %}\n        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}\n    {%- elif message.role == "assistant" %}\n        {%- set content = message.content %}\n        {%- if not loop.last %}\n            {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}\n        {%- endif %}\n        {{- '<|im_start|>' + message.role }}\n        {%- if message.content %}\n            {{- '\n' + content }}\n        {%- endif %}\n        {%- for tool_call in message.tool_calls %}\n            {%- if tool_call.function is defined %}\n                {%- set tool_call = tool_call.function %}\n            {%- endif %}\n            {{- '\n<tool_call>\n{"name": "' }}\n{{- tool_call.name }}\n            {{- '", "arguments": ' }}\n            {{- tool_call.arguments | tojson }}\n            {{- '}\n</tool_call>' }}\n        {%- endfor %}\n        {{- '<|im_end|>\n' }}\n    {%- elif message.role == "tool" %}\n        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}\n            {{- '<|im_start|>user' }}\n        {%- endif %}\n        {{- '\n<tool_response>\n' }}\n        {{- message.content }}\n        {{- '\n</tool_response>' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n            {{- '<|im_end|>\n' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\n<think>\n' }}\n{%- endif %}\n`
2025-05-03T17:35:40.526455Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2025-05-03T17:35:47.999300Z  INFO mistralrs_core::paged_attention: Allocating 2048 MB for PagedAttention KV cache per GPU
2025-05-03T17:35:47.999311Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 256 GPU blocks: available context length is 8192 tokens
Error: DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
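
For what it's worth, the 2048 MB PagedAttention allocation above matches the same ~256 KiB/token figure from my earlier estimate (again just my own arithmetic):

# 256 GPU blocks * 32 tokens/block * 256 KiB/token, matching the "Allocating 2048 MB" line above
print(256 * 32 * 256 / 1024, "MiB")   # 2048.0 MiB

So the KV cache itself is only about 2 GiB here, and with roughly 18 GiB of weights that should fit in 24 GB, which is why I suspect the extra usage is coming from somewhere else.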

Notice that I purposely chose a lower context size here (2^13 instead of 2^14). There could be several reasons for the difference:

  1. The default dtype in llama-cpp may be different. (Here I'm forced to use f16, because otherwise I get Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found") when loading cast_f32_bf16.)
  2. mistral-rs is not using flash attention, which I do enable for llama-server. But I made sure to compile mistral-rs with the flash-attn feature enabled. (This seems like the most likely reason, since AFAIK flash attention is a huge efficiency improvement; see the rough estimate after this list.)
  3. Something else I'm unaware of.
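
To put a rough number on reason 2: as far as I understand it, a non-flash attention path materializes the full heads x seq x seq score matrix, while flash attention never does. This is only my assumption about what mistral-rs does internally, but the cost of that matrix alone would be significant:

# attention score matrix if materialized in full: heads * seq * seq * 2 bytes (f16)
heads, seq = 40, 2 ** 13
print(heads * seq * seq * 2 / 2 ** 30, "GiB")   # 5.0 GiB for one full-length sequence

Even with prompt chunking (chunk size 1024 per the log above), the last chunk of an 8192-token context would still need heads * 1024 * 8192 * 2 bytes, i.e. about 640 MiB, on top of the weights and KV cache.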

FWIW, I also tried running from the safetensors directly with in-flight ISQ quantization, using the following command:

mistralrs-server --port 5000 --num-device-layers 100 --isq q4k plain --model-id Qwen/Qwen3-32B --dtype f16 --max-seq-len $((2 ** 14))

but I also get out-of-memory errors, even though Qwen3-32B also runs just fine on llama-server with that context length (basically any 32B model at a 4-bit quant runs fine for me with llama-cpp).
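
For scale, assuming Q4_K-style quants average out to roughly 4.5 bits per weight (my understanding of llama.cpp's K-quants; I haven't verified how mistral-rs's ISQ q4k compares), the weights of a ~33B-parameter model land in the same ballpark as the 18.5 GiB model buffer llama-server reports:

# rough weight footprint at an assumed ~4.5 bits/weight
params = 32.8e9
print(round(params * 4.5 / 8 / 2 ** 30, 1), "GiB")   # ~17.2 GiB

so the quantized weights themselves should fit comfortably, and the extra usage presumably comes from runtime buffers.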

Latest commit or version

Using revision (commit) a63da3c03f52db350a04b15a1c8775dcb8d5033f (very recent, built from master).

Thank you.

mahmoodsh36 · May 03 '25 17:05