
Feature Request: support Intel AMX for further acceleration

Open zhaoyukoon opened this issue 7 months ago • 102 comments

Prerequisites

  • [x] I am running the latest code. Mention the version if possible as well.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

I learned from ktransformers-Intel-AMX that the AMX instructions can further improve inference speed for MoE models.

Is there any plan to support AMX in ik_llama.cpp? Thanks!

Motivation

The KTransformers kernel can achieve 21 TFLOPS of BF16 throughput and 35 TOPS of Int8 throughput on 4th-generation Xeon CPUs, about 4× faster than PyTorch's general AMX kernel. For DeepSeek-V3, pairing a 4th-generation Xeon CPU with a single RTX 4090 GPU achieves 418 tokens/s end-to-end throughput, close to the performance of multi-machine, multi-GPU setups. KTransformers' AMX kernel is the first AMX kernel specifically designed for MoE inference scenarios, significantly lowering the hardware barrier for large-model deployment and enabling more developers to enjoy GPU-cluster-level inference at lower cost.

Possible Implementation

No response

zhaoyukoon avatar May 20 '25 08:05 zhaoyukoon

If someone gives me access to a system with AMX support, then sure, I would work on that.

But out of curiosity, do you have a performance comparison between ik_llama.cpp and KTransformers on the same system?

ikawrakow avatar May 20 '25 09:05 ikawrakow

If someone gives me access to a system with AMX support, then sure, I would work on that.

But out of curiosity, do you have a performance comparison between ik_llama.cpp and KTransformers on the same system?

I can access a server equipped with AMX-capable Intel CPUs; however, I have no permission to add other users. I can help run tests on this server.

I tested KTransformers on another AMD server with a 24GB 4090D, which gets 15+ tokens/s decoding speed. I have not tested ik_llama.cpp yet; I learned that llama.cpp can get 7 tokens/s on pure CPU.

https://github.com/XuanwuLab/llama.cpp_deepseek/blob/main/llama-mmap.cpp

https://mp.weixin.qq.com/s/vIrvbVJ6Nv00Ehre1zZwMw [In Chinese]

zhaoyukoon avatar May 20 '25 10:05 zhaoyukoon

I cannot say that I'm particularly impressed with the performance reported in ktransformers-Intel-AMX. For convenience here is what they report:

[Image: benchmark table reported in ktransformers-Intel-AMX]

My system is a Ryzen-7950X CPU + 4080 GPU. Based on benchmarks from here and here, my CPU is only marginally faster than their "consumer" level system with an Intel-14900KF + 4090 GPU. I don't have enough RAM to run Qwen3-235B-A22B, but here is what I get for Qwen3-30B-A3B quantized with IQ4_XS (so it corresponds to their 4-bit result) with ik_llama.cpp:

CPU only

| model | size | backend | test | t/s |
| --- | --- | --- | --- | --- |
| qwen3moe 30B IQ4_XS | 15.24 GiB | CPU | pp512 | 480.78 ± 2.11 |
| qwen3moe 30B IQ4_XS | 15.24 GiB | CPU | tg128 | 29.17 ± 0.08 |

Here pp512 corresponds to what they call "prefill" and tg128 is what they call "decode". So, even without a GPU, ik_llama.cpp beats their prefill performance by 2X, and is faster than their "4-way decode" performance on the "consumer" level system that has roughly the same speed as mine.

CPU+GPU

Here speed depends on how many layers I offload to the GPU. But let's keep 18 layers on the CPU so I have enough VRAM for the maximum context of 41,000 tokens on my paltry 16 GB GPU. Here is what I get with that:

| model | size | backend | test | t/s |
| --- | --- | --- | --- | --- |
| qwen3moe 30B IQ4_XS | 15.24 GiB | CUDA | pp2048 | 3039.84 ± 24.96 |
| qwen3moe 30B IQ4_XS | 15.24 GiB | CUDA | tg128 | 77.44 ± 0.39 |

So, 15X their prefill performance and 3X their "4-way decode" performance ("consumer level" system), and 8.7X prefill, 1.5X "4-way decode" (Xeon 4 workstation).

ikawrakow avatar May 20 '25 10:05 ikawrakow

I can access a server equipped with AMX-capable Intel CPUs; however, I have no permission to add other users. I can help run tests on this server.

This will be way too tedious. I have to build with AMX instructions enabled, then you test and find gibberish, then I second-guess where the bug is, change something, you test and find gibberish, rinse and repeat. I have to have access to an AMX-enabled system while writing the code.

ikawrakow avatar May 20 '25 10:05 ikawrakow

I can access a server equipped with AMX-capable Intel CPUs; however, I have no permission to add other users. I can help run tests on this server.

This will be way too tedious. I have to build with AMX instructions enabled, then you test and find gibberish, then I second-guess where the bug is, change something, you test and find gibberish, rinse and repeat. I have to have access to an AMX-enabled system while writing the code.

Do you have any requirements on CPU and memory for development? Is a server with 16 AMX-capable vCPUs and 32 GB of RAM enough?

zhaoyukoon avatar May 20 '25 11:05 zhaoyukoon

Do you have any requirements on CPU and memory for development? Is a server with 16 AMX-capable vCPUs and 32 GB of RAM enough?

Yes, that should be enough for development.

But before you go and rent a cloud instance, let's start by you first testing ik_llama.cpp on your system and comparing performance to KTransformers.
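For example, a minimal sketch of such a comparison (the model path and thread count are placeholders; the lscpu check just confirms the CPU exposes AMX, and the llama-bench flags are the same ones used later in this thread):

# confirm the CPU advertises AMX (Linux)
lscpu | grep -o 'amx_[a-z0-9]*' | sort -u

# CPU-only prompt-processing / token-generation benchmark with ik_llama.cpp
./build/bin/llama-bench -m /path/to/model.gguf -t 16 -fa 1 -rtr 1 -p 512 -n 128

Running the same model and the same prompt/generation sizes through KTransformers then gives a like-for-like number.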

Let's also make sure that the expectations are aligned:

  • It is extremely unlikely AMX will improve token generation (TG) speed
  • It is very unlikely AMX will improve prefill speed for hybrid CPU/GPU inference for most models. Only the LLaMA-4 models may get faster
  • AMX will improve prefill performance for CPU-only inference compared to vanilla AVX2 implementations such as what you have in llama.cpp or KTransformers. Whether it will improve performance compared to the existing ik_llama.cpp implementation remains to be seen.

ikawrakow avatar May 20 '25 11:05 ikawrakow

While I'd be excited to see AMX support, I can't say the KTransformers Qwen3 benchmark proves its usefulness. I can't verify the pp/tg window sizes or the exact model they used, but as an inexact comparison, I got the results below in ik_llama.cpp for Qwen3 235B with a Xeon 8480 (ES), 8-channel 4800 MT/s DDR5, and a Blackwell GPU.

Model used: unsloth/Qwen3-235B-A22B-GGUF/UD-Q4_K_XL/Qwen3-235B-A22B-UD-Q4_K_XL

| size | params | backend | ngl | threads | n_batch | n_ubatch | fa | rtr | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 124.91 GiB | 235.09 B | CUDA | 93 | 52 | 8192 | 8192 | 1 | 1 | 1 | pp2048 | 192.02 ± 0.06 |
| 124.91 GiB | 235.09 B | CUDA | 93 | 52 | 8192 | 8192 | 1 | 1 | 1 | pp16384 | 185.33 ± 0.34 |
| 124.91 GiB | 235.09 B | CUDA | 93 | 52 | 8192 | 8192 | 1 | 1 | 1 | tg512 | 18.74 ± 0.02 |
| 124.91 GiB | 235.09 B | CUDA | 93 | 52 | 8192 | 8192 | 1 | 1 | 1 | tg2048 | 18.58 ± 0.03 |

The 30B model performs really well on CPU only; below is with the GPU hidden.

Model used: unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL

| size | params | backend | ngl | threads | fa | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 16.49 GiB | 30.53 B | CUDA | 0 | 32 | 1 | 1 | pp512 | 510.65 ± 2.49 |
| 16.49 GiB | 30.53 B | CUDA | 0 | 32 | 1 | 1 | pp2048 | 454.62 ± 0.18 |
| 16.49 GiB | 30.53 B | CUDA | 0 | 32 | 1 | 1 | tg128 | 69.77 ± 0.02 |
| 16.49 GiB | 30.53 B | CUDA | 0 | 32 | 1 | 1 | tg512 | 69.15 ± 0.01 |

Thanks a lot for the impressive work ikawrakow!

kirnat avatar May 20 '25 14:05 kirnat

Has anyone tried the mainline llama.cpp AMX implementation?

ikawrakow avatar May 20 '25 15:05 ikawrakow

Has anyone tried the mainline llama.cpp AMX implementation?

https://github.com/ggml-org/llama.cpp/issues/12003

It seems that llama.cpp supports AMX.

zhaoyukoon avatar May 20 '25 16:05 zhaoyukoon

It seems that llama.cpp supports AMX.

That's why I asked if somebody has tried. It would be even more interesting if someone has compared llama.cpp performance to ik_llama.cpp on an AMX CPU.

ikawrakow avatar May 20 '25 16:05 ikawrakow

Confirming AMX buffer

llama.cpp/build/bin/llama-cli -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_Mapped model buffer size = 4685.30 MiB
load_tensors: AMX model buffer size = 4491.48 MiB
........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.49 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
llama_kv_cache_unified: CPU KV buffer size = 512.00 MiB
llama_kv_cache_unified: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: CPU compute buffer size = 296.01 MiB

llama.cpp bench

llama.cpp/build/bin/llama-bench -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 52 -fa 1

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 52 | 1 | pp512 | 228.18 ± 0.03 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CPU | 52 | 1 | tg128 | 37.28 ± 0.01 |

build: e3a9421b (5389)

ik_llama bench

ik_llama.cpp/build/bin/llama-bench -ngl 0 -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 52 -fa 1

ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected

| model | size | params | backend | ngl | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 0 | 52 | 1 | pp512 | 348.00 ± 0.43 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 0 | 52 | 1 | tg128 | 42.48 ± 0.03 |

build: 2ec2229f (3702)


Let me know if you want me to test another model or specific settings. I used a high thread count since it helps prompt processing while penalizing token generation slightly, but not too much in this case.

kirnat avatar May 20 '25 19:05 kirnat

Thanks!

You could try adding -rtr 1 to the ik_llama.cpp benchmark run. This normally gives a significant boost in PP performance.
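For example, the same CPU-only run as above with run-time repacking added:

ik_llama.cpp/build/bin/llama-bench -ngl 0 -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 52 -fa 1 -rtr 1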

ikawrakow avatar May 20 '25 19:05 ikawrakow

I hadn't even considered it for CPU-only inference. I have used it a lot day to day for hybrid inference with great results.

Same settings as above with the GPU hidden, but with rtr enabled.

| model | size | params | backend | ngl | threads | fa | rtr | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 0 | 52 | 1 | 1 | pp512 | 444.89 ± 0.96 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 0 | 52 | 1 | 1 | pp16384 | 267.07 ± 3.60 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 0 | 52 | 1 | 1 | tg128 | 43.21 ± 0.03 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 0 | 52 | 1 | 1 | tg2048 | 41.39 ± 0.01 |

Still amazed how relatively small the slowdown is in ik_llama.cpp at larger context sizes. This translates to Qwen3, DeepSeek V3, and Llama 4 Maverick as well.

kirnat avatar May 20 '25 20:05 kirnat

So, ik_llama.cpp without AMX is nearly two times faster than llama.cpp with AMX.

ikawrakow avatar May 21 '25 04:05 ikawrakow

Specifically for the new R1-0528 (but results are similar for V3-0324):

I have an AMX-supported PC, and I can confirm that prompt-processing performance for KTransformers is noticeably better than ik_llama.cpp and llama.cpp (in that order). In general I get about 50 t/s prefill (prompt processing) on KTransformers and 10 t/s on generation. It is prompt processing that benefits massively on KTransformers; I get less than half that prompt-processing speed on ik_llama.cpp. Token generation is comparable (but KTransformers has about a 10% advantage).

With KTransformers, I can only fit a 24K context length on a single 4090 on my PC (512 GB DDR5 RAM, 8-channel, 4800 MT/s), whereas I can fit a 32K context length with similar quants on ik_llama.cpp.

Another difference with KTransformers is that I can make a Q4_K_M-FP8 hybrid model, and all of the FP8 processing is done on the GPU. Apparently they have some special kernel that speeds up FP8 processing on the GPU.

I have been following @ubergarm's quants and guide to run it on ik_llama.cpp.

I love your work here @ikawrakow and I would love to contribute in any way to make this project better than KTransformers! If you need me to run anything, please let me know!

mtcl avatar Jun 08 '25 06:06 mtcl

I have an AMX-supported PC, and I can confirm that prompt-processing performance for KTransformers is noticeably better than ik_llama.cpp and llama.cpp (in that order). In general I get about 50 t/s prefill (prompt processing) on KTransformers and 10 t/s on generation. It is prompt processing that benefits massively on KTransformers; I get less than half that prompt-processing speed on ik_llama.cpp. Token generation is comparable (but KTransformers has about a 10% advantage).

If you share the ik_llama.cpp command line you used to measure performance, perhaps we can help you make it faster. You didn't share the specs of your CPU and GPU, but 25 t/s prefill given 9 t/s generation does not sound reasonable for ik_llama.cpp.

ikawrakow avatar Jun 08 '25 06:06 ikawrakow

@mtcl

Hey, thanks again for your YouTube video showing your OpenAI-compatible wrapper working with ik_llama.cpp and one of my quants! Very cool!

It is the prompt processing that has massive benefits on ktransformers.

ik has shown me, and others have reported, success increasing prompt processing (prefill in KTransformers terms) by increasing the batch size, e.g. -b 4096 -ub 4096, assuming you free up enough VRAM by using -ctk q8_0, lowering the context a bit, etc. You might have to play with the exact numbers to balance the speed/VRAM-usage tradeoff, and it might not work on all setups. I can hit over 200 tok/sec prompt processing with some of my R1-0528 quants using this.
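As a rough sketch, the change amounts to adding something like the following flags to the llama-server invocation (the values are examples and need tuning against your available VRAM):

-b 4096 -ub 4096    # larger batch / micro-batch sizes for faster prompt processing
-ctk q8_0           # quantized K cache to claw back VRAM for the bigger compute buffers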

Another difference with KTransformers is that I can make a Q4_K_M-FP8 hybrid model, and all of the FP8 processing is done on the GPU.

They do have some goofy hybrid quants that use fp8 for the GPU-offloaded layers, which requires a 40-series or newer CUDA GPU that supports fp8 E4M3. But the 3090 and older only support fp8 E5M2, so those KTransformers kernels are not widely applicable. My quants use high-quality iq4_ks, iq5_ks, or full q8_0 for those tensors, which will likely be better quality and more performant across a wider variety of systems.

Finally, last I checked, KTransformers performance tanked when attempting to offload additional layers onto the GPU, given how they were relying on CUDA graphs. So after fiddling with those confusing YAML files, performance was worse when using more VRAM... This was a couple of months ago, so YMMV. The multi-GPU story on ik therefore seems much better IMO, unless things have changed radically over on KTransformers.

But as ik says, share your commands and we might be able to get you a boost.

ubergarm avatar Jun 08 '25 15:06 ubergarm

@ubergarm and @ikawrakow Below is for the Qwen3 235-billion-parameter model. Thank you for the pointers! For the Qwen models, I added "-b 2048 -ub 2048" and that resulted in the max speeds for me. I am getting 150+ prompt-processing tokens/second on that now! That is insane!

This was my original command:

CUDA_VISIBLE_DEVICES="1" ./build/bin/llama-server \
  --model /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
  --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
  -fa \
  -ctk q4_0 -ctv q4_0 \
  -c 32768 \
  -fmoe \
  -amb 512 \
  -rtr \
  -ot blk\.1[2-9]\.ffn.*=CPU \
  -ot blk\.[2-8][0-9]\.ffn.*=CPU \
  -ot blk\.9[0-3]\.ffn.*=CPU \
  -ngl 99 \
  --threads 57 \
  --host 0.0.0.0 \
  --port 10002

I was getting very low prompt processing with this, under 50. After your recommendation, I switched the command to this:

CUDA_VISIBLE_DEVICES="1" ./build/bin/llama-server \
  --model /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
  --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
  -fa \
  -ctk q4_0 -ctv q4_0 \
  -c 32768 \
  -fmoe \
  -b 2048 -ub 2048 \
  -amb 512 \
  -rtr \
  -ot blk\.1[2-9]\.ffn.*=CPU \
  -ot blk\.[2-8][0-9]\.ffn.*=CPU \
  -ot blk\.9[0-3]\.ffn.*=CPU \
  -ngl 99 \
  --threads 57 \
  --host 0.0.0.0 \
  --port 10002

DeepSeek-R1-0528

You see how I have "-ctk q4_0" in there, right? It works for the Qwen model but not for the DeepSeek R1 model.

This is my command before:

CUDA_VISIBLE_DEVICES="1" ./build/bin/llama-server \
    --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
    --alias ubergarm/DeepSeek-R1-0528-GGUF \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 57 \
    --host 0.0.0.0 \
    --port 10002

This is my command after:

CUDA_VISIBLE_DEVICES="1" ./build/bin/llama-server \
    --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
    --alias ubergarm/DeepSeek-R1-0528-GGUF \
    --ctx-size 32768 \
    -ctk q4_0 \
    -mla 3 -fa \
    -b 2048 -ub 2048 \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 57 \
    --host 0.0.0.0 \
    --port 10002

If I try to switch the ctk from q8_0 to q4_0, it crashes with the error below:

INFO [   launch_slot_with_task] slot is processing task | tid="135217032957952" timestamp=1749426210 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="135217032957952" timestamp=1749426210 id_slot=0 id_task=0 p0=0
ggml_cuda_cpy_fn: unsupported type combination (q4_0 to f16)
ggml_cuda_cpy_fn: 64 x 2048 x 1; 324 x 10616832 10616832 -> 64 x 2048 x 1; 128 x 262144 x 262144
/home/mukul/dev-ai/ik_llama.cpp/ggml/src/ggml-cuda/cpy.cu:718: fatal error



Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)

But if I keep the ctk as q8_0, I can have a context size of 24K with about 45 t/s prompt-processing speed, which is comparable to KTransformers.

CUDA_VISIBLE_DEVICES="1" ./build/bin/llama-server \
    --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
    --alias ubergarm/DeepSeek-R1-0528-GGUF \
    --ctx-size 24576 \
    -ctk q8_0 \
    -mla 3 -fa \
    -b 2048 -ub 2048 \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 57 \
    --host 0.0.0.0 \
    --port 10002

I will make a full video on this and will post it, an unedited version, so that you can see everything in the process.

mtcl avatar Jun 09 '25 00:06 mtcl

@ubergarm Would you be able to post a guide on how to make the IQ4 version of the Qwen Model?

mtcl avatar Jun 09 '25 00:06 mtcl

@mtcl

What is the model you are running with KTransformers?

On the "crash": the DeepSeek self-attention mechanism is special (different from basically any other model out there), so only f16 and Q8_0 can be used for the KV cache. But even if it were supported, I would never use Q4_0 for the KV cache, as the quality degradation is just too much for my taste. The lowest I would go (and only if desperate to reduce VRAM usage) would be Q6_0 (but that is not supported for DeepSeek models).
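Concretely, using the flag spellings from the commands earlier in this thread, the K-cache choice for the DeepSeek run boils down to one of:

-ctk f16     # full-precision K cache (the default)
-ctk q8_0    # the quantized option that works with DeepSeek's MLA attention

while -ctk q4_0 is exactly the combination that triggers the ggml_cuda_cpy_fn error shown above.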

ikawrakow avatar Jun 09 '25 04:06 ikawrakow

@mtcl

What is the model you are running with KTransformers?

On the "crash": the DeepSeek self-attention mechanism is special (different from basically any other model out there), so only f16 and Q8_0 can be used for the KV cache. But even if it were supported, I would never use Q4_0 for the KV cache, as the quality degradation is just too much for my taste. The lowest I would go (and only if desperate to reduce VRAM usage) would be Q6_0 (but that is not supported for DeepSeek models).

I am running @ubergarm 's IQ3_K_R4 model located here:

https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_K_R4

mtcl avatar Jun 09 '25 04:06 mtcl

That is with ik_llama.cpp. My question was what model are you running with KTransformers?

ikawrakow avatar Jun 09 '25 04:06 ikawrakow

That is with ik_llama.cpp. My question was what model are you running with KTransformers?

Oh sorry! I understand now. I am running a Q4_K_M-FP8 hybrid model; if you want to see how I create the model, here is the video walkthrough of it: https://www.youtube.com/watch?v=Xui3_bA26LE Essentially, the KTransformers team provides a merge script to create these hybrid models.

Do you know if by using multiple 4090s I can increase the context limit? I am also getting a 5090 tomorrow, so potentially it will help with more context on one GPU.

mtcl avatar Jun 09 '25 04:06 mtcl

Do you know if by using multiple 4090s I can increase the context limit? I am also getting a 5090 tomorrow, so potentially it will help with more context on one GPU.

Yes, some people with multiple GPUs have reported running the full context length. Also, when you have more than 24 GB of VRAM you can use -b 4096 -ub 4096, and that will give another factor of nearly 2 increase in prefill performance. Some people have reported even 200 t/s prefill with DeepSeek-R1/V3. @ubergarm has reported 100+ t/s running CPU-only. I don't have the hardware to run the DeepSeek models, but if I had enough RAM in my Ryzen-7950X box, I would expect to get in the range of 50 t/s CPU-only using just this <$500 CPU (I hit 700 t/s with the 16B-parameter DeepSeek-Lite, which has 15X fewer active parameters than R1/V3).

ikawrakow avatar Jun 09 '25 04:06 ikawrakow

Can you please help me modify this command to get more context length with a 2x4090 setup?

CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
    --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
    --alias ubergarm/DeepSeek-R1-0528-GGUF \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 -fa \
    -b 2048 -ub 2048 \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4)\.ffn_.*=CUDA0" \
    -ot "blk\.(5|6)\.ffn_.*=CUDA1" \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 57 \
    --host 0.0.0.0 \
    --port 10002

mtcl avatar Jun 09 '25 04:06 mtcl

Can you post the log? I don't know by heart how much VRAM gets used for model weights and KV cache, and how big CUDA compute buffers are.

ikawrakow avatar Jun 09 '25 04:06 ikawrakow

I think if you are able to offload two layers of experts per GPU, you have in the range of 11 GB free on each GPU excluding the experts. It is likely that if you don't offload any experts to the GPU, you can a) nearly double the prefill speed by using -b 4096 -ub 4096, b) increase the context length to at least 65k tokens, or c) do both a) and b).
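A hedged sketch of option c), built only from the command posted above (no experts offloaded to either GPU, larger batches, longer context; whether this actually fits still needs to be checked against the VRAM numbers in the server log):

CUDA_VISIBLE_DEVICES="0,1" ./build/bin/llama-server \
    --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
    --alias ubergarm/DeepSeek-R1-0528-GGUF \
    --ctx-size 65536 \
    -ctk q8_0 \
    -mla 3 -fa \
    -b 4096 -ub 4096 \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 57 \
    --host 0.0.0.0 \
    --port 10002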

ikawrakow avatar Jun 09 '25 05:06 ikawrakow

OK, I posted the whole video here, showing every command I ran with all the log outputs.

https://www.youtube.com/watch?v=kDhu0siTvEg

I think if you are able to offload two layers of experts per GPU, you have in the range of 11 GB free on each GPU excluding the experts. It is likely that if you don't offload any experts to the GPU, you can a) nearly double the prefill speed by using -b 4096 -ub 4096, b) increase the context length to at least 65k tokens, or c) do both a) and b).

I am trying to understand how to achieve this. What command can I run to give you the log here? Can you please let me know?

mtcl avatar Jun 09 '25 05:06 mtcl

I tried modifying the command like this, but I get an error:

(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$

CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
    --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
    --alias ubergarm/DeepSeek-R1-0528-GGUF \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 -fa \
    -b 2048 -ub 2048 \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    -ot "blk\.(3)\.ffn_.*=CUDA0" \
    -ot "blk\.(5)\.ffn_.*=CUDA1" \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 57 \
    --host 0.0.0.0 \
    --port 10002
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
INFO [                    main] build info | tid="134935116726272" timestamp=1749447820 build=3737 commit="58f08e43"
INFO [                    main] system info | tid="134935116726272" timestamp=1749447820 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 6 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 52 key-value pairs and 1147 tensors from /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 0528
llama_model_loader: - kv   3:                            general.version str              = 0528
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek-R1
llama_model_loader: - kv   5:                         general.size_label str              = 256x21B
llama_model_loader: - kv   6:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   7:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   8:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv   9:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  10:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  11:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  12:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  13: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  15:                          general.file_type u32              = 339
llama_model_loader: - kv  16:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  17:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  18:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  19:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  20:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  21:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  22:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  23:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  24:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  25:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  26:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  27:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  28:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  29:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  30:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  31: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  32: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  38:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  41:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  42:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - kv  45:                      quantize.imatrix.file str              = /mnt/raid/models/ubergarm/DeepSeek-R1...
llama_model_loader: - kv  46:                   quantize.imatrix.dataset str              = ubergarm-imatrix-calibration-corpus-v...
llama_model_loader: - kv  47:             quantize.imatrix.entries_count i32              = 721
llama_model_loader: - kv  48:              quantize.imatrix.chunks_count i32              = 812
llama_model_loader: - kv  49:                                   split.no u16              = 0
llama_model_loader: - kv  50:                                split.count u16              = 7
llama_model_loader: - kv  51:                        split.tensors.count i32              = 1147
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0:  612 tensors
llama_model_loader: - type iq3_k_r4:  116 tensors
llama_model_loader: - type iq4_ks_r4:   58 tensors
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 129280
llm_load_print_meta: n_merges         = 127741
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 128
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 24576
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18432
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = IQ3_K_R4 - 3.4325 bpw
llm_load_print_meta: model params     = 672.050 B
llm_load_print_meta: model size       = 300.938 GiB (3.847 BPW) 
llm_load_print_meta: repeating layers = 299.104 GiB (3.834 BPW, 670.196 B parameters)
llm_load_print_meta: general.name     = DeepSeek R1 0528
llm_load_print_meta: BOS token        = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 131 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
llm_load_tensors: ggml ctx size =    1.40 MiB
Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_up_shexp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_up_shexp.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors:        CPU buffer size = 36486.67 MiB
llm_load_tensors:        CPU buffer size = 43905.23 MiB
llm_load_tensors:        CPU buffer size = 43534.23 MiB
llm_load_tensors:        CPU buffer size = 43534.23 MiB
llm_load_tensors:        CPU buffer size = 43905.23 MiB
llm_load_tensors:        CPU buffer size = 43534.23 MiB
llm_load_tensors:        CPU buffer size = 44473.21 MiB
llm_load_tensors:        CPU buffer size =   938.98 MiB
llm_load_tensors:      CUDA0 buffer size = 13995.99 MiB
llm_load_tensors:      CUDA1 buffer size = 13730.03 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =   592.89 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   573.76 MiB
llama_new_context_with_model: KV self size  = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1130415.93 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1185327012864
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '/media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf'
 ERR [              load_model] unable to load model | tid="134935116726272" timestamp=1749447880 model="/media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf"
Segmentation fault (core dumped)

mtcl avatar Jun 09 '25 05:06 mtcl

Try cmake -DGGML_SCHED_MAX_COPIES=1 ...
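If the full command sequence helps, a minimal rebuild sketch (only GGML_SCHED_MAX_COPIES=1 is the part being suggested here; the GGML_CUDA flag is an assumption matching the CUDA build used elsewhere in this thread):

cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j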

ikawrakow avatar Jun 09 '25 05:06 ikawrakow