
Reproduced it, but DeepSeek-R1-Q4_K_M runs very slowly at only about 1.5 tokens/s. Could this be due to my configuration?

Open JeffyLapter opened this issue 10 months ago • 30 comments

My configuration is as follows: CPU: Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz; GPU: two NVIDIA A800 80GB; RAM: 503 GB

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
Stepping:                        7
CPU MHz:                         800.431
CPU max MHz:                     3200.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4200.00
Virtualization:                  VT-x
L1d cache:                       1 MiB
L1i cache:                       1 MiB
L2 cache:                        32 MiB
L3 cache:                        44 MiB
NUMA node0 CPU(s):               0-15,32-47
NUMA node1 CPU(s):               16-31,48-63
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush d
                                 ts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc 
                                 art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pn
                                 i pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm p
                                 cid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c r
                                 drand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single inte
                                 l_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vp
                                 id ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx5
                                 12f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl 
                                 xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local 
                                 dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_v
                                 nni md_clear flush_l1d arch_capabilities
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800 80GB PCIe          Off | 00000000:18:00.0 Off |                    0 |
| N/A   71C    P0             101W / 300W |  66675MiB / 81920MiB |      5%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80GB PCIe          Off | 00000000:AF:00.0 Off |                    0 |
| N/A   71C    P0              88W / 300W |  37095MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
inxi -m
Memory:    RAM: total: 503.55 GiB used: 56.39 GiB (11.2%) 
           Array-1: capacity: 2 TiB slots: 24 EC: Single-bit ECC 
           Device-1: CPU1_DIMM_A1 size: 32 GiB speed: 2933 MT/s 
           Device-2: CPU1_DIMM_A2 size: 32 GiB speed: 2933 MT/s 
           Device-3: CPU1_DIMM_B1 size: 32 GiB speed: 2933 MT/s 
           Device-4: CPU1_DIMM_B2 size: 32 GiB speed: 2933 MT/s 
           Device-5: CPU1_DIMM_C1 size: No Module Installed 
           Device-6: CPU1_DIMM_C2 size: No Module Installed 
           Device-7: CPU1_DIMM_D1 size: 32 GiB speed: 2933 MT/s 
           Device-8: CPU1_DIMM_D2 size: 32 GiB speed: 2933 MT/s 
           Device-9: CPU1_DIMM_E1 size: 32 GiB speed: 2933 MT/s 
           Device-10: CPU1_DIMM_E2 size: 32 GiB speed: 2933 MT/s 
           Device-11: CPU1_DIMM_F1 size: No Module Installed 
           Device-12: CPU1_DIMM_F2 size: No Module Installed 
           Device-13: CPU2_DIMM_A1 size: 32 GiB speed: 2933 MT/s 
           Device-14: CPU2_DIMM_A2 size: 32 GiB speed: 2933 MT/s 
           Device-15: CPU2_DIMM_B1 size: 32 GiB speed: 2933 MT/s 
           Device-16: CPU2_DIMM_B2 size: 32 GiB speed: 2933 MT/s 
           Device-17: CPU2_DIMM_C1 size: No Module Installed 
           Device-18: CPU2_DIMM_C2 size: No Module Installed 
           Device-19: CPU2_DIMM_D1 size: 32 GiB speed: 2933 MT/s 
           Device-20: CPU2_DIMM_D2 size: 32 GiB speed: 2933 MT/s 
           Device-21: CPU2_DIMM_E1 size: 32 GiB speed: 2933 MT/s 
           Device-22: CPU2_DIMM_E2 size: 32 GiB speed: 2933 MT/s 
           Device-23: CPU2_DIMM_F1 size: No Module Installed 
           Device-24: CPU2_DIMM_F2 size: No Module Installed 

At first I launched it directly with the following command:

python ./ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-R1 --gguf_path ./DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/ --optimize_rule_path /kt/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-R1-Dbg.yaml --cpu_infer 64 --max_new_tokens 10000 --force_think true

At that point it ran at roughly 0.4 t/s.

Image

Then I switched the optimize rule to ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu to bring the second GPU into play; the speed was about 1.4 t/s.

Image

After that, following the FAQ, I made some modifications based on ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml to run on my two A800s, but the speed still would not improve. The launch command is as follows:

numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-R1 --gguf_path ./DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/ --optimize_rule_path /kt/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-R1-moe.yaml --cpu_infer 16 --max_new_tokens 10000 --force_think true --use_cuda_graph=False

This time it reached roughly 1.4 t/s.

DeepSeek-R1-moe.yaml is as follows:

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

# === Rotary Embedding Replacement ===

# GPU 0: layers 0–14
- match:
    name: "^model\\.layers\\.([0-9]|1[0-4])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

# GPU 1: layers 15–29
- match:
    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

# GPU 2: layers 30–44
- match:
    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

# GPU 3: layers 45–60
- match:
    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

# === Linear Layers Replacement (excluding self_attn.kv_b_proj) ===

# GPU 0: layers 0–14
- match:
    name: "^model\\.layers\\.([0-9]|1[0-4])\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

# GPU 1: layers 15–29
- match:
    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

# GPU 2: layers 30–44
- match:
    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

# GPU 3: layers 45–60
- match:
    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

# === MLP (MoE) Replacement ===

# GPU 0: layers 0–14
- match:
    name: "^model\\.layers\\.([0-9]|1[0-4])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

# GPU 1: layers 15–29
- match:
    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

# GPU 2: layers 30–44
- match:
    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

# GPU 3: layers 45–60
- match:
    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

# === MLP Gate Replacement ===

# GPU 0: layers 0–14
- match:
    name: "^model\\.layers\\.([0-9]|1[0-4])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

# GPU 1: layers 15–29
- match:
    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

# GPU 2: layers 30–44
- match:
    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

# GPU 3: layers 45–60
- match:
    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

# === MLP Experts Replacement ===
# Replace with marlin expert. Open and modify layer-num as needed.
# Each layer of marlin experts takes about 6GB of GPU memory.
# !!!Do remember 'close' cuda graph if you are using marlin expert.!!!
# !!!KExpertsTorch is untested, we don't have enough VRAM.!!!
# GPU 0: layers 3–4  2
#- match:
#    name: "^model\\.layers\\.([3-4])\\.mlp\\.experts$"
#  replace:
#    class: ktransformers.operators.experts.KTransformersExperts
#    kwargs:
#      generate_device: "cuda:0"
#      generate_op:  "KExpertsMarlin"
#  recursive: False
# GPU 1: layers 15–17 2
- match:
    name: "^model\\.layers\\.(1[5-7])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:1"
      generate_op:  "KExpertsMarlin"
  recursive: False
# GPU 2: layers 30–32 2
- match:
    name: "^model\\.layers\\.(3[0-2])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"
      generate_op:  "KExpertsMarlin"
  recursive: False
# GPU 3: layers 45–46 2
#- match:
#    name: "^model\\.layers\\.(4[5-6])\\.mlp\\.experts$"
#  replace:
#    class: ktransformers.operators.experts.KTransformersExperts
#    kwargs:
#      generate_device: "cuda:1"
#      generate_op:  "KExpertsMarlin"
#  recursive: False


# === MLP Experts Replacement ===

# GPU 0: layers 0–14
- match:
    name: "^model\\.layers\\.([0-9]|1[0-4])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False

# GPU 1: layers 15–29
- match:
    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:1"
  recursive: False

# GPU 2: layers 30–44
- match:
    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False

# GPU 3: layers 45–60
- match:
    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:1"
  recursive: False

# === Self-Attention Replacement ===

# GPU 0: layers 0–14
- match:
    name: "^model\\.layers\\.([0-9]|1[0-4])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

# GPU 1: layers 15–29
- match:
    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

# GPU 2: layers 30–44
- match:
    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

# GPU 3: layers 45–60
- match:
    name: "^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

# === Overall Model Replacement with Transfer Map ===

- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0 # 0 means close layer‐wise prefill
      transfer_map:
        15: "cuda:1" # Layers 15+ on GPU 1
        30: "cuda:0" # Layers 30+ on GPU 2
        45: "cuda:1" # Layers 45+ on GPU 3

# === Default Catch-All for Other Modules ===

# GPU 0: layers 0–14
- match:
    name: "^model\\.layers\\.([0-9]|1[0-4])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

# GPU 1: layers 15–29
- match:
    name: "^model\\.layers\\.(1[5-9]|2[0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

# GPU 2: layers 30–44
- match:
    name: "^model\\.layers\\.(3[0-9]|4[0-4])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

# don't inject lm_head if already inject marlin experts

# For final modules (model.norm and lm_head), ensure they are on GPU 3 (as in your original config)
- match:
    name: "(^model\\.layers\\.(4[5-9]|5[0-9]|60)\\.)|(^model\\.norm)|(^lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
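
For anyone adapting this file to a different GPU count: a quick way to sanity-check that the layer-range regexes above cover all 61 layers exactly once is a few lines of standalone Python (this only tests the patterns, it does not touch ktransformers):

import re

# The four layer ranges used throughout the YAML above.
patterns = {
    "cuda:0 (layers 0-14)":  r"^model\.layers\.([0-9]|1[0-4])\.",
    "cuda:1 (layers 15-29)": r"^model\.layers\.(1[5-9]|2[0-9])\.",
    "cuda:0 (layers 30-44)": r"^model\.layers\.(3[0-9]|4[0-4])\.",
    "cuda:1 (layers 45-60)": r"^model\.layers\.(4[5-9]|5[0-9]|60)\.",
}

for layer in range(61):
    name = f"model.layers.{layer}.self_attn"
    hits = [label for label, pat in patterns.items() if re.match(pat, name)]
    assert len(hits) == 1, f"layer {layer} matched {hits}"

print("all 61 layers map to exactly one device rule")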

Overall the speed is around 1.4-1.5 t/s.

Image

DeepSeek build used: DeepSeek-R1-Q4_K_M from unsloth/DeepSeek-R1-GGUF. My ktransformers version: 0.2.1+cu123torch26fancy. My CUDA version: 12.3.

According to the authors' setup their memory bandwidth is around 600 GB/s. By my calculation my peak memory bandwidth should be roughly 93.86 GB/s per socket, or 187.72 GB/s across both sockets. The original authors reach about 12 t/s, so my upper bound should be around 3 t/s. Why does it top out at only 1.5 t/s?
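
For reference, the arithmetic behind those numbers (a rough sketch; it assumes 8 bytes per transfer per DDR4 channel and the 4-of-6 populated channels per socket visible in the inxi output above):

# Back-of-the-envelope DDR4 bandwidth estimate.
mt_per_s = 2933          # DDR4-2933: mega-transfers per second per channel
bytes_per_transfer = 8   # 64-bit memory channel

def socket_bw_gb_s(channels):
    return mt_per_s * bytes_per_transfer * channels / 1000  # GB/s

print(f"4 channels/socket : {socket_bw_gb_s(4):.2f} GB/s")      # ~93.86, the figure above
print(f"6 channels/socket : {socket_bw_gb_s(6):.2f} GB/s")      # ~140.78 if all channels were populated
print(f"2 sockets, 4 ch   : {2 * socket_bw_gb_s(4):.2f} GB/s")  # ~187.71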

JeffyLapter avatar Feb 19 '25 02:02 JeffyLapter

Hey, I'm also on version 0.2.1 and keep getting the following error, with no idea what's going on:

Building wheels for collected packages: ktransformers
  Building wheel for ktransformers (pyproject.toml) ... done
  Created wheel for ktransformers: filename=ktransformers-0.2.1-cp310-cp310-linux_x86_64.whl size=28304186 sha256=464a7862e6d69804b26bb40cd8681e2fc3883513a2db026d01c3e0b147b0d430
  Stored in directory: /root/.cache/pip/wheels/ed/7a/7a/f8905ab90c6c356c64ba6284fa2ce0cf84c5610639299afa81
WARNING: Built wheel for ktransformers is invalid: Wheel has unexpected file name: expected '0.2.1+cu124torch24fancy', got '0.2.1'
Failed to build ktransformers
ERROR: Failed to build installable wheels for some pyproject.toml based projects (ktransformers)

jinec avatar Feb 19 '25 05:02 jinec

(quoting jinec's wheel build error above)

Did you download the wheel manually and then install it?

JeffyLapter avatar Feb 19 '25 06:02 JeffyLapter

I reproduced this too and it is also very slow. I have two A100 GPUs and the same model as you; when I ask a question it takes ages for a single character to appear. Not solved yet, so I am following this issue. If you figure it out, please share the solution.

yimisiyang avatar Feb 19 '25 06:02 yimisiyang

Using a single GPU should be much faster.

GuardSkill avatar Feb 19 '25 06:02 GuardSkill

(quoting yimisiyang's comment above about slow speed on two A100s)

A few points. First, the CPU model, and set the core count to the number of physical cores. Second, the server's memory bandwidth (you only reach the maximum when every DIMM slot is populated; my current setup works out to roughly 384 GB/s). Beyond that I am not sure either. I am using v0.2.1; for v0.3 I am downloading the full BF16 model, the environment is already set up and I will try it then.

export USE_NUMA=1  (I have 2 NUMA nodes here and am not sure what this should be set to; only when I export USE_NUMA=2 does the CPU reach full load)
ktransformers --model_path xxx/DeepSeek-R1-GGUF --gguf_path xxx/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M --cpu_infer 97 --max_new_tokens 4000

GPU: 2 × A100 80GB; RAM: 8 × 64 GB DDR5; CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes

Performance(T/s): prefill 102.44033021701229, decode 5.720198065231109.

Zongru-Wang avatar Feb 19 '25 06:02 Zongru-Wang

Running it like this directly turned out faster:

numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path /data/model/models--deepseek-ai--DeepSeek-R1/snapshots/8a58a132790c9935686eb97f042afa8013451c9f/ --gguf_path /data/gguf_model/DeepSeek-R1-Q4_K_M --optimize_rule_path /data/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-R1-Chat.yaml --cpu_infer 30 --max_new_tokens 1000

Here DeepSeek-R1-Chat.yaml is simply a copy of DeepSeek-V3-Chat.yaml.

This is the inference speed for this run:

Chat: What is 1+1?
Hmm, the user is asking what 1+1 equals, which looks like a very basic math question. Still, as a newly launched AI I need to think carefully and make sure the answer is accurate. First I should check whether the question has a hidden meaning, or whether it is testing my basic arithmetic ability.

From a mathematical standpoint, 1 plus 1 does equal 2 in decimal. It is one of the most basic arithmetic operations, usually learned in kindergarten or first grade. Sometimes there are other cases to consider, though: in different number systems the result can differ. For example, in binary 1+1 equals 10, but the question does not mention binary, so decimal should be the default.

Next, I need to consider the user's possible intent. They may genuinely be asking a simple question, or they may have a deeper purpose, such as testing my response speed or seeing whether a simple question can trip me up. They may also want to confirm whether my answers are reliable, especially on basic questions.

I should also consider other possibilities, such as logic systems or other mathematical theories in which 1+1 could have a different result. For example, in Boolean algebra 1 represents true, so 1+1 might still be 1, because true OR true is still true. But addition in Boolean algebra usually means logical OR, which is different from arithmetic addition. The user is clearly asking about arithmetic addition, so this case can be ruled out.

It is also possible the user is joking, or wants to see whether I will give a humorous answer such as the riddle "1+1 equals a window". But as an AI my first priority is to provide an accurate answer rather than guess at hidden humorous intent, unless there is an obvious hint.

Checking for linguistic ambiguity: in Chinese, "1+1" normally just means arithmetic addition and has no other meaning. So all things considered, the most reasonable answer is 2. To be absolutely sure, I can review basic mathematical principles and confirm that in standard arithmetic 1+1 does equal 2, with no exceptions.

Finally, the user might be a child just starting to learn math, or might simply want to confirm that my answer is correct, so giving a clear and correct answer is the most important thing. There is no need to overcomplicate it; answering 2 directly is enough, and I can add that this is a basic arithmetic result to reinforce the user's confidence.
</think>

1+1 equals 2. This is a basic arithmetic result that follows the rules of decimal mathematics. In other bases (such as binary) the result would differ, but decimal is the default in ordinary usage, so the correct answer is 2.
prompt eval count:    11 token(s)
prompt eval duration: 1.317887783050537s
prompt eval rate:     8.346689408212066 tokens/s
eval count:           485 token(s)
eval duration:        85.2873182296753s
eval rate:            5.686660221791881 tokens/s
Chat: 

yimisiyang avatar Feb 19 '25 07:02 yimisiyang

Injecting model.norm as default
Injecting lm_head as default
loading token_embd.weight to cpu
Traceback (most recent call last):
  File "/opt/conda/bin/ktransformers", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/server/main.py", line 114, in main
    create_interface(config=cfg, default_args=cfg)
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/server/utils/create_interface.py", line 27, in create_interface
    GlobalInterface.interface = BackendInterface(default_args)
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/server/backend/interfaces/ktransformers.py", line 47, in __init__
    optimize_and_load_gguf(self.model, optimize_rule_path, gguf_path, config)
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/optimize/optimize.py", line 129, in optimize_and_load_gguf
    load_weights(module, gguf_loader)
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 83, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 85, in load_weights
    module.load()
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/operators/base_operator.py", line 60, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 83, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 81, in load_weights
    load_cur_state_dict(module, gguf_loader, prefix)
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 71, in load_cur_state_dict
    weights = gguf_loader.load_gguf_tensor(translated_key, device = device).to(dtype = target_dtype)
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/custom_gguf.py", line 298, in load_gguf_tensor
    values = GGML_DEQUANTIZE[ggml_name](data)
  File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/custom_gguf.py", line 462, in dequantize_q4_k
    data_f16 = np.frombuffer(data, dtype=np.float16).reshape(num_blocks, block_size // 2)
ValueError: cannot reshape array of size 142615888 into shape (1980776,72)

Using the official 0.2.0 image I get this error and do not know why. Has anyone else run into it?

haojiubujian1985 avatar Feb 19 '25 07:02 haojiubujian1985

@JeffyLapter Hey, I followed the install guide at https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/install.md and it is pretty much solved now, thanks.

jinec avatar Feb 19 '25 07:02 jinec

(quoting yimisiyang's reply above: the single-GPU DeepSeek-R1-Chat.yaml command and its ~5.69 tokens/s result)

Do you mean not using the dual-GPU optimize file? Is a single GPU faster than two? My optimize file was adapted from DeepSeek-V3-Chat-multi-gpu-4.yaml, changing cuda:2 and cuda:3 to cuda:0 and cuda:1 respectively.

JeffyLapter avatar Feb 19 '25 07:02 JeffyLapter

(quoting the exchange above: yimisiyang's single-GPU run and the question about the dual-GPU optimize file)

I just tried it; run this way it is even slower, down to 0.2 t/s. Not sure why.

numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-R1 --gguf_path ./DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/ --optimize_rule_path /kt/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --cpu_infer 30 --max_new_tokens 1000 --force_think true 
Image

JeffyLapter avatar Feb 19 '25 08:02 JeffyLapter

There are two possible reasons:

  1. Not enough GPU compute is being used (is it hard-coded to use only 16 GB of VRAM?)
  2. The CPU has no AMX instruction extension

sweihub avatar Feb 19 '25 09:02 sweihub

Once this version is running, how do you actually use it?

AK760 avatar Feb 19 '25 15:02 AK760

  1. During inference I keep top open to watch CPU utilization and tune the cpu_infer parameter; pushing CPU utilization higher makes it faster (see the sketch below).
  2. I also tried putting the embedding layer on CUDA, but it errors out; too many to('cpu') calls are hard-coded in the model code and would need to be changed one by one before more weights could be moved onto the GPU.
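
A minimal sketch of that tuning loop (assumption-heavy: the paths are the ones used earlier in this thread, and local_chat is assumed to answer the single prompt fed on stdin and then exit at end of input; the "eval rate" line it parses is the one shown in the logs above):

import re
import subprocess

# Base command; paths follow the ones used earlier in this thread -- adjust to yours.
BASE_CMD = [
    "python", "./ktransformers/local_chat.py",
    "--model_path", "deepseek-ai/DeepSeek-R1",
    "--gguf_path", "./DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/",
    "--max_new_tokens", "200",
]

# Try a few --cpu_infer values and record decode throughput from the
# "eval rate: ... tokens/s" line that local_chat prints.
for cpu_infer in (16, 30, 48, 62):
    proc = subprocess.run(
        BASE_CMD + ["--cpu_infer", str(cpu_infer)],
        input="What is 1+1?\n", capture_output=True, text=True,
    )
    m = re.search(r"eval rate:\s*([\d.]+)", proc.stdout)
    print(f"cpu_infer={cpu_infer}: {m.group(1) if m else 'no eval rate found'} tokens/s")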

Wilbur0626 avatar Feb 20 '25 03:02 Wilbur0626

(quoting Wilbur0626's two points above)

The embedding layer does not need to be put on the CPU; its computation is sparse, so the benefit is very low.

Azure-Tang avatar Feb 20 '25 04:02 Azure-Tang

(quoting the exchange above)

OK, thanks. I am not very familiar with the network structure and misspoke: it is not ktransformers.operators.RoPE.YarnRotaryEmbeddingV3 but model.embed_tokens; that is the part I tried to put on CUDA.

Wilbur0626 avatar Feb 20 '25 06:02 Wilbur0626

(quoting the exchange above)

My GPUs still have a lot of idle capacity, and I would like to know which parts are worth moving onto them. The main goal is to raise GPU utilization so that cpu_infer can be increased further, which should give another speedup.

Wilbur0626 avatar Feb 20 '25 06:02 Wilbur0626

(quoting the exchange above about moving more weights onto the GPU)

Probably not much more can be done. The bulk is the expert layers, but putting the experts on the GPU means cuda_graph cannot be used, so it ends up slower than leaving them on the CPU.

yileld avatar Feb 20 '25 06:02 yileld

(quoting the exchange above)

Ah, so it is not that simple. The experts cannot be placed separately, and they do not fit on the GPU, which is why they are kept on the CPU. Thanks a lot, learned something.

Wilbur0626 avatar Feb 20 '25 06:02 Wilbur0626

One thing I noticed: the first few rounds of conversation are quite slow, but the speed gradually improves after a few more questions...

tianwaifeidie avatar Feb 20 '25 07:02 tianwaifeidie

Your DIMM population is probably not optimal. In theory each CPU has 6 memory channels, 12 channels in total across 24 slots. I would suggest installing 12 identical DIMMs, one per channel, or filling all 24 slots; anything else hurts memory performance.

yeungtuzi avatar Feb 20 '25 07:02 yeungtuzi

(quoting the earlier exchange about moving the embedding layer)

That is what I meant: the embedding layer does not need to be moved from CPU to GPU. I got the replies a bit mixed up.

I think what this commenter said is most likely your problem:

(quoting yeungtuzi's advice above about populating all memory channels)

Azure-Tang avatar Feb 20 '25 08:02 Azure-Tang

One thing I noticed: the first few rounds of conversation are quite slow, but the speed gradually improves after a few more questions...

It needs to warm up.

yansiyu550 avatar Feb 20 '25 08:02 yansiyu550

Hi everyone. So one path points at deepseek-ai/DeepSeek-R1 and the other at the GGUF DeepSeek-R1-Q4_K_M, right? I have not figured out what goes where. If it needs the original deepseek-ai/DeepSeek-R1, that is really hard to download.

jinec avatar Feb 20 '25 10:02 jinec

I have not figured out what goes where

Here is my command for reference:

python -m ktransformers.local_chat --model_path /workspace/Deepseek-models/DeepSeek-R1 --gguf_path /workspace/Deepseek-models/DeepSeek-R1-Q4_K_M --cpu_infer 65 --max_new_tokens 1000

root@lts-4090:/workspace/ktransformers# ls /workspace/Deepseek-models/DeepSeek-R1
config.json  configuration_deepseek.py  tokenizer.json  configuration.json  generation_config.json  tokenizer_config.json

root@lts-4090:/workspace/ktransformers# ls /workspace/Deepseek-models/DeepSeek-R1-Q4_K_M
DeepSeek-R1-Q4_K_M-00001-of-00009.gguf  DeepSeek-R1-Q4_K_M-00006-of-00009.gguf
DeepSeek-R1-Q4_K_M-00002-of-00009.gguf  DeepSeek-R1-Q4_K_M-00007-of-00009.gguf
DeepSeek-R1-Q4_K_M-00003-of-00009.gguf  DeepSeek-R1-Q4_K_M-00008-of-00009.gguf
DeepSeek-R1-Q4_K_M-00004-of-00009.gguf  DeepSeek-R1-Q4_K_M-00009-of-00009.gguf
DeepSeek-R1-Q4_K_M-00005-of-00009.gguf

yansiyu550 avatar Feb 20 '25 11:02 yansiyu550

I have two A800 80GB GPUs and a 72-core CPU. I can max out the CPU, but GPU utilization is only about 10% and the computation is essentially all on the CPU. How do I actually use the GPUs? 90% of their capacity is idle and only about 10 GB of VRAM is in use. How should this be tuned? Does anyone have ideas?

Using this option makes it a bit faster: --optimize_rule_path ./ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml

prompt eval count:    60 token(s)
prompt eval duration: 3.1178791522979736s
prompt eval rate:     19.243850408948063 tokens/s
eval count:           3673 token(s)
eval duration:        690.929929971695s
eval rate:            5.316023869671517 tokens/s

txg1550759 avatar Feb 20 '25 14:02 txg1550759

(quoting txg1550759's comment above)

I ran into the same problem. I have two 24 GB A5000s, but only one GPU sits at about 30% utilization. I saw another issue mention that using --optimize_rule_path ./ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml just turns 30% on one GPU into 15% on each of two, with no change in total utilization (https://github.com/kvcache-ai/ktransformers/issues/537#issuecomment-2673263660). Also, is there any way for KTransformers to support parallelism right now (e.g. multiple async API queries)? Could anyone help answer this? @Azure-Tang

jeremyzhangsq avatar Feb 21 '25 04:02 jeremyzhangsq

(quoting the exchange above about low GPU utilization and parallel requests)

At the moment multi-GPU support is just a simple pipeline, so people can use the extra VRAM to open a larger context; it does not help inference speed. Also, we currently support only a single concurrent request, not parallel serving. The current architecture is mainly bottlenecked by CPU-side compute and memory bandwidth, so even once we support TP there will not be a big improvement.

Azure-Tang avatar Feb 27 '25 06:02 Azure-Tang

Take DeepSeek-V3-Chat-multi-gpu-4.yaml as an example.

Just uncomment the officially commented-out sections below (do not change the configuration that comes after them; the official docs note that rules defined earlier take priority). Each layer needs roughly 6 GB (e.g. layers 3-4 would be about 12 GB), so add layers according to your own situation.

Also, read the official documentation carefully: https://kvcache-ai.github.io/ktransformers/en/injection_tutorial.html

# === MLP Experts Replacement ===
# replace with marlin expert. Open and modify layer-num as needed.
# Each layer of marlin experts takes about 6GB of GPU memory.
# !!!Do remember 'close' cuda graph if you are using marlin expert.!!!
# !!!KExpertsTorch is untested, we don't have enough VRAM.!!!

# GPU 0: layers 3–4
# - match:
#     name: "^model\\.layers\\.([3-4])\\.mlp\\.experts$"
#   replace:
#     class: ktransformers.operators.experts.KTransformersExperts
#     kwargs:
#       generate_device: "cuda:0"
#       generate_op:  "KExpertsMarlin"
#   recursive: False

# # GPU 1: layers 15–17
# - match:
#     name: "^model\\.layers\\.(1[5-7])\\.mlp\\.experts$"
#   replace:
#     class: ktransformers.operators.experts.KTransformersExperts
#     kwargs:
#       generate_device: "cuda:1"
#       generate_op:  "KExpertsMarlin"
#   recursive: False

# # GPU 2: layers 30–32
# - match:
#     name: "^model\\.layers\\.(3[0-2])\\.mlp\\.experts$"
#   replace:
#     class: ktransformers.operators.experts.KTransformersExperts
#     kwargs:
#       generate_device: "cuda:2"
#       generate_op:  "KExpertsMarlin"
#   recursive: False

# # GPU 3: layers 45–46
# - match:
#     name: "^model\\.layers\\.(4[5-6])\\.mlp\\.experts$"
#   replace:
#     class: ktransformers.operators.experts.KTransformersExperts
#     kwargs:
#       generate_device: "cuda:3"
#       generate_op:  "KExpertsMarlin"
#   recursive: False
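
A rough VRAM sizing sketch for deciding how many of those expert layers to uncomment (the numbers are illustrative assumptions; only the ~6 GB per layer figure comes from the comment above):

# How many extra MoE layers (as Marlin experts) might fit on one GPU.
total_vram_gb = 80        # e.g. an A800 80GB
already_used_gb = 20      # hypothetical: weights + KV cache already resident
gb_per_expert_layer = 6   # from the official comment above

fit = (total_vram_gb - already_used_gb) // gb_per_expert_layer
print(f"roughly {fit} expert layers could be moved onto this GPU")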

zwbsxwt avatar Feb 28 '25 08:02 zwbsxwt

(quoting jinec's question above about what goes in model_path vs gguf_path)

What do you mean? Is this about downloading the GGUF? I downloaded it from https://www.modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/files (a domestic mirror), which is fairly fast.

forMwish avatar Mar 12 '25 07:03 forMwish

Probably not enough memory bandwidth; you would want DDR5 memory.

risannoheya avatar Jul 21 '25 05:07 risannoheya