
convert_cpu_weights DeepSeek R1 0528 crashed

Open mrgaolei opened this issue 3 weeks ago • 9 comments

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

kt-kernel 0.4.1, Ubuntu 24.04. This server ran ktransformers 0.3.2 successfully. I created a new conda env for kt-kernel 0.4.1; in that env, converting CPU weights for Qwen-30B and running it both worked, but converting DeepSeek R1 0528 to CPU weights fails.

Reproduction

python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type bf16 \
  --output /path/to/output \
  --quant-method int4

When it gets to layer 55, the script crashes and prints:

Processing layer 55 (53/59)...
Converting layer 55 with 256 experts via online quantization...
  Loaded weights shapes:
    gate_proj: torch.Size([256, 2048, 7168])
    up_proj: torch.Size([256, 2048, 7168])
    down_proj: torch.Size([256, 7168, 2048])
TP MOE layer 55, pool: 0x4019aca0, expert num: 256, num_experts_per_tok: 8
Creating AMX_MOE_TP 1 at numa 0
Creating AMX_MOE_TP 0 at numa 0
Creating "/opt/ai-models/r1/DeepSeek-R1-0528-CPU/_layer_55/_numa_1"Creating
"/opt/ai-models/r1/DeepSeek-R1-0528-CPU/_layer_55/_numa_0"
alloc 1 from other numa for 7160d0052660
From BF16
段错误 (核心已转储) [Segmentation fault (core dumped)]

The error message is very sparse; does this script have a logging setting to get more detail?

Others

I watched memory usage grow and grow, so the error looks like an OOM. This server has 768 GB of RAM; is that enough? How much memory is needed to convert DeepSeek R1 671B?
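
A quick way to confirm whether this is memory pressure rather than a bug in the conversion code (standard Linux checks, nothing specific to kt-kernel; the watch interval is arbitrary):

# Look for OOM-killer activity or segfault records in the kernel log around the crash time
sudo dmesg -T | grep -iE 'out of memory|oom-killer|killed process|segfault' | tail -n 20

# While the conversion runs, watch the top memory consumers (RSS is in KiB)
watch -n 5 'ps -eo pid,rss,cmd --sort=-rss | head -n 5'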

mrgaolei · Nov 18 '25 05:11

For systems with insufficient memory to complete full model quantization, use the --no-merge-safetensor flag to keep weights in layer folder structure without merging into safetensor files:

python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type bf16 \
  --output /path/to/output \
  --quant-method int4 \
  --no-merge-safetensor

ovowei · Nov 18 '25 08:11

Thanks very much. One more question: can I convert the CPU weights on another server with 1 TB of memory, then copy them to my server and run them there?

Also, does the CPU-weight file itself contain CPU-architecture information? After a single conversion, can it be used on any x86 device? If so, why does nobody offer a direct download?

After upgrading to kt 0.4.1, the model files are completely incompatible with those in 0.3.2, right?
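
As a side note on the portability question: the conversion log above shows AMX kernels being created, so before copying converted weights to another machine it seems worth at least checking that the target CPU reports AMX support (this is an assumption on my part; it does not settle whether the converted files themselves are architecture-specific):

# Print any AMX feature flags the CPU exposes (e.g. amx_tile, amx_int8, amx_bf16)
lscpu | grep -io 'amx[a-z0-9_]*' | sort -u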

mrgaolei · Nov 18 '25 09:11

Sorry, I used --no-merge-safetensor on a 1 TB RAM server and it crashed even earlier (at layer 46):

By the way, I also love playing Hollow Knight ^_^

Creating "/mnt/nvme1/DeepSeek-R1-0528-CPU/_layer_45/_numa_1"
Creating "/mnt/nvme1/DeepSeek-R1-0528-CPU/_layer_45/_numa_0"
From BF16
  online quant from bf16
  Layer 45 quantized and saved in 13.16s
  Keeping layer folder structure at /mnt/nvme1/DeepSeek-R1-0528-CPU/_layer_45
Processing layer 46 (44/59)...
Converting layer 46 with 256 experts via online quantization...
  Loaded weights shapes:
    gate_proj: torch.Size([256, 2048, 7168])
    up_proj: torch.Size([256, 2048, 7168])
    down_proj: torch.Size([256, 7168, 2048])
TP MOE layer 46, pool: 0xc116860, expert num: 256, num_experts_per_tok: 8
Creating AMX_MOE_TP 1 at numa 1
Creating "/mnt/nvme1/DeepSeek-R1-0528-CPU/_layer_46/_numa_1"
Creating AMX_MOE_TP 0 at numa 0
Creating "/mnt/nvme1/DeepSeek-R1-0528-CPU/_layer_46/_numa_0"
From BF16
段错误 (核心已转储) [Segmentation fault (core dumped)]

mrgaolei · Nov 18 '25 09:11

You may need to check the number of NUMA nodes on the 1 TB RAM server using lscpu. If it has 4 NUMA nodes, add --threadpool-count 4 when converting the CPU weights. After the conversion, the server you run the model on should also be configured for 4 NUMA nodes in the BIOS.
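
For reference, a quick way to check the NUMA topology mentioned above (lscpu alone is enough; numactl --hardware additionally shows per-node memory, if numactl is installed):

# Count and list NUMA nodes
lscpu | grep -i numa
numactl --hardware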

By the way, I converted the int4 weights of DeepSeek R1 on a 768 GB RAM server and use them on a 384 GB server.

Here are my conversion parameters:

python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type fp8 \
  --output /path/to/output \
  --quant-method int4 \
  --cpuinfer-threads 60 \
  --threadpool-count 4 \
  --no-merge-safetensor

raidshoebox1 · Nov 18 '25 12:11

Would it be possible to also officially add a flag to resume conversion on a specific layer?

When I monkeypatched the script myself last month, even without merging the safetensors and saving at the end, memory usage would still grow and eventually OOM on a 768 GB RAM server. However, I hacked it to resume on the layer where it crashed and managed to convert successfully.
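
Not the monkeypatch itself, but a minimal sketch of how one might find where a crashed run stopped, assuming the --no-merge-safetensor layer-folder layout shown earlier in this thread (/path/to/output is a placeholder):

# List the highest-numbered _layer_* folders the crashed run managed to write
ls -d /path/to/output/_layer_* 2>/dev/null | sort -V | tail -n 3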

DocShotgun · Nov 18 '25 21:11

@DocShotgun I think this is possible, but we did not implement it. If you have implemented it, please open a PR; we would be very thankful!

ErvinXie · Nov 19 '25 07:11

Good. Could you add this support via a PR? Otherwise we may have to find time to implement it later.

KMSorSMS · Nov 20 '25 09:11

I started a PR at https://github.com/kvcache-ai/ktransformers/pull/1630

Not sure how you feel about it printing every skipped layer lol

DocShotgun · Nov 20 '25 18:11

Good job. How about making this printing an option? (Default it to false so it won't print unless specified.)

KMSorSMS · Nov 21 '25 03:11

#1630

KMSorSMS · Nov 22 '25 13:11