ktransformers
convert_cpu_weights crashes on DeepSeek R1 0528
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
kt-kernel 0.4.1, Ubuntu 24.04. This server previously ran ktransformers 0.3.2 successfully. I created a new conda env for kt-kernel 0.4.1; in it, converting CPU weights for Qwen-30B and running it both succeeded, but converting DeepSeek R1 0528 to CPU weights fails.
Reproduction
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/output \
--quant-method int4
When processing reaches layer 55, it crashes with:
Processing layer 55 (53/59)...
Converting layer 55 with 256 experts via online quantization...
Loaded weights shapes:
gate_proj: torch.Size([256, 2048, 7168])
up_proj: torch.Size([256, 2048, 7168])
down_proj: torch.Size([256, 7168, 2048])
TP MOE layer 55, pool: 0x4019aca0, expert num: 256, num_experts_per_tok: 8
Creating AMX_MOE_TP 1 at numa 0
Creating AMX_MOE_TP 0 at numa 0
Creating "/opt/ai-models/r1/DeepSeek-R1-0528-CPU/_layer_55/_numa_1"
Creating "/opt/ai-models/r1/DeepSeek-R1-0528-CPU/_layer_55/_numa_0"
alloc 1 from other numa for 7160d0052660
From BF16
Segmentation fault (core dumped)
The error message gives very little detail. Does this script have any logging settings?
Others
I watched the memory usage keep growing, so the error looks like an OOM. This server has 768 GB of RAM; is that enough? How much memory is needed to convert DeepSeek R1 671B?
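For a rough sense of scale, here is a back-of-the-envelope estimate (my own arithmetic, not from the script; the converter processes one layer at a time, so peak RAM can be lower than the full checkpoint size):

```python
# Back-of-the-envelope memory estimate for DeepSeek R1 671B.
# Assumption: raw weights dominate; quantization scales, metadata,
# and per-layer working buffers are ignored.
params = 671e9
bf16_gib = params * 2 / 2**30    # source checkpoint in bf16 (2 bytes/param)
int4_gib = params * 0.5 / 2**30  # quantized int4 output (0.5 bytes/param)
print(f"bf16 source ~{bf16_gib:.0f} GiB, int4 output ~{int4_gib:.0f} GiB")
```

So the full bf16 checkpoint alone is well over 1 TiB, which is why layer-by-layer streaming matters on a 768 GB machine.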
For systems with insufficient memory to complete full model quantization, use the --no-merge-safetensor flag to keep weights in layer folder structure without merging into safetensor files:
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/output \
--quant-method int4 \
--no-merge-safetensor
Thanks very much. One more question: can I convert the CPU weights on a different server with 1 TB of memory, then copy them to my server and run them there?
So, does the cpu-weight file itself contain CPU architecture information? Can it be used on all x86 instruction set devices after a single conversion? If so, why does no one offer a direct download?
After upgrading to kt 0.4.1, the model files are completely incompatible with those in 0.3.2, right?
For systems with insufficient memory to complete full model quantization, use the --no-merge-safetensor flag to keep weights in layer folder structure without merging into safetensor files:
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/output \
--quant-method int4 \
--no-merge-safetensor
Sorry, I used --no-merge-safetensor on a 1 TB RAM server, and it crashed even earlier, at layer 46:
By the way, I also love playing Hollow Knight ^_^
Creating "/mnt/nvme1/DeepSeek-R1-0528-CPU/_layer_45/_numa_1"
Creating "/mnt/nvme1/DeepSeek-R1-0528-CPU/_layer_45/_numa_0"
From BF16
online quant from bf16
Layer 45 quantized and saved in 13.16s
Keeping layer folder structure at /mnt/nvme1/DeepSeek-R1-0528-CPU/_layer_45
Processing layer 46 (44/59)...
Converting layer 46 with 256 experts via online quantization...
Loaded weights shapes:
gate_proj: torch.Size([256, 2048, 7168])
up_proj: torch.Size([256, 2048, 7168])
down_proj: torch.Size([256, 7168, 2048])
TP MOE layer 46, pool: 0xc116860, expert num: 256, num_experts_per_tok: 8
Creating AMX_MOE_TP 1 at numa 1
Creating "/mnt/nvme1/DeepSeek-R1-0528-CPU/_layer_46/_numa_1"
Creating AMX_MOE_TP 0 at numa 0
Creating "/mnt/nvme1/DeepSeek-R1-0528-CPU/_layer_46/_numa_0"
From BF16
Segmentation fault (core dumped)
You may need to check the number of NUMA nodes on a 1TB RAM server using lscpu. If it has 4 NUMA nodes, you need to add --threadpool-count 4 when converting the CPU weights. After the conversion, the server you use should also be set to 4 NUMA nodes in BIOS.
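To check this, something like the following should work on any Linux box (the exact counts will of course differ per machine):

```shell
# Show NUMA topology; --threadpool-count should match the
# "NUMA node(s)" value on both the converting machine and the
# machine that will run the converted weights.
lscpu | grep -i "numa"
```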
By the way, I converted the int4 weights of DeepSeek R1 on a 768 GB RAM server and used them on a 384 GB server.
Here are my conversion parameters:
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type fp8 \
--output /path/to/output \
--quant-method int4 \
--cpuinfer-threads 60 \
--threadpool-count 4 \
--no-merge-safetensor
Would it be possible to also officially add a flag to resume conversion on a specific layer?
When I monkeypatched the script myself last month, even without merging the safetensors and saving at the end, the memory usage would still grow and eventually OOM on a 768 GB RAM server. However, I ended up hacking it to resume on the layer where it crashed and managed to convert successfully.
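The resume logic can be sketched roughly like this (a hypothetical helper, not part of the script; the _layer_<n> folder names follow the converter's output shown in the logs above):

```python
import os

def layer_already_converted(output_dir: str, layer_idx: int) -> bool:
    """Treat a layer as done if its per-layer output folder
    already exists and is non-empty."""
    layer_dir = os.path.join(output_dir, f"_layer_{layer_idx}")
    return os.path.isdir(layer_dir) and bool(os.listdir(layer_dir))

# In the conversion loop, a resume flag could then skip finished
# layers instead of re-quantizing them after a crash:
for layer_idx in range(0, 61):
    if layer_already_converted("/path/to/output", layer_idx):
        continue  # already on disk from the previous run
    # ... quantize and save this layer ...
```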
Would it be possible to also officially add a flag to resume conversion on a specific layer?
When I monkeypatched the script myself last month, even without merging the safetensors and saving at the end, the memory usage would still grow and eventually OOM on a 768 GB RAM server. However, I ended up hacking it to resume on the layer where it crashed and managed to convert successfully.
@DocShotgun I think this is possible, but we did not implement it. If you implement it, you can start a PR. We would be very thankful for it!
Would it be possible to also officially add a flag to resume conversion on a specific layer?
When I monkeypatched the script myself last month, even without merging the safetensors and saving at the end, the memory usage would still grow and eventually OOM on a 768 GB RAM server. However, I ended up hacking it to resume on the layer where it crashed and managed to convert successfully.
Good. Could you add this support via a PR? Otherwise we may have to find some time to implement it later.
I started a PR at https://github.com/kvcache-ai/ktransformers/pull/1630
Not sure how you feel about it printing every skipped layer lol
Good job. How about making this printing an option? (Default to false, so it won't print unless specified.)