
KeyError: 'model.layers.28.mlp.experts.w2_weight_scale_inv' in 4*H20

Open BoBo0037 opened this issue 10 months ago • 4 comments

Hi, I am using 4 × H20 nodes (8 cards each) to run DeepSeek-R1, and I first get `ValueError: Weight output_partition_size = 576 is not divisible by weight quantization block_n = 128`.

So I tried removing the following from `config.json`:

```json
"quantization_config": {
  "activation_scheme": "dynamic",
  "fmt": "e4m3",
  "quant_method": "fp8",
  "weight_block_size": [128, 128]
},
```

But then I get a new error:

```
[2025-02-14 07:48:15 TP6] Scheduler hit an exception:
Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1787, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 240, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 186, in __init__
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 307, in load_model
    self.model = get_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 362, in load_model
    model.load_weights(self._get_all_weights(model_config, model))
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 920, in load_weights
    param = params_dict[name]
KeyError: 'model.layers.28.mlp.experts.w2_weight_scale_inv'
```
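The KeyError itself is expected once `quantization_config` is deleted: the FP8 checkpoint on disk still contains `*_weight_scale_inv` scale tensors, but a model built without the quantization config never registers matching parameters, so the loader's `params_dict[name]` lookup fails. A minimal sketch of that mismatch (hypothetical dictionaries, not SGLang's actual loader code):

```python
# Parameters the model registers when quantization_config is absent
# (no FP8 scale parameters are created).
params_dict = {
    "model.layers.28.mlp.experts.w2_weight": object(),
}

# Tensor names still present in the FP8 checkpoint on disk.
checkpoint_names = [
    "model.layers.28.mlp.experts.w2_weight",
    "model.layers.28.mlp.experts.w2_weight_scale_inv",  # scale tensor
]

try:
    for name in checkpoint_names:
        _ = params_dict[name]  # mirrors `param = params_dict[name]` in load_weights
except KeyError as missing:
    # The scale tensor has no destination parameter in the model.
    print(f"KeyError: {missing}")
```

So editing `config.json` only moves the failure from weight sharding to weight loading; the checkpoint and the config have to agree.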

Does anyone know how to fix it? Thanks.
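For context, the earlier ValueError is plain arithmetic. Assuming the stock DeepSeek-R1 `config.json` values (dense-MLP `intermediate_size = 18432`, `weight_block_size = [128, 128]`), each TP rank gets `18432 / tp` output columns, and that partition must be a whole number of 128-wide quantization blocks:

```python
# Per-rank partition of the dense MLP intermediate dimension under TP.
# Values assumed from DeepSeek-R1's config.json.
intermediate_size = 18432
block_n = 128

for tp in (16, 32):
    partition = intermediate_size // tp
    ok = partition % block_n == 0
    print(f"tp={tp}: output_partition_size={partition}, "
          f"divisible by block_n={block_n}: {ok}")
# tp=16 -> 1152 (divisible); tp=32 -> 576 (not divisible: the reported error)
```

The `576` in the error message matches `18432 / 32` exactly, which is why the failure only appears at TP 32.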

BoBo0037 avatar Feb 14 '25 08:02 BoBo0037

I launch the 4 nodes as follows:

Node 1:

```shell
docker run --gpus all \
  -it \
  --rm \
  --name sglang_node_1 \
  -v /data/deepseek-r1:/root/deepseek-r1 \
  --env HF_ENDPOINT="https://hf-mirror.com" \
  --env "GLOO_SOCKET_IFNAME=ens12f0np0" \
  --env "NCCL_SOCKET_IFNAME=ens12f0np0" \
  --ipc=host \
  --network=host \
  --shm-size 32g \
  registry.cn-shanghai.aliyuncs.com/shengsuan/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path /root/deepseek-r1 \
    --dist-init-addr 10.0.251.137:50000 \
    --tp 32 \
    --nnodes 4 \
    --node-rank 0 \
    --trust-remote-code \
    --enable-cache-report \
    --log-requests \
    --host 0.0.0.0 \
    --port 30000 \
    --context-length 64128 \
    --kv-cache-dtype auto \
    --schedule-conservativeness 0.3 \
    --max-running-requests 1024
```

Node 2:

```shell
docker run --gpus all \
  -it \
  --rm \
  --name sglang_node_2 \
  -v /deepseek-r1:/root/deepseek-r1 \
  --env HF_ENDPOINT="https://hf-mirror.com" \
  --env "GLOO_SOCKET_IFNAME=ens12f0np0" \
  --env "NCCL_SOCKET_IFNAME=ens12f0np0" \
  --ipc=host \
  --network=host \
  --shm-size 32g \
  registry.cn-shanghai.aliyuncs.com/shengsuan/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path /root/deepseek-r1 \
    --dist-init-addr 10.0.251.137:50000 \
    --tp 32 \
    --nnodes 4 \
    --node-rank 1 \
    --trust-remote-code \
    --enable-cache-report \
    --log-requests \
    --context-length 64128 \
    --kv-cache-dtype auto \
    --schedule-conservativeness 0.3 \
    --max-running-requests 1024
```

Node 3:

```shell
docker run --gpus all \
  -it \
  --rm \
  --name sglang_node_3 \
  -v /models/models--deepseek-ai--DeepSeek-R1:/root/deepseek-r1 \
  --env HF_ENDPOINT="https://hf-mirror.com" \
  --env "GLOO_SOCKET_IFNAME=ens12f0np0" \
  --env "NCCL_SOCKET_IFNAME=ens12f0np0" \
  --ipc=host \
  --network=host \
  --shm-size 32g \
  registry.cn-shanghai.aliyuncs.com/shengsuan/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path /root/deepseek-r1 \
    --dist-init-addr 10.0.251.137:50000 \
    --tp 32 \
    --nnodes 4 \
    --node-rank 2 \
    --trust-remote-code \
    --enable-cache-report \
    --log-requests \
    --context-length 64128 \
    --kv-cache-dtype auto \
    --schedule-conservativeness 0.3 \
    --max-running-requests 1024
```

Node 4:

```shell
docker run --gpus all \
  -it \
  --rm \
  --name sglang_node_4 \
  -v /data/deepseek-r1:/root/deepseek-r1 \
  --env HF_ENDPOINT="https://hf-mirror.com" \
  --env "GLOO_SOCKET_IFNAME=ens12f0np0" \
  --env "NCCL_SOCKET_IFNAME=ens12f0np0" \
  --ipc=host \
  --network=host \
  --shm-size 32g \
  registry.cn-shanghai.aliyuncs.com/shengsuan/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path /root/deepseek-r1 \
    --dist-init-addr 10.0.251.137:50000 \
    --tp 32 \
    --nnodes 4 \
    --node-rank 3 \
    --trust-remote-code \
    --enable-cache-report \
    --log-requests \
    --context-length 64128 \
    --kv-cache-dtype auto \
    --schedule-conservativeness 0.3 \
    --max-running-requests 1024
```

BoBo0037 avatar Feb 14 '25 08:02 BoBo0037

I know the H20 supports 8-bit weights, while the A100 does not (so it needs 16-bit weights). I just want to know how to avoid the `Weight output_partition_size = 576 is not divisible by weight quantization block_n = 128` error, because I want to use all 32 GPU cards. Thanks.

BoBo0037 avatar Feb 14 '25 09:02 BoBo0037

You want to use 4 × 8 H20 to run FP8 DeepSeek-V3? Block FP8 cannot support 32 GPUs, for exactly the reason behind the error you hit; the maximum TP size is 16.
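A quick way to see where that 16-GPU ceiling comes from, assuming the same config values as above (`intermediate_size = 18432`, `block_n = 128`): a TP size is usable only if every rank's output partition is a whole number of 128-wide quantization blocks.

```python
# Enumerate power-of-two TP sizes that keep each rank's partition
# block-aligned. Config values assumed from DeepSeek-R1/V3.
intermediate_size = 18432
block_n = 128

valid = [
    tp for tp in (1, 2, 4, 8, 16, 32, 64)
    if intermediate_size % tp == 0
    and (intermediate_size // tp) % block_n == 0
]
print(valid)  # -> [1, 2, 4, 8, 16]
```

At TP 32 the partition is 576 = 4.5 blocks, so 16 is the largest power-of-two TP that divides the weights cleanly.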

HandH1998 avatar Feb 14 '25 10:02 HandH1998

> You want to use 4 × 8 H20 to run FP8 DeepSeek-V3? Block FP8 cannot support 32 GPUs, for exactly the reason behind the error you hit; the maximum TP size is 16.

Thanks for your reply. Does the SGLang team plan to support running FP8 on 4 × 8 H20 in the future?

BoBo0037 avatar Feb 17 '25 02:02 BoBo0037

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] avatar Apr 19 '25 00:04 github-actions[bot]