[Bug] [ROCm] Running DeepSeek V3 on MI300X, getting "Config not found, Performance might be sub-optimal" error
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
I am running DeepSeek V3 on a node with 8x MI300X GPUs on ROCm 6.3.1. I can run it in Docker using an image built from Dockerfile.rocm, but I noticed this warning showing up:
Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at <multiple config files>
In the container built from Dockerfile.rocm, with SGLang v0.4.2, these are the missing config files:
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=24576,K=1536,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=16384,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=32768,K=512,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json
I also tried the Docker image lmsysorg/sglang:v0.4.1.post4-rocm620, based on this blog from AMD. This had SGLang v0.4.1, and was missing the following config files:
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=1536,K=7168,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=3072,K=1536,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=2048,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4608,K=7168,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=2304,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=512,K=7168,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=256,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json
/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=256,N=256,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json
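For reference, a quick way to check which tuned configs are present for your device name is to list the two config directories inside the container (paths as in the file lists above; adjust if your install location differs):
ls /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ | grep MI300X
ls /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/ | grep MI300X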
Reproduction
I launched the container with this command:
docker run -it --network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
--shm-size 16G \
-p 8080:8080 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
v0.4.2-rocm620:latest
And I ran the server with the command:
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 8080
Environment
ROCm 6.3.1, 8x MI300X
Docker images:
- v0.4.2-rocm620:latest, built from docker/Dockerfile.rocm and the build instructions in the SGLang docs
- lmsysorg/sglang:v0.4.1.post4-rocm620
Thanks for the AMD-related issue. We will find people to work on it and add docs for AMD machines.
This doesn't seem to happen anymore with lmsysorg/sglang:v0.4.2.post2-rocm630
It seems the MoE kernel config tuning work is now live, which is very exciting! I get a whopping 28 tokens/s with this newest release. However, the text output is completely garbled, and the model appears to load in full bf16 (with the same launch command as OP) instead of respecting the quantization config and loading fp8 w8a8. I think this is likely where the garbled text is coming from. See this output:
[2025-02-05 15:49:20 TP6] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=104.22 GB
[2025-02-05 15:49:20 TP7] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=104.24 GB
[2025-02-05 15:49:20 TP0] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=104.81 GB
[2025-02-05 15:49:20 TP2] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=103.97 GB
[2025-02-05 15:49:20 TP5] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=104.26 GB
[2025-02-05 15:49:20 TP1] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=103.98 GB
[2025-02-05 15:49:20 TP3] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=103.99 GB
[2025-02-05 15:49:20 TP4] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=104.09 GB
...
[2025-02-05 15:49:24 TP3] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=3072,K=1536,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
It's like the config still thinks it's fp8 (because it should be), but sglang loads the weights in bf16. I'm using ~84% of memory instead of the ~48% that non-upcast DeepSeek-V3/R1 weights should take. This might be worth a separate issue, but I wanted to jot it down in case I move on to something else. I'll troubleshoot a bit longer to find the cause of the incorrect dtype loading and open a new issue if I learn enough to know what's going on.
CC: @HaiShaw and @BruceXcluding in case y'all are still working on it.
Also, I'm available for testing on the MI300X for the foreseeable future if anyone at SGLang or AMD wants me to.
8x MI300X node here. Just built from the main branch; now also getting 28 TPS per request, with no garbled text and no bf16 dtype issue! 🐋 It decreases to 5 TPS with a full context window. Both of these numbers are >2x as fast as when I previously built against v0.4.2-rocm620. Amazing.
How I built the SGLang container (CC @AdjectiveAllison):
git clone -b main https://github.com/sgl-project/sglang.git
cd sglang
cd docker
docker build --build-arg SGL_BRANCH=main -t main -f Dockerfile.rocm .
I'm running with this docker-compose.yml:
services:
  sglang-server:
    image: main
    restart: always
    command: python3 -m sglang.launch_server --model-path /other-big-ssd/DeepSeek-R1 --served-model-name "DeepSeek-R1" --tp 8 --mem-fraction-static 0.5 --trust-remote-code --host 0.0.0.0 --port 8000
    network_mode: host
    privileged: true
    devices:
      - "/dev/kfd:/dev/kfd"
      - "/dev/dri:/dev/dri"
    ipc: host
    shm_size: 32G
    group_add:
      - "video"
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
    volumes:
      - "${HOME}/dockerx:/dockerx"
      - "/data:/data"
      - "/media/other-big-ssd:/other-big-ssd"
Appreciate the continued efforts of the SGLang and AMD teams!
Hello @c-mart @AdjectiveAllison and folks , I'm trying to deploy DeepSeek-R1 on MI300X as well, but I keep encountering this error. Have you experienced this too?
...
[2025-02-07 05:35:21 TP6] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /home/user/deepseek/sglang/python/sglang/srt/layers/quantization/configs/N=1536,K=7168,device_name=AMD_Instinct_MI300X_VF,dtype=fp8_w8a8,block_shape=[128, 128].json
....
File "/home/user/.conda/envs/py312/lib/python3.12/site-packages/triton/compiler/compiler.py", line 395, in __getattribute__
self._init_handles()
File "/home/user/.conda/envs/py312/lib/python3.12/site-packages/triton/compiler/compiler.py", line 388, in _init_handles
raise OutOfResources(self.metadata.shared, max_shared, "shared memory")
triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 212992, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.
This issue appears to be related: pytorch/issues/133254. It seems to involve block size configuration, but I can't find a way to set it. Do you have any insights on this?
I built and installed sglang from source (latest main branch). My command is:
python3 -m sglang.launch_server --host 0.0.0.0 --port 30000 \
--model-path ../DeepSeek-R1/ \
--tp 8 --trust-remote-code \
--quantization fp8 \
--mem-fraction-static 0.5 \
--context-length 4096 \
--served-model-name "DeepSeek-R1"
Envs:
pytorch-triton-rocm 3.2.0
torch 2.6.0+rocm6.1
sgl-kernel 0.0.3.post1
sglang 0.4.2.post2
$ apt show rocm-libs
Package: rocm-libs
Version: 6.1.0.60100-82~20.04
Can you show your Vendor Name from the rocminfo command on the local system (outside of Docker)?
I can't access the local system; this is from inside the container (2 CPU, 8 GPU; this is the last GPU):
Agent 10
Name: gfx942
Uuid: GPU-4faf39080fa52e05
Marketing Name: AMD Instinct MI300X VF
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Okay, it looks like you are using a virtual GPU (VF). You can rename all the config files from ...device_name=AMD_Instinct_MI300X....json to ...device_name=AMD_Instinct_MI300X_VF....json directly. I don't know whether your single GPU has the same 192 GB of memory as a physical one, since you had the out-of-resources error. You can check VRAM_TOTAL with the command amd-smi monitor to see whether you have enough memory; if not, you can change tp 8 to tp 32 or similar (I guess a single card may be cut into quarters).
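For anyone else hitting this on a VF instance, a rough shell sketch of that copy-and-rename (directory paths assumed from the log lines above; adjust to your checkout and double-check the filenames before running):
for d in /home/user/deepseek/sglang/python/sglang/srt/layers/quantization/configs \
         /home/user/deepseek/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs; do
  cd "$d"
  # copy each MI300X config to an identically named MI300X_VF variant
  for f in *device_name=AMD_Instinct_MI300X,*.json; do
    cp "$f" "${f/AMD_Instinct_MI300X/AMD_Instinct_MI300X_VF}"
  done
done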
Great! I've copied and renamed the configuration files, and the warnings are now resolved. The server is running, but it fails as soon as it receives a request like this:
curl -s http://localhost:30000/v1/chat/completions \
-d '{"model": "DeepSeek-R1", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
The log looked like this:
[2025-02-07 08:15:00 TP6] max_total_num_tokens=178403, chunked_prefill_size=8192, max_prefill_tokens=8192, max_running_requests=4097, context_len=4096
[2025-02-07 08:15:01] INFO: Started server process [668602]
[2025-02-07 08:15:01] INFO: Waiting for application startup.
[2025-02-07 08:15:01] INFO: Application startup complete.
[2025-02-07 08:15:01] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-02-07 08:15:02] INFO: 127.0.0.1:43826 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-07 08:15:02 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-07 08:15:04 TP1] Using configuration from /home/user/deepseek/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=AMD_Instinct_MI300X_VF,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
...
[2025-02-07 08:15:04 TP7] Using configuration from /home/user/deepseek/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=AMD_Instinct_MI300X_VF,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-02-07 08:17:14] INFO: 127.0.0.1:46620 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-07 08:22:31 TP2] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-07 08:22:31 TP0] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-07 08:22:31 TP7] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-07 08:22:31 TP3] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-07 08:22:31 TP5] Watchdog timeout (self.watchdog_timeout=300)
[2025-02-07 08:22:36] Received sigquit from a child proces. It usually means the child failed.
[2025-02-07 08:22:36] Received sigquit from a child proces. It usually means the child failed.
[2025-02-07 08:22:36] Received sigquit from a child proces. It usually means the child failed.
Hi @BruceXcluding, do you know where I can check the server logs for why it failed? I already set --log-level debug --log-level-http debug --log-requests, but with no luck.
@c-mart I also built with the v0.4.3-rocm630 version and am seeing a >2x improvement over the v0.4.2 version. The original "config not found" issue also seems to be resolved. Appreciate the work of the SGLang team!
Are the lmsysorg images built from the same process and Dockerfile, or a different one? We are getting reports from our customers that they see garbled text with the prebuilt images, but images built locally seem to work fine.
cc @HaiShaw
Thanks all for the updates. I too was able to get roughly double the performance after building a new container image from the main branch.
However, because I'm running on Azure, I also ran into the AMD_Instinct_MI300X -> AMD_Instinct_MI300X_VF file name issue. Would it be possible to create symlinks for those files in the relevant directories in the container or source? I believe they are all in /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ and /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/.
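Until something like that lands upstream, here is a minimal sketch of the workaround with symlinks inside a running container (config directory assumed from the paths above; untested on my side):
cd /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs
for f in *device_name=AMD_Instinct_MI300X,*.json; do
  # point an MI300X_VF-named config at the existing MI300X one
  ln -sf "$f" "${f/AMD_Instinct_MI300X/AMD_Instinct_MI300X_VF}"
done
# repeat for /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs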
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.