
[Bug] Segmentation Fault on AMD MI300X

Open kimbochen opened this issue 7 months ago • 3 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [x] 5. Please use English, otherwise it will be closed.

Describe the bug

I am unable to run online benchmarks on the new AMD SGLang Docker image due to a segmentation fault error.

  • Hardware: MI300X
  • Docker image: rocm/sgl-dev:upstream_20250324

I am attaching the error log as a file because it exceeds the comment character limit.

segfault_error.log

Reproduction

docker network create bmk-net

docker run --rm -d --network bmk-net --ipc host --name bmk-server \
    --privileged --cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
    --group-add render --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v "$PWD/.hf_cache/":/root/hf_cache/ -v "$PWD/.inductor_cache/":/tmp/torchinductor_root/ \
    -e HF_HUB_CACHE=/root/hf_cache/ -e HF_TOKEN="$(cat hf_token.txt)" \
    rocm/sgl-dev:upstream_20250324 \
    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --host 0.0.0.0 --port 8000 --tp 8 --trust-remote-code \
        --chunked-prefill-size 131072

printf "RESULT_FILENAME=%s
" "dsv3_tp8_isl1024_osl4096_c256"
while ! docker logs bmk-server 2>&1 | grep -q "The server is fired up and ready to roll!"; do
    sleep 1
done

docker run --rm -t --network bmk-net --name bmk-client \
    --privileged --cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
    --group-add render --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $PWD:/workspace/ -w /workspace/vllm/benchmarks/ -e HF_TOKEN=$(cat hf_token.txt) \
    rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250410 \
    python benchmark_serving.py \
        --model deepseek-ai/DeepSeek-V3 --backend vllm --base-url "http://bmk-server:8000" \
        --dataset-name "random" --random-input-len 1024 --random-output-len 4096 --random-prefix-len 0 \
        --num-prompts $(( 256 * 10 )) --max-concurrency 256 --request-rate "inf" --ignore-eos \
        --save-result --result-dir "/workspace/results/" --result-filename "dsv3_tp8_isl1024_osl4096_c256.json" \
        --percentile-metrics "ttft,tpot,itl,e2el"

docker stop bmk-server; docker network rm bmk-net
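
As a side note, here is a minimal sketch of two optional additions to the script above, assuming curl is available inside the rocm/sgl-dev image and that the SGLang server serves a GET /health endpoint on port 8000: a readiness check that polls the health endpoint instead of grepping the container log, and a log dump (run before the docker stop above) that preserves the segfault backtrace.

# Optional readiness check (assumes curl exists in the image and /health is served):
until docker exec bmk-server curl -sf http://localhost:8000/health > /dev/null; do
    sleep 5
done

# Save the full server log (stdout + stderr), including the segfault backtrace,
# before tearing the container down:
docker logs bmk-server > segfault_error.log 2>&1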

Environment

Python: 3.12.8 (main, Dec  4 2024, 08:54:12) [GCC 11.4.0]
ROCM available: True
GPU 0,1,2,3,4,5,6,7: AMD Instinct MI300X
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.4
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 6.3.42131-fa1d09cbd
ROCM Driver Version: 6.10.5
PyTorch: 2.6.0a0+git8d4926e
sgl_kernel: 0.0.5.post3
flashinfer: Module Not Found
triton: 3.2.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.7.dev2+g113274a0
openai: 1.61.1
anthropic: 0.45.2
decord: 0.6.0
AMD Topology:


============================ ROCm System Management Interface ============================
=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7
GPU0   0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI
GPU1   XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI
GPU2   XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI
GPU3   XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI
GPU4   XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI
GPU5   XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI
GPU6   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI
GPU7   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0
================================== End of ROCm SMI Log ===================================

ulimit soft: 1048576

kimbochen commented May 02 '25 20:05

@kimbochen, would you please attach the server launch and client commands?

HaiShaw commented May 06 '25 22:05

Hello @HaiShaw, thanks for taking a look at this.
The Reproduction section contains the full commands. Additionally, clone the vLLM repo and create a results/ folder:

git clone https://github.com/ROCm/vllm
mkdir results

kimbochen commented May 07 '25 14:05

@kimbochen Please use rocm/sgl-dev:upstream_20250422; we will merge and sync the images up to lmsys by the end of this month. This issue seems to be related to an intermittent AITER issue.
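
For reference, a minimal sketch of the relaunch with the suggested image, assuming all other flags from the Reproduction section carry over unchanged:

# Same server launch as in the Reproduction section, with only the image tag
# swapped to the suggested build:
docker run --rm -d --network bmk-net --ipc host --name bmk-server \
    --privileged --cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
    --group-add render --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v "$PWD/.hf_cache/":/root/hf_cache/ -v "$PWD/.inductor_cache/":/tmp/torchinductor_root/ \
    -e HF_HUB_CACHE=/root/hf_cache/ -e HF_TOKEN="$(cat hf_token.txt)" \
    rocm/sgl-dev:upstream_20250422 \
    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --host 0.0.0.0 --port 8000 --tp 8 --trust-remote-code \
        --chunked-prefill-size 131072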

HaiShaw commented May 11 '25 21:05

This issue seems to be outdated.

HaiShaw commented Jun 12 '25 05:06