
[Bug] DeepSeek inference on H20 hangs when flashinfer MLA is enabled

Open ProphetPeng opened this issue 11 months ago • 6 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • [ ] 5. Please use English, otherwise it will be closed.

Describe the bug

Using sglang (v0.4.3) to serve DeepSeek-R1 on 8x H20 with flashinfer MLA enabled, the server hangs while flashinfer is loading its JIT ops.

Reproduction

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --trust-remote-code --enable-flashinfer-mla --disable-radix-cache --tp 8

Environment

8x H20 GPUs

ProphetPeng avatar Feb 14 '25 09:02 ProphetPeng

Try to install the latest version of flashinfer and remove ~/.cache/flashinfer.

ispobock avatar Feb 14 '25 09:02 ispobock
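For reference, a minimal sketch of that cleanup. The package name is an assumption: recent flashinfer releases are published on PyPI as flashinfer-python, while older ones installed as flashinfer from the project's wheel index, so pick the build matching your CUDA/torch versions.

pip install -U flashinfer-python   # upgrade flashinfer (package name varies by release)
rm -rf ~/.cache/flashinfer         # drop cached JIT-compiled ops that can cause the load hang

The JIT ops are recompiled on the next launch, so expect a slower first startup after clearing the cache.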

ref https://github.com/flashinfer-ai/flashinfer/issues/825#issuecomment-2658773255

zhyncs avatar Feb 14 '25 09:02 zhyncs

@ProphetPeng Hi, is there any update? I'm hitting the same problem (sglang 0.4.3 + flashinfer 0.2.1.post1). @ispobock Hi, I tried reinstalling flashinfer 0.2.1.post1, but the problem still exists.

ICENacl avatar Feb 15 '25 08:02 ICENacl

Try to install the latest version of flashinfer and remove ~/.cache/flashinfer.

Thanks, it works for me. However, it's slower than the Triton kernel on H20 for short contexts.

ProphetPeng avatar Feb 15 '25 10:02 ProphetPeng
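For a side-by-side comparison, the same model can be launched with and without the flag. As the comment above implies, omitting --enable-flashinfer-mla in v0.4.3 keeps the default Triton MLA kernels (a sketch based on the repro command):

# Triton MLA kernels (default):
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --trust-remote-code --tp 8

# FlashInfer MLA:
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --trust-remote-code --enable-flashinfer-mla --disable-radix-cache --tp 8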

Does flashinfer have better performance?

YangZeyu95 avatar Feb 16 '25 03:02 YangZeyu95

Does flashinfer have better performance?

Because the radix cache is disabled in the current version, performance is reduced for typical inputs and outputs. Flashinfer is effective for long-context input scenarios and improves throughput there. @YangZeyu95

lambert0312 avatar Feb 17 '25 00:02 lambert0312
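To verify the long-context claim, a throughput benchmark with long prompts is the relevant test. Below is a sketch using sglang's bundled serving benchmark; the random-dataset flag names are assumptions and may differ across versions, so check python3 -m sglang.bench_serving --help:

# ~8k-token prompts with short outputs, to stress long-context prefill
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 8192 --random-output 256 --num-prompts 100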

@lambert0312 @ProphetPeng Please try pulling the latest main branch; --enable-flashinfer-mla and the radix cache can now be used together.

Fridge003 avatar Feb 26 '25 20:02 Fridge003
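On a build from the current main branch, the --disable-radix-cache flag from the original repro command can therefore be dropped (a sketch, assuming the same 8x H20 setup):

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --trust-remote-code --enable-flashinfer-mla --tp 8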

@lambert0312 @ProphetPeng Please try pulling the latest main branch; --enable-flashinfer-mla and the radix cache can now be used together.

@Fridge003 I've verified it; no problems.

lambert0312 avatar Mar 02 '25 09:03 lambert0312