Baizhou Zhang

Results: 79 comments of Baizhou Zhang

But will this block the usage of other FP8 kernels, like the CUTLASS one?

> @Fridge003 No, it still works, you can still enable it through `CUTLASS_BLOCK_FP8_SUPPORTED` (like before) manually

But when I only add the flag `CUTLASS_BLOCK_FP8_SUPPORTED`, the FlashInfer GEMM will also be enabled....

Hi @yinjiaoyuan, could you please paste the `adapter_config.json` of the LoRA adapter you use, so we can look into this bug more easily? If your adapter only contains modules for gate...
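If it helps, here is a quick way to dump the fields we need. The adapter path below is just a placeholder, and the field names follow the usual PEFT `adapter_config.json` layout:

```python
import json

# Placeholder path: point this at the LoRA adapter directory you pass to sglang.
with open("/path/to/lora_adapter/adapter_config.json") as f:
    cfg = json.load(f)

# The fields relevant to this bug: which modules the adapter patches
# (e.g. gate_proj/up_proj/down_proj vs. q_proj/k_proj/v_proj) and the LoRA rank.
print(cfg.get("target_modules"), cfg.get("r"), cfg.get("lora_alpha"))
```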

Hi @yinjiaoyuan, could you please pull the latest main branch and try again? #3652 might have fixed this bug. Please let me know if the bug still occurs...

@yinjiaoyuan Thanks for reporting. Could you please describe the command you use to reproduce this bug? Also, what are the lengths of the prompts you use?

@yinjiaoyuan Thanks, we will look into this bug.

@yinjiaoyuan Sorry, we failed to reproduce your bug. Could you please print the input shapes of the Triton kernel that triggers the error? This will help a lot, thanks.

@yinjiaoyuan, please insert `print(x.shape, qkv_lora_a.shape, self.batch_info)` at line 52, before the call to `sgemm_lora_a_fwd`, in the file `python/sglang/srt/lora/backend/triton_backend.py`, as in the figure, and please paste the printed output.
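For reference, a minimal sketch of where that debug line goes. The method name and the exact `sgemm_lora_a_fwd` call below are assumptions about the surrounding code; only the `print` line is the requested change:

```python
# Sketch of python/sglang/srt/lora/backend/triton_backend.py around line 52
# (method name and kernel call are assumed; adapt to the actual file contents).
def run_qkv_lora(self, x, qkv_lora_a, qkv_lora_b, *args, **kwargs):
    # Requested debug output: operand shapes plus the LoRA batch metadata,
    # printed right before the Triton kernel that triggers the error.
    print(x.shape, qkv_lora_a.shape, self.batch_info)
    lora_a_output = sgemm_lora_a_fwd(x, qkv_lora_a, self.batch_info)
    # ... rest of the method unchanged ...
    return lora_a_output
```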

The issue seems to lie in an overflow in the detokenizer. Maybe you can remove `--dtype float16` from the command, pull the latest branch of sglang, and try again?