Baizhou Zhang

Results: 79 comments of Baizhou Zhang

But will this block the usage of other FP8 kernels, like the CUTLASS one?

> @Fridge003 No, it still works, you can still enable it through `CUTLASS_BLOCK_FP8_SUPPORTED` (like before) manually

But when I only add the flag `CUTLASS_BLOCK_FP8_SUPPORTED`, the FlashInfer GEMM will also be enabled....

Hi @yinjiaoyuan, could you please paste the `adapter_config.json` of the LoRA adapter you use, so we can look into this bug more easily? If your adapter only contains modules for gate...
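If it helps, here is a quick way to dump the fields we need. The adapter path below is just a placeholder, and the field names follow the usual PEFT `adapter_config.json` layout:

```python
import json

# Placeholder path: point this at the LoRA adapter directory you pass to sglang.
with open("/path/to/lora_adapter/adapter_config.json") as f:
    cfg = json.load(f)

# The fields relevant to this bug: which modules the adapter patches
# (e.g. gate_proj/up_proj/down_proj vs. q_proj/k_proj/v_proj) and the LoRA rank.
print(cfg.get("target_modules"), cfg.get("r"), cfg.get("lora_alpha"))
```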

Hi @yinjiaoyuan, could you please pull the latest main branch and try again? #3652 might have fixed this bug. Please let me know if the bug still occurs...

@yinjiaoyuan Thanks for reporting. Could you please describe the command you use to reproduce this bug? Also, what are the lengths of the prompts you use?

@yinjiaoyuan Thanks, we will look into this bug.

@yinjiaoyuan Sorry, we failed to reproduce your bug. Could you please print the input shapes of the Triton kernel that triggers the error? This will help a lot, thanks.

@yinjiaoyuan, please insert `print(x.shape, qkv_lora_a.shape, self.batch_info)` at line 52, before the call to `sgemm_lora_a_fwd`, in the file `python/sglang/srt/lora/backend/triton_backend.py`, as in the figure, and please paste the printed output.
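For reference, a minimal sketch of where that debug line goes. The method name and the exact `sgemm_lora_a_fwd` call below are assumptions about the surrounding code; only the `print` line is the requested change:

```python
# Sketch of python/sglang/srt/lora/backend/triton_backend.py around line 52
# (method name and kernel call are assumed; adapt to the actual file contents).
def run_qkv_lora(self, x, qkv_lora_a, qkv_lora_b, *args, **kwargs):
    # Requested debug output: operand shapes plus the LoRA batch metadata,
    # printed right before the Triton kernel that triggers the error.
    print(x.shape, qkv_lora_a.shape, self.batch_info)
    lora_a_output = sgemm_lora_a_fwd(x, qkv_lora_a, self.batch_info)
    # ... rest of the method unchanged ...
    return lora_a_output
```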

The issue seems to lie in an overflow in the detokenizer. Maybe you can remove `--dtype float16` from the command, pull the latest branch of sglang, and try again?