flashinfer
I loaded the qwen2_5_insturcut_7b model and found that request responses are slower when it is deployed with sglang than when it is deployed with vllm. While troubleshooting, I added print("mutates_args", mutates_args, "schema_str", schema_str). Is the result reasonable?
Sorry, I don't quite understand the question.
because the deployment request response using sglang is slower than that of vllm deployment.
Do you mean warmup time, or evaluation metrics such as ITL (inter-token latency) or TTFT (time to first token)?
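For reference, here is a minimal sketch of how the two metrics could be computed from per-token arrival timestamps collected on the client side. `latency_metrics` and its arguments are hypothetical names used only for illustration; they are not part of sglang, vllm, or flashinfer.

```python
from statistics import mean


def latency_metrics(request_start, token_times):
    """Compute (ttft, mean_itl) in seconds for one streamed request.

    request_start: timestamp when the request was sent (hypothetical input).
    token_times:   timestamps at which each output token arrived, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")

    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_start

    # ITL: gaps between consecutive tokens after the first one.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = mean(gaps) if gaps else 0.0

    return ttft, mean_itl


if __name__ == "__main__":
    ttft, itl = latency_metrics(0.0, [0.5, 0.75, 1.0])
    print(f"TTFT={ttft:.3f}s mean ITL={itl:.3f}s")
```

Reporting these two numbers separately for each backend would show whether the slowdown is in prefill (TTFT) or in decoding (ITL).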
When troubleshooting, print("mutates_args",mutates_args,"schema_str",schema_str) is this result reasonable?
What's the purpose of printing these arguments?
Old issue that needs more input from @qingzhong1, closing for now. Feel free to re-open if this is still an issue.