lullabies777
> Why vLLM `--enable-prefix-caching` but SGLang `--disable-radix-cache`?

I added it because I noticed the comment ["Disable RadixAttention for prefix caching."](https://github.com/sgl-project/sglang/blob/55f5976b42d736f3dfe2f8f9b91a6536c212744a/python/sglang/srt/server_args.py) in the SGLang server args. I experimented with both options, but the performance remained nearly...
> ```
> # H100 SXM 80G
> # both disable prefix cache
>
> {"backend": "sglang", "dataset_name": "sharegpt", "request_rate": 1, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 53379, "mean_e2e_latency_ms": 1060.0129053120813, "median_e2e_latency_ms": ...
> ```
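For reference, this is roughly how the two flags end up on the launch commands; a sketch only, assuming the OpenAI-compatible server entrypoints on both sides, and the model name and ports are placeholders rather than the exact benchmark setup:

```bash
# vLLM: prefix caching is opt-in, turned on with --enable-prefix-caching
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2-7B-Instruct \
  --port 8000 \
  --enable-prefix-caching

# SGLang: RadixAttention prefix caching is on by default,
# so the flag goes the other way and turns it off
python -m sglang.launch_server \
  --model-path Qwen/Qwen2-7B-Instruct \
  --port 30000 \
  --disable-radix-cache
```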
> Qwen2 doesn't really support vLLM 0.6's native tool call feature, but I'm working on getting the next version, Qwen2.5, to support it (hopefully that goes smoothly).
>
> That said, this shouldn't affect starting the vLLM server normally (just don't enable vLLM's tool call feature). See https://github.com/QwenLM/Qwen2?tab=readme-ov-file#vllm for how to launch vLLM. (If the launch command is fine but you still see an impact, try upgrading to the latest vLLM or downgrading to 0.5.x.)

I downgraded vLLM to 0.5.5, but I still can't seem to reproduce the tutorial at https://modelscope-agent.readthedocs.io/en/stable/llms/qwen2_tool_calling.html. Do I need to downgrade further? If I want to use auto tool-choice, is downgrading vLLM enough, or is auto tool-choice simply unavailable with a local vLLM deployment? Does SGLang support auto tool-choice? Thanks for the reply!
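For context, the kind of request I'm trying against the local server looks roughly like this; just a sketch, where `localhost:8000`, the model name, and the `get_weather` tool are placeholders and not the exact setup from the tutorial:

```bash
# Hypothetical chat completion request with tool_choice "auto"
# against a local OpenAI-compatible vLLM/SGLang server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2-7B-Instruct",
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```

This is the shape of request where I'd expect `"tool_choice": "auto"` to take effect, so my question is mainly whether that path is supported at all with a locally deployed Qwen2.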