
SGLang is a fast serving framework for large language models and vision language models.

Results: 722 sglang issues (sorted by recently updated)

At the moment, whether the model [is multimodal](https://github.com/sgl-project/sglang/blob/b0b722ee8e90bfa2b379eadb1432e2f6852a6ad0/python/sglang/srt/managers/tokenizer_manager.py#L99) is decided based only on the model-path variable. This leads to an issue when the model path does not contain the right...

I use 4 A6000 GPUs to deploy Qwen1.5-72B-Chat. The command I use to start the server is `CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server --tp-size 4 --model-path Qwen1.5-72B-Chat --port 8991 --context-length 16000`. During inference, I encounter ```...

It appears that the benchmark plots are from a much older version of vLLM (more than 4 months old, https://github.com/vllm-project/vllm/releases/tag/v0.2.5). With the latest improvements (e.g., automatic prefix caching), the numbers...

Not sure if anyone else has hit this, but when using `liuhaotian/llava-v1.5-13b` with the `llava-hf/llava-1.5-13b-hf` tokenizer, I randomly get outputs consisting only of newlines. The frequency of this happening increases...

Version: sglang==0.1.14. Hardware: EC2 g5.xlarge. Hi, when using the following line:
```shell
python -m sglang.launch_server --model-path openchat/openchat-3.5-0106 --port 30000 --mem-fraction-static 0.8 --enable-flashinfer
```
I notice two problems when running the...

It seems that `sgl.gen(regex=)` doesn't accept Chinese characters. Error details:
```
Exception in ModelRpcClient: Traceback (most recent call last):
  File ".../sglang/python/sglang/srt/managers/router/model_rpc.py", line 175, in exposed_step
    self.handle_generate_request(recv_req)
  File ".../sglang/python/sglang/srt/managers/router/model_rpc.py", line 271,...
```
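For context, the failure does not reproduce at the plain-regex level: Python's `re` module handles CJK characters fine, which suggests the problem lies in converting the regex into a token-level constraint. A minimal sketch (the pattern below is a hypothetical example, not the one from the report):

```python
import re

# A regex constraining output to 2-4 Chinese (CJK Unified Ideographs)
# characters, of the kind one might pass to sgl.gen(regex=...).
# Plain Python `re` accepts this pattern without issue.
pattern = r"[\u4e00-\u9fff]{2,4}"

assert re.fullmatch(pattern, "你好") is not None   # Chinese text matches
assert re.fullmatch(pattern, "hello") is None      # Latin text is rejected
```

If this succeeds locally while `sgl.gen(regex=...)` raises, the bug is likely in the regex-to-FSM layer rather than in the pattern itself.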

### Description Encountered an ImportError when attempting to start a project using `triton-nightly` on a V100 GPU. The issue seems to stem from an inability to import `get_cuda_stream` from `triton.runtime.jit`...

v100

`python -m sglang.launch_server --model-path Mistral-7B-Instruct-v0.2/` fails with
```
router init state: Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
  File ".venv/lib/python3.9/site-packages/sglang/srt/managers/router/model_rpc.py", line 619,...
```

high priority

Thanks for the amazing project! I was wondering whether sglang supports multiple completions/samples for the same prompt, similar to the `num_return_sequences` parameter of HF generation. By looking...
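One workaround worth checking: the OpenAI-compatible server exposed by `sglang.launch_server` may honor the standard OpenAI `n` parameter for returning multiple samples per prompt (this is an assumption here, not confirmed by the report). A sketch of such a request, with a hypothetical local endpoint:

```python
import json

# Hypothetical request body for sglang's OpenAI-compatible
# /v1/completions endpoint, assuming it supports the OpenAI
# `n` parameter (not verified against this sglang version).
payload = {
    "model": "default",
    "prompt": "Write a haiku about rivers.",
    "max_tokens": 64,
    "temperature": 0.8,
    "n": 3,  # request 3 independent samples of the same prompt
}
body = json.dumps(payload)
# e.g. requests.post("http://localhost:30000/v1/completions",
#                    data=body,
#                    headers={"Content-Type": "application/json"})
```

If `n` is unsupported, the fallback is simply issuing the same prompt several times with `temperature > 0`; prefix caching should keep the repeated prompt cheap.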

Chat models like [codellama-instruct](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf/blob/main/tokenizer_config.json) and [qwen](https://modelscope.cn/models/qwen/Qwen1.5-14B-Chat/file/view/master?fileName=tokenizer_config.json&status=1) all have a `chat_template` field in their tokenizer_config.json that defines the model's chat template. But I notice that it seems sglang currently...

good first issue