
SGLang is a fast serving framework for large language models and vision language models.

Results: 722 sglang issues (sorted by recently updated)

At the moment, whether the model [is multimodal](https://github.com/sgl-project/sglang/blob/b0b722ee8e90bfa2b379eadb1432e2f6852a6ad0/python/sglang/srt/managers/tokenizer_manager.py#L99) is decided based only on the model-path variable. This leads to an issue when the model path does not contain the right...

I use 4 A6000 GPUs to deploy Qwen1.5-72B-Chat. The command I use to start the server is `CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server --tp-size 4 --model-path Qwen1.5-72B-Chat --port 8991 --context-length 16000`. During inference, I encounter ```...

It appears that the benchmark plots are from a much older version of vLLM (more than 4 months old, https://github.com/vllm-project/vllm/releases/tag/v0.2.5). With the latest improvements (e.g., automatic prefix caching), the numbers...

Not sure if anyone else has hit this, but when using `liuhaotian/llava-v1.5-13b` with the `llava-hf/llava-1.5-13b-hf` tokenizer, I randomly get outputs consisting only of newlines. The frequency of this happening increases...

Version: sglang==0.1.14. Hardware: EC2 g5.xlarge. Hi, when using the following line:
```shell
python -m sglang.launch_server --model-path openchat/openchat-3.5-0106 --port 30000 --mem-fraction-static 0.8 --enable-flashinfer
```
I notice two problems when running the...

It seems that `sgl.gen(regex=)` doesn't accept Chinese characters. Error details:
```
Exception in ModelRpcClient: Traceback (most recent call last):
  File ".../sglang/python/sglang/srt/managers/router/model_rpc.py", line 175, in exposed_step
    self.handle_generate_request(recv_req)
  File ".../sglang/python/sglang/srt/managers/router/model_rpc.py", line 271,...
```
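For context, the failure does not reproduce at the plain-regex level: Python's `re` module handles CJK characters fine, which suggests the problem lies in converting the regex into a token-level constraint. A minimal sketch (the pattern below is a hypothetical example, not the one from the report):

```python
import re

# A regex constraining output to 2-4 Chinese (CJK Unified Ideographs)
# characters, of the kind one might pass to sgl.gen(regex=...).
# Plain Python `re` accepts this pattern without issue.
pattern = r"[\u4e00-\u9fff]{2,4}"

assert re.fullmatch(pattern, "你好") is not None   # Chinese text matches
assert re.fullmatch(pattern, "hello") is None      # Latin text is rejected
```

If this succeeds locally while `sgl.gen(regex=...)` raises, the bug is likely in the regex-to-FSM layer rather than in the pattern itself.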

### Description Encountered an ImportError when attempting to start a project using `triton-nightly` on a V100 GPU. The issue seems to stem from an inability to import `get_cuda_stream` from `triton.runtime.jit`...

v100

`python -m sglang.launch_server --model-path Mistral-7B-Instruct-v0.2/` fails with
```
router init state: Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
  File ".venv/lib/python3.9/site-packages/sglang/srt/managers/router/model_rpc.py", line 619,...
```

high priority

Thanks for the amazing project! I was wondering whether sglang supports multiple completions/samples for the same prompt, similar to the `num_return_sequences` parameter of HF generation. By looking...
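One workaround worth checking: the OpenAI-compatible server exposed by `sglang.launch_server` may honor the standard OpenAI `n` parameter for returning multiple samples per prompt (this is an assumption here, not confirmed by the report). A sketch of such a request, with a hypothetical local endpoint:

```python
import json

# Hypothetical request body for sglang's OpenAI-compatible
# /v1/completions endpoint, assuming it supports the OpenAI
# `n` parameter (not verified against this sglang version).
payload = {
    "model": "default",
    "prompt": "Write a haiku about rivers.",
    "max_tokens": 64,
    "temperature": 0.8,
    "n": 3,  # request 3 independent samples of the same prompt
}
body = json.dumps(payload)
# e.g. requests.post("http://localhost:30000/v1/completions",
#                    data=body,
#                    headers={"Content-Type": "application/json"})
```

If `n` is unsupported, the fallback is simply issuing the same prompt several times with `temperature > 0`; prefix caching should keep the repeated prompt cheap.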

Chat models like [codellama-instruct](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf/blob/main/tokenizer_config.json) and [qwen](https://modelscope.cn/models/qwen/Qwen1.5-14B-Chat/file/view/master?fileName=tokenizer_config.json&status=1) all have a `chat_template` field in their tokenizer_config.json that defines the model's chat template. But I notice that it seems sglang currently...

good first issue