
SGLang is a fast serving framework for large language models and vision language models.

Results: 722 sglang issues

When I ask the model about an image, behaviour seems to break when I use the _choices_ functionality: it always seems to suggest the first option. I think...
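One plausible source of an "always picks the first option" bias in a choices-style selector is unnormalized scoring: summing per-token log-probabilities penalizes longer options. The sketch below is an illustration of that failure mode only, not sglang's actual implementation; all names and the log-prob values are hypothetical.

```python
def pick_choice(options, token_logprobs, normalize=True):
    """Pick the option with the best log-prob score.

    With normalize=False (raw sums), longer options are penalized
    simply for having more tokens, which can bias selection toward
    short or early options. Averaging per token removes that bias.
    """
    if normalize:
        scores = [sum(lps) / len(lps) for lps in token_logprobs]
    else:
        scores = [sum(lps) for lps in token_logprobs]
    return options[max(range(len(options)), key=scores.__getitem__)]

options = ["cat", "dog"]
# Hypothetical per-token log-probs: "dog" is more likely per token
# but tokenizes into more tokens than "cat".
logprobs = [[-1.0], [-0.5, -0.5, -0.5]]
print(pick_choice(options, logprobs, normalize=False))  # -> cat (biased)
print(pick_choice(options, logprobs, normalize=True))   # -> dog
```

If the real selector behaves like the unnormalized branch, checking how option lengths interact with the scoring would be a reasonable first debugging step.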

I am unable to create a Runtime with sglang as follows `runtime = sgl.Runtime(model_path=MODEL_DIR, tokenizer_path=MODEL_DIR)`. It throws the error below: ```python ImportError: cannot import name '_set_default_torch_dtype' from 'vllm.model_executor.model_loader' (/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py) ```...

There have been a number of times where a float field gets an unlimited number of zeros generated. Any idea what could be the cause? I am thinking...
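One common mitigation for runaway digits is to bound the pattern a constrained-generation backend is allowed to match. The digit limits below are illustrative assumptions, not sglang defaults; this is a plain-regex sketch of the idea, not a fix confirmed by the project.

```python
import re

# A bounded float pattern one might hand to a regex-guided decoder:
# at most 10 integer digits and 6 fractional digits, so a stream of
# trailing zeros cannot continue indefinitely.
FLOAT_RE = re.compile(r"[0-9]{1,10}\.[0-9]{1,6}")

def is_valid_float(text: str) -> bool:
    """True if `text` fits the bounded float pattern exactly."""
    return FLOAT_RE.fullmatch(text) is not None

print(is_valid_float("3.14"))          # bounded decimal: accepted
print(is_valid_float("1." + "0" * 50)) # runaway zeros: rejected
```

An unbounded pattern like `[0-9]+\.[0-9]+` would accept the runaway case, which is why the repetition counts matter.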

I just use this command to start the server `CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path LLMs/Qwen-14B-Chat --port 30000 --trust-remote-code --stream-interval 1 --enable-flashinfer --schedule-conservativeness 50` and use the following code to test...

How does the radix-attention function call need to be modified in sglang for a model implemented in vLLM, where paged attention takes care of the multi-query and grouped-query architectures?
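For context on the question above: the core idea behind RadixAttention is that requests sharing a token prefix can reuse the same cached KV entries, tracked in a radix/prefix tree. The toy sketch below illustrates only that prefix-matching idea; it is not sglang's implementation, and the node layout is an assumption for illustration.

```python
class RadixNode:
    """One node per token; `cached` stands in for a cached KV block."""
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.cached = False

def insert(root, tokens):
    """Record that KV for this token sequence is now cached."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, RadixNode())
        node.cached = True

def match_prefix(root, tokens):
    """Length of the longest cached prefix of `tokens`."""
    node, n = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        n += 1
    return n

root = RadixNode()
insert(root, [1, 2, 3, 4])               # first request fills the cache
print(match_prefix(root, [1, 2, 3, 9]))  # -> 3 tokens reusable
```

In a paged-attention backend, the matched prefix length would translate into which KV pages can be shared rather than recomputed.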

For a lot of use cases, there is already a pre-defined system + base prompt that is used. Can we define the KV cache for these prompts up front manually?...
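On the idea of defining the cache up front: a prefix cache can be "warmed" by running the fixed system + base prompt once, so later requests hit the cached prefix. The sketch below uses hypothetical helper names and a dict as a stand-in for KV state; it is a conceptual illustration, not an sglang API.

```python
cache = {}  # prompt prefix -> simulated KV state

def prefill(prompt):
    """Stand-in for computing and storing KV for a fixed prompt."""
    cache[prompt] = f"kv({len(prompt)} chars)"

def generate(prompt):
    """Reuse the longest cached prefix of `prompt` if one exists."""
    for prefix in sorted(cache, key=len, reverse=True):
        if prompt.startswith(prefix):
            return f"reused {prefix!r}"
    return "computed from scratch"

prefill("SYSTEM: be terse.\n")                  # warm-up pass
print(generate("SYSTEM: be terse.\nUSER: hi"))  # hits the warmed prefix
print(generate("unrelated prompt"))             # -> computed from scratch
```

With radix-style caching, the same effect can also happen implicitly after the first real request that uses the shared prompt.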

I loaded Llava v1.6 34B on my server ``` export DISABLE_NEST_ASYNCIO=True model=liuhaotian/llava-v1.6-34b tokenizer=liuhaotian/llava-v1.6-34b-tokenizer CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --model-path $model --tokenizer-path $tokenizer --port 30813 --tp 2 ``` It works when I...

Hi folks, where is the code for benchmarking time to first token? I only see the average latency :) Thanks,
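In case it helps, time to first token is usually measured as the wall-clock delay until the first item arrives from a streaming response. The sketch below times a simulated token stream; the stream and its delay are stand-ins, not sglang's benchmark code.

```python
import time

def time_to_first_token(stream):
    """Return (first_token, seconds) for a streaming generator."""
    start = time.perf_counter()
    first = next(stream)
    return first, time.perf_counter() - start

def fake_stream():
    # Simulated token stream with a small prefill delay.
    time.sleep(0.05)
    yield "Hello"
    yield " world"

tok, ttft = time_to_first_token(fake_stream())
print(tok, round(ttft, 3))
```

Against a real server, `fake_stream()` would be replaced by the streamed response iterator, with the timer started just before the request is sent.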

After doing QLoRA with a training library (unsloth) and saving the adapter, is there a way to load the 4-bit bnb model and the un-merged adapter for use with...

Can it support the InternVL multimodal large model, which currently ranks first in the open-source MMMU ranking? [https://github.com/OpenGVLab/InternVL/](https://github.com/OpenGVLab/InternVL/) ![WX20240324-102942@2x](https://github.com/sgl-project/sglang/assets/4583537/2416f85d-5231-4d8c-9255-b598385e6eaa) [MMMU](https://mmmu-benchmark.github.io)