sglang
SGLang is a fast serving framework for large language models and vision language models.
Colab?
Awesome project. We have a paper https://arxiv.org/abs/2310.14034 with really complicated KV caching that I would love to go back and implement in SGLang. I tried to get an example working...
Is there any way to truncate text based on tokens? I really like that, as a user, I don't need to think about tokens. But to save memory I would like...
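Token-based truncation (as opposed to character-based) can be sketched as: encode the text, keep the first `max_tokens` tokens, and decode back. This is not an sglang API; a toy whitespace tokenizer stands in here for a real one (e.g. a HuggingFace `AutoTokenizer`, where `tokenizer(text, truncation=True, max_length=...)` does the same job).

```python
# Minimal sketch of truncating text by token count rather than characters.
# A toy whitespace tokenizer stands in for a real subword tokenizer.

def tokenize(text: str) -> list[str]:
    return text.split()

def detokenize(tokens: list[str]) -> str:
    return " ".join(tokens)

def truncate_by_tokens(text: str, max_tokens: int) -> str:
    """Keep at most `max_tokens` tokens of `text`."""
    tokens = tokenize(text)
    if len(tokens) <= max_tokens:
        return text
    return detokenize(tokens[:max_tokens])

print(truncate_by_tokens("one two three four five", 3))  # one two three
```

With a real subword tokenizer, decoding the truncated token IDs can differ slightly from a character-level cut (e.g. mid-word tokens), which is exactly why truncating by tokens is the right unit for controlling memory.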
Resolve #29
Hello, curious if we can already use sglang as a backend for NVIDIA's Triton Server. Amazing work with the library btw, love it!
Hey, when is support for the Metal backend planned?
ExLlamaV2 is an excellent quantization method that would allow big models to run on consumer GPUs (~24 GB) thanks to its fractional-bit quantization. Would this be in the cards?
Hi team, I am using `sglang` with a local finetuned model (`basemodel_id = cognitivecomputations/dolphin-2.2.1-mistral-7b`) and running inference in a for loop. GPU: 4090, batch_sz = 1, tokens_in ~ 2000, tokens_out ~ 200 ```...
Import `outlines` instead of copying its code.