sglang
SGLang is a fast serving framework for large language models and vision language models.
Colab?
Awesome project. We have a paper https://arxiv.org/abs/2310.14034 with really complicated KV caching that I would love to go back and implement in SGLang. I tried to get an example working...
Is there any way to truncate text based on tokens? I really like that, as a user, I don't need to think about tokens. But to save memory I would like...
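Token-based truncation (as opposed to character-based) can be sketched as: encode the text, keep the first `max_tokens` tokens, and decode back. This is not an sglang API; a toy whitespace tokenizer stands in here for a real one (e.g. a HuggingFace `AutoTokenizer`, where `tokenizer(text, truncation=True, max_length=...)` does the same job).

```python
# Minimal sketch of truncating text by token count rather than characters.
# A toy whitespace tokenizer stands in for a real subword tokenizer.

def tokenize(text: str) -> list[str]:
    return text.split()

def detokenize(tokens: list[str]) -> str:
    return " ".join(tokens)

def truncate_by_tokens(text: str, max_tokens: int) -> str:
    """Keep at most `max_tokens` tokens of `text`."""
    tokens = tokenize(text)
    if len(tokens) <= max_tokens:
        return text
    return detokenize(tokens[:max_tokens])

print(truncate_by_tokens("one two three four five", 3))  # one two three
```

With a real subword tokenizer, decoding the truncated token IDs can differ slightly from a character-level cut (e.g. mid-word tokens), which is exactly why truncating by tokens is the right unit for controlling memory.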
Resolve #29
Hello, curious if we can already use sglang as a backend for NVIDIA's Triton Server. Amazing work with the library btw, love it!
Hey, when is support for the Metal backend planned?
ExLlamaV2 is an excellent quantization method that would allow big models to run on consumer GPUs (~24 GB) thanks to its fractional-bit quantization. Would this be in the cards?
Hi team, I am using `sglang` with a local finetuned model (`basemodel_id = cognitivecomputations/dolphin-2.2.1-mistral-7b`) and running inference in a for loop. GPU: 4090, batch_sz = 1, tokens_in ~ 2000, tokens_out ~ 200 ```...
Import `outlines` instead of copying its code.