Liangsheng Yin
There are three types of memory in SGLang: 1. memory for model weights; 2. memory for the KV cache; 3. temporary memory for intermediate computation results. The answer to your question...
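To make the KV-cache portion concrete, here is a rough back-of-the-envelope estimate of how its memory scales with model shape and context length. This is a hedged sketch of the standard formula (key + value per layer per token), not SGLang's internal accounting, and the Llama-7B-like numbers below are just an example configuration.

```python
# Hedged sketch: rough KV-cache size estimate, not SGLang's internal accounting.
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Each token stores one key and one value vector (factor 2) in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Example: a Llama-7B-like config (32 layers, 32 KV heads, head_dim 128, fp16).
per_token = kv_cache_bytes_per_token(32, 32, 128, 2)
print(per_token)          # 524288 bytes (512 KiB) per token
print(per_token * 4096)   # 2147483648 bytes (2 GiB) for a 4096-token context
```

This is why the KV cache, not the weights, usually dominates memory at long context lengths or high batch sizes.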
@alessiodallapiazza You are welcome to submit a PR to add this feature.
It's due to `outlines` API changes; please downgrade `outlines`.
@for-just-we Could you please check whether there is a KV cache leak during inference, or some other runtime error?
@Ja1Zhou Of course, this logic can be implemented without recursion. I'm not sure there would ever be that many nodes on a single path in the radix tree; it's very...
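For illustration, the recursive walk down a radix-tree path can be turned into an explicit loop, so path depth never approaches Python's recursion limit. This is a hedged, simplified sketch: `Node`, `insert`, and `match_prefix` are hypothetical stand-ins, not SGLang's actual radix-tree classes, and edge splitting on partial matches is omitted for brevity.

```python
# Hedged sketch: iterative (non-recursive) radix-tree path traversal.
# `Node` is a hypothetical stand-in for SGLang's radix tree node.
class Node:
    def __init__(self, key=()):
        self.key = tuple(key)   # token-id segment stored on the edge into this node
        self.children = {}      # first token id of child segment -> child Node

def insert(root, tokens):
    """Insert a token sequence iteratively (edge splitting omitted for brevity)."""
    node, i = root, 0
    while i < len(tokens):
        child = node.children.get(tokens[i])
        if child is None:
            node.children[tokens[i]] = Node(tokens[i:])
            return
        i += len(child.key)     # assume the full segment matches in this sketch
        node = child

def match_prefix(root, tokens):
    """Return how many leading tokens are already cached, using a loop."""
    node, i = root, 0
    while i < len(tokens):
        child = node.children.get(tokens[i])
        if child is None or tuple(tokens[i:i + len(child.key)]) != child.key:
            break
        i += len(child.key)
        node = child
    return i

root = Node()
insert(root, [1, 2, 3])
insert(root, [1, 2, 3, 4, 5])
print(match_prefix(root, [1, 2, 3, 4, 5]))  # 5
print(match_prefix(root, [1, 2, 3, 4]))     # 3 (partial edge match stops the walk)
```

The loop carries the same state a recursive call would (current node, current offset), which is all that is needed to avoid recursion here.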
@Luodian 1. We don't support multi-node serving currently; it will be supported in the future. 2. Sorry for the confusion between **tensor parallelism** and **frontend parallelism**. The `parallel=8` means using...
@koalazf99 Yes, the `--tp-size` stands for tensor parallelism, which allows your server to run across multiple GPUs. This is the only configuration required to enable tensor parallelism. However, note that...
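To illustrate the idea behind tensor parallelism, here is a hedged, minimal sketch: each GPU (rank) holds a row shard of a weight matrix, computes its slice of the output, and the slices are concatenated. This is an illustration of the concept only, not SGLang's implementation, and `matvec`/`shard_rows` are hypothetical helper names.

```python
# Hedged sketch of tensor parallelism: shard a weight matrix across ranks.
def matvec(W, x):
    """Plain matrix-vector product over lists (stand-in for a GPU matmul)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def shard_rows(W, tp_size):
    """Split W's output rows evenly across tp_size ranks."""
    n = len(W) // tp_size
    return [W[r * n:(r + 1) * n] for r in range(tp_size)]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]

full = matvec(W, x)                                   # single-device result
shards = shard_rows(W, tp_size=2)                     # one shard per "GPU"
parallel = [y for Wr in shards for y in matvec(Wr, x)]  # concat partial outputs
print(full == parallel)  # True: sharded compute reproduces the full result
```

Each rank only stores and multiplies its shard, which is how `--tp-size` lets a model larger than one GPU's memory run across several GPUs.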
@koalazf99 Yes, data parallelism is not supported yet.
@fisher75 Which specific example are you referring to? If you mean a vision model (e.g., llava) where the few-shot examples are images used as a shared context prefix, with another image appended afterward for inference, then these vision models do not support that right now, because only a single image is currently supported as input.
@fisher75 You can refer to this tree_of_thought benchmark: https://github.com/sgl-project/sglang/blob/cb389c91bcff6ffac4a95a0551a05d67e21ba306/benchmark/tree_of_thought_deep/bench_sglang.py#L41-L70 For the image API, use it directly as shown here: https://github.com/sgl-project/sglang/blob/cb389c91bcff6ffac4a95a0551a05d67e21ba306/examples/quick_start/srt_example_llava.py#L7-L10