Liangsheng Yin
Thanks for reporting this warning! Could you provide more details so we can reproduce it or locate the bug?
The leaked KV cache is too large. I just reproduced the error on an A10 24G with llama-13b-hf AWQ. During benchmarking, the backend logs report CUDA out-of-memory errors. ```...
That's strange. I am unsure whether the flush cache function conflicts with Flashinfer's kernels. I will check it soon.
@comaniac After updating `flashinfer` to its wheel release, I no longer hit the OOM problem after flushing the cache as you described. Could you please check again to make...
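For reference, a minimal sketch of the cache flush mentioned above, assuming a local SGLang server on the default port 30000 that exposes a `/flush_cache` endpoint (port and endpoint are assumptions; check your server's routes):

```python
# Minimal sketch: flush the KV cache of a locally running SGLang server.
# Assumes the server listens on localhost:30000 and exposes GET /flush_cache.
import requests

resp = requests.get("http://localhost:30000/flush_cache")
print(resp.status_code, resp.text)
```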
@bigplayer-ai Try this https://github.com/lehni/vscode-fix-checksums/issues/7. 1. Disconnect from the remote server. 2. Apply the checksum fix and restart VS Code. 3. Then connect to the remote server; the checksum has been...
This problem only occurs on macOS. On Windows, pressing `Enter` does not trigger the submission option.
@mounamokaddem Try decreasing `--mem-fraction-static`, as SGLang needs more free GPU memory to allocate when the tensor parallelism size is large.
@Gintasz Try decreasing `--mem-fraction-static` further. Since you are using the logprobs utilities, more unallocated GPU memory is needed for temporary tensors during processing.
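Both comments above adjust the same knob. A minimal sketch of launching with a reduced static memory fraction, assuming SGLang's Python `Runtime` API; the model path and the value 0.7 are placeholders:

```python
# Minimal sketch: start an SGLang runtime with a lowered static memory
# fraction so more GPU memory stays free for temporary tensors.
import sglang as sgl

runtime = sgl.Runtime(
    model_path="meta-llama/Llama-2-13b-chat-hf",  # placeholder model path
    tp_size=2,                # tensor parallelism size
    mem_fraction_static=0.7,  # lowered from the default to leave headroom
)
sgl.set_default_backend(runtime)
```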
@vedantroy You are not using the local model properly. It is not an OpenAI endpoint but an SGLang endpoint; what you described is using the SGLang endpoint through its OpenAI-compatible API. This...
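To illustrate the distinction, a minimal sketch of calling a local SGLang endpoint through its OpenAI-compatible API, assuming the server runs on port 30000; `"EMPTY"` is a placeholder API key, and `"default"` is the name the OpenAI-compatible server commonly uses for the locally loaded model:

```python
# Minimal sketch: talk to a local SGLang server via the OpenAI client by
# overriding base_url; this is not the hosted OpenAI endpoint.
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",  # the locally loaded model, not an OpenAI model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```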
@vedantroy Sure, of course we do automatic batching for both the prefill and decode phases. Prefill batching has two constraints: 1. The tokens of the newly prefilled requests...
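As a toy illustration of a token-budget constraint of the kind described above (not SGLang's actual scheduler; all names here are hypothetical), a prefill batch can be built by admitting waiting requests while the budget holds:

```python
# Toy sketch of prefill admission under a per-batch token budget. This
# mirrors the kind of constraint described above, not SGLang's code.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int  # number of tokens to prefill for this request

def build_prefill_batch(waiting, max_prefill_tokens):
    batch, used = [], 0
    for req in waiting:
        # Stop once adding this request would exceed the token budget.
        if used + req.prompt_tokens > max_prefill_tokens:
            break
        batch.append(req)
        used += req.prompt_tokens
    return batch
```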