Liangsheng Yin
Thanks for reporting this warning! Could you provide more details so we can reproduce it or locate the bug?
The leaked KV cache is too large. I just reproduced the error on an A10 24G with llama-13b-hf AWQ. During benchmarking, the backend logs report CUDA out-of-memory errors. ```...
That's strange. I am unsure whether the flush cache function conflicts with Flashinfer's kernels. I will check it soon.
@comaniac After updating `flashinfer` to its wheel release, I no longer hit the OOM problem after flushing the cache as you described. Could you please check again to make...
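For reference, a minimal sketch of the cache flush mentioned above, assuming a local SGLang server on the default port 30000 that exposes a `/flush_cache` endpoint (port and endpoint are assumptions; check your server's routes):

```python
# Minimal sketch: flush the KV cache of a locally running SGLang server.
# Assumes the server listens on localhost:30000 and exposes GET /flush_cache.
import requests

resp = requests.get("http://localhost:30000/flush_cache")
print(resp.status_code, resp.text)
```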
@bigplayer-ai Try this https://github.com/lehni/vscode-fix-checksums/issues/7. 1. Disconnect from the remote server. 2. Apply the checksum fix and restart VS Code. 3. Then connect to the remote server; the checksum has been...
This problem only occurs on macOS. On Windows, pressing `Enter` does not trigger the submission option.
@mounamokaddem Try decreasing `--mem-fraction-static`, as SGLang needs more free GPU memory to allocate when the tensor parallelism size is large.
@Gintasz Try decreasing `--mem-fraction-static` further. Since you are using the logprobs utilities, more unallocated GPU memory is needed for temporary tensors during processing.
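Both comments above adjust the same knob. A minimal sketch of launching with a reduced static memory fraction, assuming SGLang's Python `Runtime` API; the model path and the value 0.7 are placeholders:

```python
# Minimal sketch: start an SGLang runtime with a lowered static memory
# fraction so more GPU memory stays free for temporary tensors.
import sglang as sgl

runtime = sgl.Runtime(
    model_path="meta-llama/Llama-2-13b-chat-hf",  # placeholder model path
    tp_size=2,                # tensor parallelism size
    mem_fraction_static=0.7,  # lowered from the default to leave headroom
)
sgl.set_default_backend(runtime)
```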
@vedantroy You are not using the local model properly. It is not an OpenAI endpoint but an SGLang endpoint; what you described is using the SGLang endpoint through its OpenAI-compatible API. This...
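To illustrate the distinction, a minimal sketch of calling a local SGLang endpoint through its OpenAI-compatible API, assuming the server runs on port 30000; `"EMPTY"` is a placeholder API key, and `"default"` is the name the OpenAI-compatible server commonly uses for the locally loaded model:

```python
# Minimal sketch: talk to a local SGLang server via the OpenAI client by
# overriding base_url; this is not the hosted OpenAI endpoint.
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",  # the locally loaded model, not an OpenAI model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```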
@vedantroy Sure, of course we do automatic batching for both the prefill and decode phases. Prefill batching has two constraints: 1. The tokens of the newly prefilled requests...
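As a toy illustration of a token-budget constraint of the kind described above (not SGLang's actual scheduler; all names here are hypothetical), a prefill batch can be built by admitting waiting requests while the budget holds:

```python
# Toy sketch of prefill admission under a per-batch token budget. This
# mirrors the kind of constraint described above, not SGLang's code.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int  # number of tokens to prefill for this request

def build_prefill_batch(waiting, max_prefill_tokens):
    batch, used = [], 0
    for req in waiting:
        # Stop once adding this request would exceed the token budget.
        if used + req.prompt_tokens > max_prefill_tokens:
            break
        batch.append(req)
        used += req.prompt_tokens
    return batch
```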