Hi
> httpx defaults to 10 connections, and 10 connections alone should not cause a memory leak. Assuming you have a Python backend that uses Chroma (server) via `HttpClient`,...
> Hi [@llmadd](https://github.com/llmadd), going back to the original problem: the issue is that you are creating a new client on every API request. You should instead have 1 global...
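The advice above can be sketched as a minimal pattern. The `Client` class here is only a stand-in for `chromadb.HttpClient` (or any pooled HTTP client); the point is that it is constructed once at startup and shared by every request handler, instead of being created inside each handler:

```python
# Minimal sketch of the "one global client" pattern. Client is a
# stand-in for chromadb.HttpClient; host/port/collection names would
# come from your own deployment.
from functools import lru_cache


class Client:
    """Stand-in for a pooled HTTP client such as chromadb.HttpClient."""
    instances = 0

    def __init__(self):
        Client.instances += 1


@lru_cache(maxsize=1)
def get_client() -> Client:
    # Constructed on the first call only; every later call returns
    # the same cached object.
    return Client()


# Simulate many API requests: all of them share the single client.
clients = [get_client() for _ in range(100)]
assert Client.instances == 1
```

With the per-request pattern, each request would have opened its own connection pool; with one cached client, the pool is reused and released cleanly on shutdown.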
For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: https://github.com/vllm-project/vllm/issues/17327
> > For errors caused by excessive tensor parallelism, you can set the --enable-expert-parallel flag. Refer to: [#17327](https://github.com/vllm-project/vllm/issues/17327) > > Deploying a model with such settings might reduce the inference efficiency. I'm not quite sure. I tried EP or TP 4 PP...
> Just so you know, we were able to run it successfully, and this `--enable-expert-parallel` helped us get a step closer. > > We're running on a g5.48xlarge with 8x...
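For reference, a launch command along these lines would look like the sketch below; the model name and parallel size are placeholders to adjust for your own hardware:

```shell
# Sketch: serve a MoE model with expert parallelism enabled rather than
# relying on a high tensor-parallel degree alone. Model name and sizes
# are placeholders, not a tested configuration.
vllm serve <your-moe-model> \
    --tensor-parallel-size 8 \
    --enable-expert-parallel
```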
```
python -m sglang.launch_server --model-path /odb/zh/gte_Qwen2-7B-instruct --host 0.0.0.0 --is-embedding
```

When using sglang to run the embedding model with the OpenAI SDK, an empty input (`input=""`) reliably causes a RuntimeError:...
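One client-side workaround, assuming you control the batch before it reaches the embeddings endpoint, is to drop empty strings up front. This is only a sketch with illustrative names; the real fix would be server-side input validation in sglang:

```python
def sanitize_inputs(texts):
    """Drop empty or whitespace-only strings before sending a batch to
    the embedding endpoint, so an empty input never reaches the server.

    Workaround sketch only; function name is illustrative.
    """
    cleaned = [t for t in texts if t and t.strip()]
    if not cleaned:
        raise ValueError("all inputs were empty; nothing to embed")
    return cleaned
```

With this guard in place, `input=""` is rejected locally with a clear `ValueError` instead of triggering the server-side RuntimeError.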