Hi
> httpx defaults to 10 connections, and 10 connections alone should not cause a memory leak. Assuming you have a Python backend that uses Chroma (server) via `HttpClient`,...
> Hi [@llmadd](https://github.com/llmadd), going back to the original problem: the issue is that you are creating a new client on every API request. You should instead have 1 global...
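The advice above can be sketched as a minimal pattern. The `Client` class here is only a stand-in for `chromadb.HttpClient` (or any pooled HTTP client); the point is that it is constructed once at startup and shared by every request handler, instead of being created inside each handler:

```python
# Minimal sketch of the "one global client" pattern. Client is a
# stand-in for chromadb.HttpClient; host/port/collection names would
# come from your own deployment.
from functools import lru_cache


class Client:
    """Stand-in for a pooled HTTP client such as chromadb.HttpClient."""
    instances = 0

    def __init__(self):
        Client.instances += 1


@lru_cache(maxsize=1)
def get_client() -> Client:
    # Constructed on the first call only; every later call returns
    # the same cached object.
    return Client()


# Simulate many API requests: all of them share the single client.
clients = [get_client() for _ in range(100)]
assert Client.instances == 1
```

With the per-request pattern, each request would have opened its own connection pool; with one cached client, the pool is reused and released cleanly on shutdown.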
For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: https://github.com/vllm-project/vllm/issues/17327
> > For errors caused by excessive tensor parallelism, you can set the --enable-expert-parallel flag. Refer to: [#17327](https://github.com/vllm-project/vllm/issues/17327) > > Deploying a model with such settings might reduce the inference efficiency. I'm not quite sure. I tried EP or TP 4 PP...
> Just so you know, we were able to run it successfully, and this `--enable-expert-parallel` helped us get a step closer. > > We're running on a g5.48xlarge with 8x...
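For reference, a launch command along these lines would look like the sketch below; the model name and parallel size are placeholders to adjust for your own hardware:

```shell
# Sketch: serve a MoE model with expert parallelism enabled rather than
# relying on a high tensor-parallel degree alone. Model name and sizes
# are placeholders, not a tested configuration.
vllm serve <your-moe-model> \
    --tensor-parallel-size 8 \
    --enable-expert-parallel
```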
```
python -m sglang.launch_server --model-path /odb/zh/gte_Qwen2-7B-instruct --host 0.0.0.0 --is-embedding
```

When using sglang to run the embedding model with the OpenAI SDK, an empty input (`input=""`) reliably causes a RuntimeError:...
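One client-side workaround, assuming you control the batch before it reaches the embeddings endpoint, is to drop empty strings up front. This is only a sketch with illustrative names; the real fix would be server-side input validation in sglang:

```python
def sanitize_inputs(texts):
    """Drop empty or whitespace-only strings before sending a batch to
    the embedding endpoint, so an empty input never reaches the server.

    Workaround sketch only; function name is illustrative.
    """
    cleaned = [t for t in texts if t and t.strip()]
    if not cleaned:
        raise ValueError("all inputs were empty; nothing to embed")
    return cleaned
```

With this guard in place, `input=""` is rejected locally with a clear `ValueError` instead of triggering the server-side RuntimeError.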