aresnow1
Embedding is a CPU-intensive call, and even for a stateless actor it is not executed concurrently, because the event loop stays blocked until the first call returns. Therefore, the...
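A minimal sketch of the issue (function names are hypothetical, not the actual actor code): when a CPU-bound call like embedding runs directly inside an async handler, the event loop is blocked, so a second request cannot start until the first one finishes; offloading the call to an executor keeps the loop responsive.

```python
import asyncio
import time

def embed(texts):
    # Stand-in for a CPU-bound embedding computation (hypothetical).
    time.sleep(1.0)
    return [[0.0] * 8 for _ in texts]

async def handle_request(texts):
    # Calling the CPU-bound function directly blocks the event loop,
    # so concurrent requests are effectively serialized.
    return embed(texts)

async def handle_request_offloaded(texts):
    # Offloading to an executor keeps the event loop free to serve other
    # requests (for truly CPU-bound work a process pool is needed to
    # overlap the computation itself).
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, embed, texts)

async def main():
    t0 = time.perf_counter()
    await asyncio.gather(handle_request(["a"]), handle_request(["b"]))
    print("blocking :", time.perf_counter() - t0)   # ~2s, calls run one after another

    t0 = time.perf_counter()
    await asyncio.gather(handle_request_offloaded(["a"]), handle_request_offloaded(["b"]))
    print("offloaded:", time.perf_counter() - t0)   # ~1s, calls overlap

asyncio.run(main())
```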
Thanks for your feedback; it appears that the supervisor's port is not exposed. We will fix this issue in the next version.
> Also, another question: if the GPU indices are not contiguous, for example there are 5 GPUs (0-4) and GPU 3 is occupied by another model, leaving 0, 1, 2 and 4, can n_gpu=4 still be used?

For n_gpu=4, it is best to make sure there are four idle GPUs.
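As a hedged sketch (the model name is a placeholder and client parameter names may differ between Xinference versions), one way to handle non-contiguous indices is to expose only the idle GPUs to the worker via CUDA_VISIBLE_DEVICES and then launch with n_gpu=4:

```python
# Set in the environment of the Xinference worker *before* it starts, so that
# only the idle GPUs (0, 1, 2 and 4 in the example above) are visible.
# Inside the process they are renumbered 0-3, so n_gpu=4 maps onto exactly those cards.
#   export CUDA_VISIBLE_DEVICES=0,1,2,4
#   xinference-local ...   # or start the worker, depending on your deployment

# Then launch the model across four GPUs via the Python client.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")
client.launch_model(model_name="llama-2-chat", n_gpu=4)
```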
Do you have GPU cards on your machine?
We have created a pull request (https://github.com/langchain-ai/langchain/pull/12702) and are waiting for it to be merged!
Additionally, the model ability must be set to "chat".
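For a custom model registration, that means the model description should list "chat" under model_ability. A rough sketch (field values are placeholders and the exact schema and client signature may differ between Xinference versions):

```python
import json
from xinference.client import Client

# Minimal custom-LLM description; only the fields relevant here are shown,
# and all values are placeholders.
model_description = {
    "version": 1,
    "model_name": "my-custom-model",
    "context_length": 4096,
    "model_lang": ["en"],
    "model_ability": ["chat"],          # must include "chat" for chat usage
    "model_family": "llama-2-chat",
    "model_specs": [
        {
            "model_format": "pytorch",
            "model_size_in_billions": 7,
            "quantizations": ["none"],
            "model_uri": "file:///path/to/model",
        }
    ],
}

client = Client("http://127.0.0.1:9997")
# register_model is assumed here; check your Xinference version for the exact signature.
client.register_model(model_type="LLM", model=json.dumps(model_description), persist=True)
```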
A Python API for in-flight batching is needed for this PR; the TensorRT-LLM team says it will be implemented in future versions.
Refer to the documentation: https://inference.readthedocs.io/en/latest/getting_started/using_xinference.html#configure-xinference-home-path
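As a small hedged example (the path is a placeholder), the home directory can be pointed at a location with enough disk space via the XINFERENCE_HOME environment variable before the server starts:

```python
import os

# Placeholder path; set XINFERENCE_HOME in the environment that starts the
# Xinference server so models, logs and caches are stored there.
os.environ["XINFERENCE_HOME"] = "/data/xinference"

# Equivalent shell form before launching:
#   XINFERENCE_HOME=/data/xinference xinference-local --host 0.0.0.0 --port 9997
```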
> I have a similar question too. I download models from Hugging Face via aria2 because it supports multi-threaded downloads. After downloading, I don't know how to put the models...
Currently, there is no interface that directly shuts down the cluster; a stop interface could be added to the cluster API.