Randomly distribute traffic across multiple workers of the same model
Feature request
I have deployed one supervisor and two qwen2-vl-7b-instruct workers. However, I've noticed that clients can currently only query a model by its model_id. I would like to query models by name, such as qwen2-vl-7b-instruct, and have traffic randomly distributed across multiple workers serving the same model. Is this currently supported?
Motivation
Additional workers should be deployed for models that require more resources.
Your contribution
None
Launch the same model with replica=2; the model will then have 2 replicas on the 2 workers.
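For example, a minimal sketch of that single launch call (assuming the replica keyword of launch_model is available in your Xinference version):

from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")

# One launch call creates two replicas of the same model; the supervisor
# places them on the available workers.
model_uid = client.launch_model(
    model_engine="transformers",
    model_name="qwen2-vl-instruct",
    model_format="pytorch",
    model_size_in_billions="7",
    replica=2
)

# All requests go through this single model_uid and are spread
# across the replicas.
model = client.get_model(model_uid)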
In my situation, I intend to launch multiple GPU Docker instances, each automatically initiating one xinference worker. Is this scenario suitable for utilizing the replicas configuration?
That should work well.
start supervisor:
xinference-supervisor -H SupervisorAddress -p 8416 --supervisor-port 8417
start worker 1:
xinference-worker -e "http://SupervisorAddress:8416" -H "Worker1Address" --worker-port 8418
launch model 1 on worker 1:
from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")
model_uid = client.launch_model(
    model_engine="transformers",
    model_name="qwen2-vl-instruct",
    model_format="pytorch",
    model_size_in_billions="7"
)
print('LLM Model uid: ' + model_uid)
start worker 2:
xinference-worker -e "http://SupervisorAddress:8416" -H "Worker2Address" --worker-port 8418
launch model 2 on worker 2:
from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")
model_uid = client.launch_model(
    model_engine="transformers",
    model_name="qwen2-vl-instruct",
    model_format="pytorch",
    model_size_in_billions="7"
)
print('LLM Model uid: ' + model_uid)
How can I modify my deployment method?
That should work well.
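Concretely, the deployment above could be changed to launch the model only once, after both workers have registered with the supervisor, instead of launching it separately on each worker (a sketch, again assuming launch_model supports the replica argument):

from xinference.client import RESTfulClient

# Supervisor and the two workers are started exactly as above.
client = RESTfulClient("http://SupervisorAddress:8416")

# A single launch with replica=2 replaces the two separate launch calls;
# the supervisor places one replica on each worker.
model_uid = client.launch_model(
    model_engine="transformers",
    model_name="qwen2-vl-instruct",
    model_format="pytorch",
    model_size_in_billions="7",
    replica=2
)
print('LLM Model uid: ' + model_uid)

Clients then use this single model_uid, and requests are spread across the two replicas.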
I have tested that if all GPU Docker instances are ready and all workers have started, then launching the model once by setting replica to 2 works. However, in my scenario, I want to dynamically add workers to an existing model. Is there any method to achieve this?
Oh, you mean dynamically scaling the replica count, e.g. from 1 to 2 and then to 3?
Yes, the replica count may need to be adjusted dynamically after the initial model launch, depending on traffic. I am hoping for support for adding/removing workers and increasing/decreasing model replicas on the fly after the first launch.
Sorry, this functionality is part of the enterprise version.
Got it, thank you for your kind reply.