Randomly distribute traffic across multiple workers of the same model
Feature request
I have deployed one supervisor and two qwen2-vl-7b-instruct workers. However, I've noticed that clients can currently only query a model by its model_id. I would like to query models by name, such as qwen2-vl-7b-instruct, and have traffic randomly distributed across multiple workers serving the same model. Is this currently supported?
Motivation
Additional workers should be deployed for models that require more resources.
Your contribution
None
Launch the same model with replica=2; the model will then have 2 replicas on the 2 workers.
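For example, a minimal sketch of that single launch call (assuming the replica keyword of launch_model is available in your Xinference version):

from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")

# One launch call creates two replicas of the same model; the supervisor
# places them on the available workers.
model_uid = client.launch_model(
    model_engine="transformers",
    model_name="qwen2-vl-instruct",
    model_format="pytorch",
    model_size_in_billions="7",
    replica=2
)

# All requests go through this single model_uid and are spread
# across the replicas.
model = client.get_model(model_uid)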
In my situation, I intend to launch multiple GPU Docker instances, each automatically initiating one xinference worker. Is this scenario suitable for utilizing the replicas configuration?
That should work well.
start supervisor:
xinference-supervisor -H SupervisorAddress -p 8416 --supervisor-port 8417
start worker 1:
xinference-worker -e "http://SupervisorAddress:8416" -H "Worker1Address" --worker-port 8418
launch model 1 on worker 1:
from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")
model_uid = client.launch_model(
    model_engine="transformers",
    model_name="qwen2-vl-instruct",
    model_format="pytorch",
    model_size_in_billions="7"
)
print('LLM Model uid: ' + model_uid)
start worker 2:
xinference-worker -e "http://SupervisorAddress:8416" -H "Worker2Address" --worker-port 8418
launch model 2 on worker 2:
from xinference.client import RESTfulClient

client = RESTfulClient("http://SupervisorAddress:8416")
model_uid = client.launch_model(
    model_engine="transformers",
    model_name="qwen2-vl-instruct",
    model_format="pytorch",
    model_size_in_billions="7"
)
print('LLM Model uid: ' + model_uid)
How can I modify my deployment method?
That should work well.
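Concretely, the deployment above could be changed to launch the model only once, after both workers have registered with the supervisor, instead of launching it separately on each worker (a sketch, again assuming launch_model supports the replica argument):

from xinference.client import RESTfulClient

# Supervisor and the two workers are started exactly as above.
client = RESTfulClient("http://SupervisorAddress:8416")

# A single launch with replica=2 replaces the two separate launch calls;
# the supervisor places one replica on each worker.
model_uid = client.launch_model(
    model_engine="transformers",
    model_name="qwen2-vl-instruct",
    model_format="pytorch",
    model_size_in_billions="7",
    replica=2
)
print('LLM Model uid: ' + model_uid)

Clients then use this single model_uid, and requests are spread across the two replicas.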
I have tested that if all GPU Docker instances are ready and all workers have started, then launching the model once by setting replica to 2 works. However, in my scenario, I want to dynamically add workers to an existing model. Is there any method to achieve this?
Oh, you mean dynamically scaling the replica count, e.g. from 1 to 2 and then to 3?
Yes, the replica count may need to be adjusted dynamically after the initial model launch, depending on traffic. I am hoping for support for adding/removing workers and increasing/decreasing model replicas on the fly after the first launch.
Sorry, this functionality is part of the enterprise version.
Got it, thank you for your kind reply.