LinJianping

10 comments by LinJianping

> Launch the same model with replica=2, and the model will have 2 replicas on 2 workers.

In my situation, I intend to launch multiple GPU Docker instances, each automatically initiating...

> That should work well.
>
> Start supervisor: `xinference-supervisor -H SupervisorAddress -p 8416 --supervisor-port 8417`
> Start worker 1: `xinference-worker -e "http://SupervisorAddress:8416" -H "Worker1Address" --worker-port 8418`
> Launch model 1 on worker...

> > That should work well.
> >
> > Start supervisor: `xinference-supervisor -H SupervisorAddress -p 8416 --supervisor-port 8417`
> > Start worker 1: `xinference-worker -e "http://SupervisorAddress:8416" -H "Worker1Address" --worker-port 8418`
> > Launch model...
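
For reference, the same topology can also be driven from Python instead of the CLI. A minimal sketch, assuming the Xinference RESTful client (`xinference.client.Client`) and that `launch_model` accepts a `replica` argument as in recent releases; the model name is a placeholder:

```python
from xinference.client import Client

# Connect to the supervisor started above (port 8416 in the quoted commands).
client = Client("http://SupervisorAddress:8416")

# Launch one model with two replicas; the supervisor schedules the
# replicas across the registered workers.
model_uid = client.launch_model(
    model_name="qwen2-instruct",  # placeholder model name
    model_type="LLM",
    replica=2,
)
print(model_uid)
```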

> Oh, you mean dynamically scaling replicas, e.g. from 1 to 2 and then to 3?

Yes, the replica count may need to be adjusted dynamically after the initial model launch due...

> Sorry, this is functionality of the enterprise version.

Got it, thank you for your kind reply.

> It cannot be reproduced with the latest main branch.

When batch_size is 1, everything runs fine. When batch_size is set to 8, the above error occasionally occurs during...

> I have asked an expert; the error might come from the vision model running on the default stream, which would corrupt the capture of the language model on the other stream....

> We would capture multiple graphs with different input sizes, and the input would be padded to the capture size before the forward pass. It is safe to use dynamic batching.

What is the specific capture strategy like? ...
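
To make the quoted strategy concrete, here is a minimal, self-contained sketch of size-bucketed CUDA graph capture with input padding; the bucket sizes and the `torch.nn.Linear` stand-in model are illustrative assumptions, not lmdeploy's actual code:

```python
import torch

BUCKET_SIZES = [1, 2, 4, 8]  # one captured graph per maximum batch size
graphs, static_in, static_out = {}, {}, {}

model = torch.nn.Linear(16, 16).cuda().eval()  # stand-in model

for size in BUCKET_SIZES:
    static_in[size] = torch.zeros(size, 16, device="cuda")
    # Warm up on a side stream before capture, as CUDA graphs require.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        model(static_in[size])
    torch.cuda.current_stream().wait_stream(s)
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out[size] = model(static_in[size])
    graphs[size] = g

def forward(x):
    # Pad the batch up to the nearest captured size, replay, then slice.
    size = next(s for s in BUCKET_SIZES if s >= x.shape[0])
    static_in[size].zero_()
    static_in[size][: x.shape[0]].copy_(x)
    graphs[size].replay()
    return static_out[size][: x.shape[0]].clone()

print(forward(torch.randn(3, 16, device="cuda")).shape)  # torch.Size([3, 16])
```

Because every incoming batch is padded to a pre-captured shape, the replayed graph always sees a fixed input size, which is why dynamic batching stays safe.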

Another question out of curiosity: why does TurboMind support the 2B-76B InternVL2 models but not the 1B model? Are there any plans to support it in the future? @grimoire

> https://github.com/grimoire/lmdeploy/tree/fix-vl-graphcapture I have set the capture mode to thread_local, which might fix the bug.
>
> > What is the specific capture strategy like?
>
> https://github.com/grimoire/lmdeploy/blob/e16c49170f1413f23c03cac2d3549ca7b7f711c4/lmdeploy/pytorch/backends/cuda/graph_runner.py#L133
>
> ...
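
For anyone following along, this is roughly what the thread_local setting looks like at the PyTorch level. A minimal sketch assuming `torch.cuda.graph`'s `capture_error_mode` argument (present in recent PyTorch releases), not the linked branch's actual code:

```python
import torch

model = torch.nn.Linear(16, 16).cuda().eval()  # stand-in for the language model
x = torch.zeros(8, 16, device="cuda")

# Warm-up pass on a side stream, as CUDA graph capture requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    model(x)
torch.cuda.current_stream().wait_stream(s)

# "thread_local" limits capture-invalidation checks to the capturing thread,
# so work issued by another thread on the default stream (e.g. a vision
# model) no longer aborts or corrupts this capture.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    y = model(x)

g.replay()  # re-run the captured forward pass
```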