LinJianping

10 comments by LinJianping

> Launch the same model with replica=2, and the model will have 2 replicas on 2 workers.

In my situation, I intend to launch multiple GPU Docker instances, each automatically initiating...

> That should work well.
>
> Start supervisor: `xinference-supervisor -H SupervisorAddress -p 8416 --supervisor-port 8417`
> Start worker 1: `xinference-worker -e "http://SupervisorAddress:8416" -H "Worker1Address" --worker-port 8418`
> Launch model 1 on worker...

> > That should work well.
> >
> > Start supervisor: `xinference-supervisor -H SupervisorAddress -p 8416 --supervisor-port 8417`
> > Start worker 1: `xinference-worker -e "http://SupervisorAddress:8416" -H "Worker1Address" --worker-port 8418`
> > Launch model...
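
For reference, the same topology can also be driven from Python instead of the CLI. A minimal sketch, assuming the Xinference RESTful client (`xinference.client.Client`) and that `launch_model` accepts a `replica` argument as in recent releases; the model name is a placeholder:

```python
from xinference.client import Client

# Connect to the supervisor started above (port 8416 in the quoted commands).
client = Client("http://SupervisorAddress:8416")

# Launch one model with two replicas; the supervisor schedules the
# replicas across the registered workers.
model_uid = client.launch_model(
    model_name="qwen2-instruct",  # placeholder model name
    model_type="LLM",
    replica=2,
)
print(model_uid)
```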

> Oh, you mean dynamically scaling replicas, e.g. from 1 to 2 and then to 3?

Yes, the replica count may need to be adjusted dynamically after the initial model launch due...

> Sorry, this is functionality of the enterprise version.

Got it, thank you for your kind reply.

> It cannot be reproduced with the latest main branch.

When batch_size is 1, everything runs fine. When batch_size is set to 8, the above error occasionally occurs during...

> I have asked an expert; the error might come from the vision model running on the default stream, which would corrupt the capture of the language model on the other stream....

> We would capture multiple graphs with different input sizes, and the input would be padded to the capture size before the forward pass. It is safe to use dynamic batching.

What is the specific capture strategy like? ...
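
To make the quoted strategy concrete, here is a minimal, self-contained sketch of size-bucketed CUDA graph capture with input padding; the bucket sizes and the `torch.nn.Linear` stand-in model are illustrative assumptions, not lmdeploy's actual code:

```python
import torch

BUCKET_SIZES = [1, 2, 4, 8]  # one captured graph per maximum batch size
graphs, static_in, static_out = {}, {}, {}

model = torch.nn.Linear(16, 16).cuda().eval()  # stand-in model

for size in BUCKET_SIZES:
    static_in[size] = torch.zeros(size, 16, device="cuda")
    # Warm up on a side stream before capture, as CUDA graphs require.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        model(static_in[size])
    torch.cuda.current_stream().wait_stream(s)
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out[size] = model(static_in[size])
    graphs[size] = g

def forward(x):
    # Pad the batch up to the nearest captured size, replay, then slice.
    size = next(s for s in BUCKET_SIZES if s >= x.shape[0])
    static_in[size].zero_()
    static_in[size][: x.shape[0]].copy_(x)
    graphs[size].replay()
    return static_out[size][: x.shape[0]].clone()

print(forward(torch.randn(3, 16, device="cuda")).shape)  # torch.Size([3, 16])
```

Because every incoming batch is padded to a pre-captured shape, the replayed graph always sees a fixed input size, which is why dynamic batching stays safe.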

Another question out of curiosity: why does TurboMind support the 2B-76B InternVL2 models but not the 1B model? Are there any plans to support it in the future? @grimoire

> https://github.com/grimoire/lmdeploy/tree/fix-vl-graphcapture I have set the capture mode to thread_local, which might fix the bug.
>
> > What is the specific capture strategy like?
>
> https://github.com/grimoire/lmdeploy/blob/e16c49170f1413f23c03cac2d3549ca7b7f711c4/lmdeploy/pytorch/backends/cuda/graph_runner.py#L133
>
> ...
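
For anyone following along, this is roughly what the thread_local setting looks like at the PyTorch level. A minimal sketch assuming `torch.cuda.graph`'s `capture_error_mode` argument (present in recent PyTorch releases), not the linked branch's actual code:

```python
import torch

model = torch.nn.Linear(16, 16).cuda().eval()  # stand-in for the language model
x = torch.zeros(8, 16, device="cuda")

# Warm-up pass on a side stream, as CUDA graph capture requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    model(x)
torch.cuda.current_stream().wait_stream(s)

# "thread_local" limits capture-invalidation checks to the capturing thread,
# so work issued by another thread on the default stream (e.g. a vision
# model) no longer aborts or corrupts this capture.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    y = model(x)

g.replay()  # re-run the captured forward pass
```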