
support concurrent inference from multiple models

mikestaub opened this issue 1 year ago • 2 comments

I would like to stream the response from two different LLMs simultaneously

mikestaub · Jul 23 '24 20:07

Thanks for the request. Supporting multiple models in a single engine simultaneously is something we are looking into now. Meanwhile, would having two MLCEngine instances work for your case?
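
For reference, a minimal sketch of that two-engine setup, assuming the usual CreateMLCEngine factory and the OpenAI-style chat.completions.create streaming API; the model IDs are illustrative:

```typescript
import * as webllm from "@mlc-ai/web-llm";

// Stream a completion from one engine, tagging each chunk with a label.
async function streamFrom(
  engine: webllm.MLCEngineInterface,
  prompt: string,
  label: string
) {
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  for await (const chunk of chunks) {
    console.log(label, chunk.choices[0]?.delta?.content ?? "");
  }
}

async function main() {
  // Two independent engines, each loading its own model; both models must
  // fit on the device at the same time. Model IDs are illustrative.
  const [engineA, engineB] = await Promise.all([
    webllm.CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC"),
    webllm.CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC"),
  ]);

  // Kick off both streams at once; output from the two models interleaves.
  await Promise.all([
    streamFrom(engineA, "Tell me a joke.", "[A]"),
    streamFrom(engineB, "Tell me a joke.", "[B]"),
  ]);
}

main();
```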

CharlieFRuan · Jul 23 '24 20:07

Yes, that should work, assuming the device has enough resources. Is this possible today? Is there an example I can play with?

mikestaub · Jul 23 '24 21:07

Hi @mikestaub, as of npm 0.2.60, a single engine can load multiple models, and those models can process requests concurrently. However, I have not measured the performance benefit (if any) of processing requests simultaneously rather than sequentially. Being able to load multiple models is definitely convenient, though: it makes the engine behave like an OpenAI()-style endpoint, assuming the device has enough resources.

Note: each model can still only process one request at a time (i.e. concurrent batching is not supported).
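For reference, a minimal sketch of this single-engine, multi-model flow, assuming CreateMLCEngine accepts a list of model IDs from 0.2.60 onward and that the model field on each request selects which loaded model serves it; the model IDs are illustrative:

```typescript
import * as webllm from "@mlc-ai/web-llm";

const MODEL_A = "Llama-3-8B-Instruct-q4f32_1-MLC";   // illustrative model IDs
const MODEL_B = "Phi-3-mini-4k-instruct-q4f16_1-MLC";

// Stream one request, routed to a specific loaded model via the `model` field.
async function streamOne(
  engine: webllm.MLCEngineInterface,
  model: string,
  prompt: string
) {
  const chunks = await engine.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  for await (const chunk of chunks) {
    console.log(`[${model}]`, chunk.choices[0]?.delta?.content ?? "");
  }
}

async function main() {
  // One engine holding both models (npm 0.2.60+).
  const engine = await webllm.CreateMLCEngine([MODEL_A, MODEL_B]);

  // The two streams run concurrently, but each model still serves only one
  // request at a time internally (no concurrent batching).
  await Promise.all([
    streamOne(engine, MODEL_A, "Summarize WebGPU in one sentence."),
    streamOne(engine, MODEL_B, "Summarize WebAssembly in one sentence."),
  ]);
}

main();
```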

The two main related PRs are:

  • https://github.com/mlc-ai/web-llm/pull/542
    • Main changes needed to support loading multiple models in an engine
  • https://github.com/mlc-ai/web-llm/pull/546
    • A patch on top of the PR above to support simultaneous processing/streaming of responses

See examples/multi-models for an example; with parallelGeneration() it produces the effect shown below:

https://github.com/user-attachments/assets/0c5188de-b7ed-496d-a56a-28af36b11e0a

CharlieFRuan · Aug 13 '24 20:08

Closing this issue as completed. Feel free to reopen/open new ones if issues arise!

CharlieFRuan · Aug 23 '24 17:08