
support concurrent inference from multiple models

mikestaub opened this issue 1 year ago • 2 comments

I would like to stream the response from two different LLMs simultaneously

mikestaub · Jul 23 '24 20:07

Thanks for the request. Supporting multiple models in a single engine simultaneously is something we are looking into now. Meanwhile, would having two MLCEngine instances work for your case?
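
For reference, a minimal sketch of that two-engine setup, assuming the usual CreateMLCEngine factory and the OpenAI-style chat.completions.create streaming API; the model IDs are illustrative:

```typescript
import * as webllm from "@mlc-ai/web-llm";

// Stream a completion from one engine, tagging each chunk with a label.
async function streamFrom(
  engine: webllm.MLCEngineInterface,
  prompt: string,
  label: string
) {
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  for await (const chunk of chunks) {
    console.log(label, chunk.choices[0]?.delta?.content ?? "");
  }
}

async function main() {
  // Two independent engines, each loading its own model; both models must
  // fit on the device at the same time. Model IDs are illustrative.
  const [engineA, engineB] = await Promise.all([
    webllm.CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC"),
    webllm.CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC"),
  ]);

  // Kick off both streams at once; output from the two models interleaves.
  await Promise.all([
    streamFrom(engineA, "Tell me a joke.", "[A]"),
    streamFrom(engineB, "Tell me a joke.", "[B]"),
  ]);
}

main();
```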

CharlieFRuan · Jul 23 '24 20:07

Yes, that should work, assuming the device has enough resources. Is this possible today? Is there an example I can play with?

mikestaub · Jul 23 '24 21:07

Hi @mikestaub, as of npm 0.2.60, a single engine can load multiple models, and those models can process requests concurrently. However, I have not measured the performance benefit (if any) of processing requests simultaneously rather than sequentially. Being able to load multiple models is definitely convenient, though: it makes the engine behave like an OpenAI()-style endpoint, assuming the device has enough resources.

Note: each model can still only process one request at a time (i.e. concurrent batching is not supported).
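For reference, a minimal sketch of this single-engine, multi-model flow, assuming CreateMLCEngine accepts a list of model IDs from 0.2.60 onward and that the model field on each request selects which loaded model serves it; the model IDs are illustrative:

```typescript
import * as webllm from "@mlc-ai/web-llm";

const MODEL_A = "Llama-3-8B-Instruct-q4f32_1-MLC";   // illustrative model IDs
const MODEL_B = "Phi-3-mini-4k-instruct-q4f16_1-MLC";

// Stream one request, routed to a specific loaded model via the `model` field.
async function streamOne(
  engine: webllm.MLCEngineInterface,
  model: string,
  prompt: string
) {
  const chunks = await engine.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  for await (const chunk of chunks) {
    console.log(`[${model}]`, chunk.choices[0]?.delta?.content ?? "");
  }
}

async function main() {
  // One engine holding both models (npm 0.2.60+).
  const engine = await webllm.CreateMLCEngine([MODEL_A, MODEL_B]);

  // The two streams run concurrently, but each model still serves only one
  // request at a time internally (no concurrent batching).
  await Promise.all([
    streamOne(engine, MODEL_A, "Summarize WebGPU in one sentence."),
    streamOne(engine, MODEL_B, "Summarize WebAssembly in one sentence."),
  ]);
}

main();
```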

The two main related PRs are:

  • https://github.com/mlc-ai/web-llm/pull/542
    • Main changes needed to support loading multiple models in an engine
  • https://github.com/mlc-ai/web-llm/pull/546
    • A patch on top of the PR above to support simultaneous processing/streaming of responses

See examples/multi-models for an example; with parallelGeneration() it produces the effect shown below:

https://github.com/user-attachments/assets/0c5188de-b7ed-496d-a56a-28af36b11e0a

CharlieFRuan · Aug 13 '24 20:08

Closing this issue as completed. Feel free to reopen/open new ones if issues arise!

CharlieFRuan · Aug 23 '24 17:08