support concurrent inference from multiple models
I would like to stream responses from two different LLMs simultaneously.
Thanks for the request. Supporting multiple models in a single engine simultaneously is something we are looking into now. In the meantime, would using two MLCEngine instances work for your case?
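For reference, the two-engine route would look roughly like this (untested sketch; the model IDs are just placeholders, pick any from the prebuilt model list):

```typescript
import * as webllm from "@mlc-ai/web-llm";

// Untested sketch: two independent engines, each with its own model loaded.
// Model IDs are placeholders; substitute any IDs from the prebuilt model list.
async function streamFromTwoEngines() {
  const [engineA, engineB] = await Promise.all([
    webllm.CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC"),
    webllm.CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC"),
  ]);

  // Stream one prompt through each engine at the same time.
  const streamFrom = async (engine: webllm.MLCEngineInterface, label: string) => {
    const chunks = await engine.chat.completions.create({
      messages: [{ role: "user", content: "Tell me a short joke." }],
      stream: true,
    });
    for await (const chunk of chunks) {
      const delta = chunk.choices[0]?.delta?.content ?? "";
      if (delta) console.log(`[${label}] ${delta}`);
    }
  };

  await Promise.all([streamFrom(engineA, "A"), streamFrom(engineB, "B")]);
}
```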
Yes, that should work, assuming the device has enough resources. Is this possible today? Is there an example I can play with?
Hi @mikestaub, as of npm 0.2.60, a single engine can load multiple models, and those models can process requests concurrently. However, I have not tested whether there is any performance benefit to processing requests simultaneously as opposed to sequentially. That said, being able to load multiple models is definitely convenient: the engine behaves like an OpenAI()-style endpoint, assuming the device has enough resources.
Note: each model can still only process one request at a time (i.e. concurrent batching is not supported).
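A rough sketch of the usage (untested here; see examples/multi-models for the exact, runnable version, and treat the model IDs below as placeholders):

```typescript
import * as webllm from "@mlc-ai/web-llm";

// Rough sketch (web-llm >= 0.2.60): one engine with two models loaded at once.
// Model IDs are placeholders; pick any from the prebuilt model list.
const modelA = "Llama-3-8B-Instruct-q4f32_1-MLC";
const modelB = "Phi-3-mini-4k-instruct-q4f16_1-MLC";

async function main() {
  const engine = await webllm.CreateMLCEngine([modelA, modelB]);

  // The request's `model` field selects which loaded model serves it,
  // much like routing by model name against an OpenAI endpoint.
  const reply = await engine.chat.completions.create({
    model: modelA,
    messages: [{ role: "user", content: "Summarize WebGPU in one sentence." }],
  });
  console.log(reply.choices[0].message.content);
}

main();
```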
The two main related PRs are:
- https://github.com/mlc-ai/web-llm/pull/542
  - Main changes needed to support loading multiple models in an engine
- https://github.com/mlc-ai/web-llm/pull/546
  - A patch on top of the PR above to support simultaneously processing/streaming responses
See examples/multi-models for an example; with parallelGeneration() it produces the effect shown below:
https://github.com/user-attachments/assets/0c5188de-b7ed-496d-a56a-28af36b11e0a
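The gist of the parallel path is roughly the following (an untested sketch, not the exact code from the example; model IDs are placeholders):

```typescript
import * as webllm from "@mlc-ai/web-llm";

// Rough sketch of streaming from two loaded models concurrently,
// similar in spirit to parallelGeneration() in examples/multi-models.
async function generateInParallel(
  engine: webllm.MLCEngineInterface,
  modelA: string,
  modelB: string,
) {
  const streamOne = async (model: string) => {
    const chunks = await engine.chat.completions.create({
      model, // route the request to one of the loaded models
      messages: [{ role: "user", content: "Write a haiku about GPUs." }],
      stream: true,
    });
    let text = "";
    for await (const chunk of chunks) {
      text += chunk.choices[0]?.delta?.content ?? "";
      // Update the corresponding UI element incrementally here.
    }
    return text;
  };

  // Both models generate at the same time; each still serves one request at a time.
  const [outA, outB] = await Promise.all([streamOne(modelA), streamOne(modelB)]);
  console.log(outA, outB);
}
```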
Closing this issue as completed. Feel free to reopen/open new ones if issues arise!