[NOTICE] Transition from ChatModule to MLCEngine
As we start to formalize the MLC LLM Engine, we are moving toward a more comprehensive API that is OpenAI compatible. This brings a number of new features that let us do more across our backends (a usage sketch of the new API follows the list), including:
- JSON mode and function calls
- Multimodality
- Prefix and prompt caching
- Speculative decoding
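Here is a minimal sketch of the OpenAI-style MLCEngine API; the model id is illustrative and any compiled MLC model should work in its place:

```python
from mlc_llm import MLCEngine

# Illustrative model id; substitute any compiled MLC model.
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Streaming chat completion, mirroring openai-python's interface.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)

engine.terminate()
```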
The project started with ChatModule, which focuses primarily on chat. This is a note to the community that we are planning to phase out ChatModule in favor of MLCEngine. Another advantage is that we will then have a single engine backing all our backends, so a feature enabled in one backend can quickly be enabled in the others.
Transition
As of now ChatModule is still available, but we avoid mentioning it in the docs. The current `mlc_llm chat` is still backed by ChatModule. We are working on a JSONFFIEngine, with pure JSON-string input/output, to expose a broader set of interfaces (as in OpenAI) to a broader set of backends; the transition will happen once JSONFFIEngine lands (a conceptual sketch follows the checklist below). One additional thing we will need is automatic prefix caching to speed up multi-round chat. Backends like iOS and Android will interface with JSONFFIEngine, which has the full set of OpenAI features.
- [x] JSONFFIEngine
- [x] PrefixCache
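To make the JSONFFIEngine idea concrete, here is a toy stand-in for the boundary it defines: pure JSON strings in, pure JSON strings out, so any backend that can pass strings across an FFI can expose the full OpenAI-style surface. The function below is illustrative only, not the real binding:

```python
import json
from typing import Iterator

def chat_completion(request_json: str) -> Iterator[str]:
    """Toy stand-in for the JSONFFIEngine boundary: a real backend would
    forward request_json across the FFI and stream back response strings."""
    request = json.loads(request_json)
    last_user = request["messages"][-1]["content"]
    # Emit OpenAI-style streaming chunks, one word at a time.
    for token in ("echo: " + last_user).split():
        yield json.dumps({"choices": [{"delta": {"content": token + " "}}]})

request = json.dumps({
    "messages": [{"role": "user", "content": "hello world"}],
    "stream": True,
})
for chunk_json in chat_completion(request):
    chunk = json.loads(chunk_json)
    print(chunk["choices"][0]["delta"]["content"], end="")
```

And a toy illustration of why prefix caching speeds up multi-round chat: each round re-sends the whole conversation so far, so the engine only needs to prefill the suffix it has not already computed KV-cache entries for:

```python
def longest_cached_prefix(cached: list[int], request: list[int]) -> int:
    """Length of the shared token prefix whose KV-cache entries can be reused."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n

round1 = [1, 5, 9, 2]        # tokens from the first prompt + reply
round2 = round1 + [7, 3, 8]  # the next round repeats the history plus a new turn
reused = longest_cached_prefix(round1, round2)
print(f"prefill only {len(round2) - reused} of {len(round2)} tokens")
```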
Additionally, we understand that there is a desire to access MLC through a low-level API that directly leverages the TVM runtime. ChatModule and its CLI have been useful for some debugging purposes.
For such low-level debugging, we do not necessarily need the full engine that supports continuous batching and speculative decoding. We introduce a debug chat, https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/testing/debug_chat.py, which features more inspection and single-round input/output generation. We can consider building C++ versions of it as well.
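A minimal usage sketch of the debug chat; the constructor and generate arguments shown here are assumptions, so check debug_chat.py for the actual interface:

```python
from mlc_llm.testing.debug_chat import DebugChat

# Argument names below are assumptions based on the script's purpose;
# verify against debug_chat.py before relying on them.
dc = DebugChat(
    model="./dist/Llama-3-8B-Instruct-q4f16_1-MLC",               # compiled weights + config
    model_lib_path="./dist/libs/Llama-3-8B-Instruct-q4f16_1.so",  # compiled model library
    debug_dir="./debug-out",                                      # where intermediate tensors are dumped
)
dc.generate("What is the capital of Canada?", generate_length=64)
```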
We are still working on some of the above items, so this is not an immediate change, but we would like to raise awareness in the community.
@tqchen Does the new engine still support custom or modified chat/conversation templates in the MLC config? For example, sometimes there are differences between the original model and a fine-tuned model.
@MikeLP yes, we should keep such customization
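For reference, this customization lives in the `conv_template` section of `mlc-chat-config.json`. A minimal sketch of patching it for a fine-tune; the path and field values are illustrative, so diff against a generated config for your model:

```python
import json
from pathlib import Path

# Illustrative path; point this at your compiled model's config.
cfg_path = Path("./dist/MyFineTune-q4f16_1-MLC/mlc-chat-config.json")
cfg = json.loads(cfg_path.read_text())

# Example: a fine-tune that kept the base weights but changed the prompt format.
# Field names follow the current conv_template layout; treat them as assumptions.
cfg["conv_template"]["system_template"] = "<<SYS>>{system_message}<</SYS>>"
cfg["conv_template"]["system_message"] = "You are a concise assistant."

cfg_path.write_text(json.dumps(cfg, indent=2))
```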
https://github.com/mlc-ai/mlc-llm/pull/2279 brings an initial iOS version of MLCEngine
https://github.com/mlc-ai/mlc-llm/pull/2380 transitions the iOS ChatApp to MLCEngine
https://github.com/mlc-ai/mlc-llm/pull/2410 transitions the Android app to MLCEngine
We have completed the transition steps.