Understanding the WASM behind Web-LLM
Hi, I really love the project and have already built several small websites with WebLLM integrated somewhere inside them. Now I'm starting to explore the more technical side, how it's built, and did a deep dive into the code. What I can't find is where the WASM libs for the different LLM architectures live. As far as I've understood (and please correct me if I'm wrong), Web-LLM handles everything up until inference starts, and the actual inference and buildup of the LLM architecture are done in the WASM for performance reasons?
I want to look into the WASM libs to understand the whole inference engine a little better and would appreciate any help 😄
Your phrasing is a bit opaque, but to clarify: it's all managed / orchestrated by WebLLM. There's no magic handoff where WebLLM goes inactive. WebLLM calls the WASM engine for the inference step, waits for the result, and then continues managing the overall process. Hope this helps you a bit.
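To make that control flow concrete, here is a minimal sketch of the pattern under illustrative assumptions. The names `WasmEngine`, `prefill`, `decode`, and `argmaxSample` are placeholders, not WebLLM's actual internals (see llm_chat.ts for the real code); the point is just that the TypeScript side owns the generation loop and calls into the compiled WASM/WebGPU module for each heavy step.

```ts
// Illustrative sketch only: the interface and names are placeholders,
// not WebLLM's real internal API.
interface WasmEngine {
  // Forward pass over the whole prompt; runs inside the compiled WASM/WebGPU module.
  prefill(promptTokens: number[]): Promise<void>;
  // One decoding step; returns logits for the next token.
  decode(): Promise<Float32Array>;
}

const EOS_TOKEN = 2; // placeholder end-of-sequence id

// Greedy sampling kept on the TypeScript side, as a stand-in for real sampling logic.
function argmaxSample(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

async function generate(engine: WasmEngine, promptTokens: number[]): Promise<number[]> {
  const output: number[] = [];
  await engine.prefill(promptTokens); // heavy work happens inside the WASM module
  for (let step = 0; step < 256; step++) {
    const logits = await engine.decode(); // back into WASM for every token
    const next = argmaxSample(logits);    // sampling / stop handling stays in TypeScript
    if (next === EOS_TOKEN) break;
    output.push(next);
  }
  return output;
}
```

In real WebLLM the same orchestration also covers things like tokenization, KV-cache handling, and streaming callbacks, but the shape of the loop is the same.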
Hi, yes, thanks, I have understood that. In the WebLLM project the WASM for the corresponding model is loaded at runtime, and the WASM binaries are also available in a repo. Is the inference code viewable somewhere as open source (as code, not just as a binary)?
Hi there! Thanks for the discussion. MLC-LLM and TVM are the two sources for the implementation of the WASM (both WebGPU kernels and necessary runtime support such as tensor manipulation).
For instance, the following lines in llm_chat.ts:
this.prefill = this.tvm.detachFromCurrentScope(
  this.vm.getFunction("prefill"),
);
loads a compiled function called prefill defined in MLC-LLM. Each model architecture has its own prefill, and here is Llama's: https://github.com/mlc-ai/mlc-llm/blob/d2118b3c9d56da6d1e66dfe2667f650020417010/python/mlc_llm/model/llama/llama_model.py#L330
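For a bit more context, here is a hedged sketch of that lookup pattern. The `CompiledModule` interface and `bindModelFunctions` helper below are illustrative, not tvmjs' real runtime API; the function names "prefill" and "decode" mirror entry points defined per architecture in MLC-LLM. The idea is simply that the compiled WASM module exposes named, architecture-specific functions, and the TypeScript side looks them up and invokes them.

```ts
// Illustrative types only - tvmjs' real runtime API is richer than this.
interface CompiledModule {
  getFunction(name: string): (...args: unknown[]) => unknown;
}

// Each architecture compiled by MLC-LLM exports its own set of named functions;
// the TypeScript runtime binds the ones the generation loop needs, while the
// WebGPU kernels behind them were generated by TVM from the Python model definition.
function bindModelFunctions(vm: CompiledModule) {
  const prefill = vm.getFunction("prefill");
  const decode = vm.getFunction("decode");
  return { prefill, decode };
}
```

So if you want to read the actual inference code, the per-architecture model definitions in MLC-LLM (like the Llama file linked above) plus the TVM runtime are the places to look; the WASM files in the binary repo are the compiled artifacts of those.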