wasm optimization?
When I run the web-llm example (path: /web-llm/examples/simple-chat) and read the source file (@mlc-ai/web-llm/lib/index.js), I notice a lot of interaction with wasm files, which makes the source code hard to follow. Could you explain what each of these wasm files logically contains? I have also noticed that there seems to be room for optimization in the model library files (for example: "model_lib_url": modelLibURLPrefix + modelVersion + "/Llama-3-8B-Instruct-q4f32_1-ctx1k_cs1k-webgpu.wasm"). Should I optimize by modifying the TVM compilation process?
Thanks for the question! The wasm is composed of several parts: the model's kernels (in WGSL) and runtime support (C++ code compiled into WASM).
- The kernels are implemented in MLC-LLM and compiled to WGSL: https://llm.mlc.ai/docs/deploy/webllm.html#bring-your-own-model-library
- Runtime support from MLC-LLM: https://github.com/mlc-ai/mlc-llm/blob/main/web/emcc/mlc_wasm_runtime.cc
- Runtime support from TVM (one of the three runtime files): https://github.com/apache/tvm/blob/main/web/emcc/wasm_runtime.cc
- The kernels and the runtime support (compiled into `.bc`) are then linked together to form the final `.wasm` file (see the sketch after this list): https://github.com/apache/tvm/blob/main/python/tvm/contrib/emcc.py
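To make the last step concrete, here is a minimal toy sketch of the compile-and-link flow. It assumes a TVM checkout where the web runtime has been built (so the prebuilt runtime `.bc` files exist) and emscripten is on the PATH. The one-op kernel is purely illustrative; the real model libraries are produced by the MLC-LLM build driving this same machinery over the full model IR.

```python
import tvm
from tvm import te
from tvm.contrib import emcc

# A toy one-op kernel standing in for the model's kernels.
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.compute(A.shape, lambda i: A[i] + 1.0, name="B")

# GPU kernels need their axes bound to thread/block indices.
s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=64)
s[B].bind(xo, te.thread_axis("blockIdx.x"))
s[B].bind(xi, te.thread_axis("threadIdx.x"))

# "webgpu" emits the device code as WGSL; the wasm32 host target
# compiles the host-side code to WebAssembly.
target = tvm.target.Target(
    "webgpu", host="llvm -mtriple=wasm32-unknown-unknown-wasm"
)
fadd = tvm.build(s, [A, B], target, name="addone")

# create_tvmjs_wasm invokes emcc to link the host object with the
# prebuilt runtime bitcode into the final .wasm.
fadd.export_library("addone.wasm", fcompile=emcc.create_tvmjs_wasm)
```

If I remember correctly, the three runtime `.bc` files linked in are wasm_runtime.bc, tvmjs_support.bc, and webgpu_runtime.bc, built from C++ sources like the ones linked above. So yes: if you want to change what goes into a model library wasm, experimenting with the TVM/MLC compilation flow is the natural place for the kind of optimization you mention.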