[Discussion] Limitations and boundaries of web-llm
To me, web-llm is the most exciting part of this project. I'm designing tools that work within the browser without having to consume an LLM via a service. Of course, I'm speaking more of SLMs than LLMs. My favourite model so far is phi-2, though StableLM 1.6B is very promising! The problem comes with the WASM memory limit 😢
These days I have been setting prefill-chunk-size to 1024, as seen in the binary repo. Why is that the default?
The biggest limit I see right now is the context size. Would it be possible to have an 8K, 16K, or 32K context LLM in web-llm at some point? What's the biggest experiment done so far?
While most of the models we currently support are simply not trained on >4K context lengths, the Mistral model uses sliding window attention, meaning it can technically deal with context longer than 4K; with Attention Sink, it performs even better.
That being said, the current Mistral wasm uses a 4K sliding window size and a 1K prefill chunk size. The 1K prefill chunk size is there to limit memory usage -- TVM plans memory ahead of time, and the prefill chunk size determines how much memory the intermediate matrix multiplications take. Thus, if you want an 8K sliding window size, try tuning the prefill chunk size down to, say, 512 or 256, and it should work memory-wise. But if you usually prefill long text, it will be slower (not sure by how much, though).
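For reference, these knobs correspond to fields such as `sliding_window_size` and `prefill_chunk_size` in the model's mlc-chat-config.json. A minimal sketch of what the tweaked settings could look like, expressed as a TypeScript object (the interface and values are illustrative, and whether your web-llm version lets you override them at runtime rather than at wasm compile time is an assumption):

```typescript
// Sketch only: field names mirror mlc-chat-config.json; the interface itself is
// illustrative and not part of the web-llm API.
interface ContextConfigSketch {
  sliding_window_size: number;  // how far back sliding-window attention can look (Mistral)
  prefill_chunk_size: number;   // how many prompt tokens are processed per prefill pass
  attention_sink_size?: number; // tokens kept at the start when using Attention Sink
}

// Trading prefill speed for memory: a larger window with smaller prefill chunks.
const longContextMistral: ContextConfigSketch = {
  sliding_window_size: 8192, // up from the current 4096
  prefill_chunk_size: 256,   // down from the current 1024 to keep peak memory in check
  attention_sink_size: 4,    // illustrative value
};
```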
Other wasms, like Llama's, currently have a 4K context window size and a 1K prefill chunk size. Note the difference between context window size and prefill chunk size: the former determines the chatting experience, while the latter only determines the tradeoff between memory and speed. I personally do not know of a <3B model that supports a >=8K context window size, but I may be wrong.
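To make that tradeoff concrete, here is a back-of-the-envelope sketch. The number-of-passes calculation follows directly from chunked prefill; the activation-memory formula is a rough linear model of my own for illustration, not how TVM actually plans memory:

```typescript
// Rough sketch of the memory/speed tradeoff controlled by prefill_chunk_size.

/** Number of forward passes needed to prefill a prompt of `promptTokens` tokens. */
function prefillPasses(promptTokens: number, prefillChunkSize: number): number {
  return Math.ceil(promptTokens / prefillChunkSize);
}

/**
 * Very rough estimate of intermediate activation memory for one prefill pass, in bytes.
 * Assumes the peak is dominated by the widest matmul output of a single transformer block;
 * this is an illustrative approximation, not TVM's real memory planner.
 */
function roughActivationBytes(
  prefillChunkSize: number,
  hiddenSize: number,
  intermediateSize: number,
  bytesPerElement: number = 2 // fp16
): number {
  return prefillChunkSize * (hiddenSize + intermediateSize) * bytesPerElement;
}

// Example: prefilling a 4096-token prompt.
console.log(prefillPasses(4096, 1024)); // 4 passes, each with larger intermediates
console.log(prefillPasses(4096, 256));  // 16 passes, each with ~4x smaller intermediates
console.log(roughActivationBytes(1024, 4096, 14336)); // Mistral-7B-like dims, ~37.7 MB
console.log(roughActivationBytes(256, 4096, 14336));  // ~9.4 MB
```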
@CharlieFRuan 🙏
> I personally do not know of a <3B model that supports a >=8K context window size, but I may be wrong.
I'm working right now with a 32K 7B model (not WebGPU). My use case requires a long context. And I want to push the boundaries of 3B, 2.2B, and 1.6B models, let's see...
I was looking for the same thing, and I'm happy to report there are now many small models with large context. Ones I've found:
- 8K H2O-Danube2-1.8b: https://www.reddit.com/r/LocalLLaMA/comments/1by6uuf/h2odanube218b_new_top_sub_2b_model_on_open_llm/
- 16K Ninja Mouse: https://www.reddit.com/r/LocalLLaMA/comments/1bthpxz/if_you_like_it_small_try_ninja_mouse/
- 32K Qwen 1.5 4B (most of their smaller models have 32K): https://huggingface.co/Qwen/Qwen1.5-1.8B
And the elephant in the room: the new Mistral 7B with 32K - the one model to rule them all. https://github.com/mlc-ai/web-llm/issues/349
Due to a myriad of other work, I have not been able to review this lately (or try BNF!).
The last time I was working on this, web-llm was for some reason not working as expected with Mistral 7B. Trying the version of NeuralHermes that @CharlieFRuan uploaded, the results are comparable to GGUF Q2_K. Not very good outputs :(
We need tests and benchmarks here.
Good point; we currently have relatively trivial unit tests, but tests that involve actual WebGPU usage are not included yet. I have not looked into how testing with WebGPU works, but it is definitely something to work on.
@CharlieFRuan I can help with testing. I have been doing it with Cypress tests that use web-llm. Let me prepare something tomorrow, reusing my code.
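Roughly along these lines; the URL, selectors, and timeouts below are placeholders from my own setup, so treat it as a sketch rather than a ready-made spec:

```typescript
// cypress/e2e/webllm-smoke.cy.ts
// Browser-level smoke test for a page that runs web-llm. Selectors and the base URL
// are placeholders for whatever demo app is under test.
describe("web-llm smoke test", () => {
  it("loads a model and produces a non-empty reply", () => {
    cy.visit("/"); // assumes Cypress baseUrl points at the demo app

    // Model download + WebGPU initialization can take minutes, so use generous timeouts.
    cy.get("[data-cy=model-ready]", { timeout: 300_000 }).should("exist");

    cy.get("[data-cy=prompt-input]").type("Say hello in one short sentence.");
    cy.get("[data-cy=send-button]").click();

    cy.get("[data-cy=chat-reply]", { timeout: 120_000 })
      .invoke("text")
      .should("have.length.greaterThan", 0);
  });
});
```

One thing to keep in mind: the CI runner needs a browser with WebGPU enabled (e.g. a recent Chrome, possibly with flags), otherwise the model will never finish loading.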
That'd be great, thank you so much! I also found this blog post that might be related: https://developer.chrome.com/blog/supercharge-web-ai-testing