
Model request: Phi 3 mini 128K

Open flatsiedatsie opened this issue 1 year ago • 3 comments

Seems like a good match for WebLLM, as it was practically designed to run in the browser.

From this reddit thread: https://www.reddit.com/r/LocalLLaMA/comments/1d2o445/comment/l63cvxk/

flatsiedatsie avatar May 29 '24 20:05 flatsiedatsie

Phi3-mini, StableLM 1.6B, Qwen 1.8B were just added to the prebuilt list here: https://github.com/mlc-ai/web-llm/pull/433

Will bump the version to 0.2.39 soon.

Note that the Phi-3 we added is the 4K-context variant, not 128K.

If I understand correctly, to support a 128K context length we need to allocate a KV cache with 128K entries along the sequence dimension, which comes to head_dim * num_layers * num_kv_heads * 2 (k and v) * sizeof(f16) * 128K bytes, i.e. 96 * 32 * 32 * 2 * 2 * 128000 ≈ 46 GB, as opposed to about 1.5 GB for a 4K context length.
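
For reference, a minimal sketch of that arithmetic (the parameter values are the Phi-3-mini figures quoted above; this is not code from WebLLM itself):

```ts
// Back-of-the-envelope KV cache size for Phi-3-mini, using the figures
// from the comment above (head_dim = 96, 32 layers, 32 KV heads, f16).
function kvCacheBytes(contextLength: number): number {
  const headDim = 96;
  const numLayers = 32;
  const numKvHeads = 32;
  const kAndV = 2;        // one K tensor and one V tensor per layer
  const bytesPerF16 = 2;
  return headDim * numLayers * numKvHeads * kAndV * bytesPerF16 * contextLength;
}

console.log((kvCacheBytes(4096) / 2 ** 30).toFixed(2), "GiB");   // ~1.50 GiB
console.log((kvCacheBytes(128000) / 2 ** 30).toFixed(2), "GiB"); // ~46.88 GiB
```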

CharlieFRuan avatar May 29 '24 21:05 CharlieFRuan

Just published 0.2.39; those models are now included in the prebuilt app config!

CharlieFRuan avatar May 30 '24 05:05 CharlieFRuan

Very nice.

If you don't mind, I'll keep this open for now? I think the 128K context version would still offer something valuable to WebLLM.

flatsiedatsie avatar May 31 '24 07:05 flatsiedatsie

npm 0.2.62 now supports Phi3.5-mini: https://github.com/mlc-ai/web-llm/pull/556

Phi-3.5-mini supports up to a 128K context (unlike Phi-3-mini, which only has 4K) thanks to RoPE scaling, which MLC-LLM supports. You can take advantage of this in WebLLM by increasing ModelRecord.overrides.context_window_size, or by specifying context_window_size in ChatOptions when loading the model, as long as there is enough VRAM.
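
As an illustration, something along these lines should work; the model ID and the 32K value are assumptions for this sketch, not taken from the thread, so adjust them to your app config and VRAM budget:

```ts
import * as webllm from "@mlc-ai/web-llm";

// Load Phi-3.5-mini with a larger context window via ChatOptions.
// The model ID below is assumed to be the prebuilt q4f16_1 variant; check the
// prebuilt app config for the exact string available in your version.
const engine = await webllm.CreateMLCEngine(
  "Phi-3.5-mini-instruct-q4f16_1-MLC",
  { initProgressCallback: (p) => console.log(p.text) }, // engine config
  { context_window_size: 32768 }                        // ChatOptions override
);

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Summarize this long document..." }],
});
console.log(reply.choices[0].message.content);
```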

CharlieFRuan avatar Aug 23 '24 16:08 CharlieFRuan

Closing this issue for now, as Phi-3.5 should satisfy the need described. Feel free to open new issues if anything else comes up!

CharlieFRuan avatar Aug 23 '24 16:08 CharlieFRuan

Brilliant, thank you!

flatsiedatsie avatar Aug 23 '24 17:08 flatsiedatsie