mlc-llm
[iOS] - `dolly-v2-3b` latency + accuracy
Hey all,
TLDR
I managed to build and deploy the dolly-v2-3b model to an iPad Pro (M1) running iOS, thanks to the helpful suggestions in #129 and #116. However, both the latency and the accuracy of the model are quite poor (see below).
Why Dolly? I want to use a model open to commercial use. If there are any alternatives beyond Dolly that would work with this library, I'm all ears :)
Observations
- Latency is high: a single response can take several minutes to stream. Is this typical for an M1 device?
- Most output is inaccurate (see the attached examples of a few simple prompts). Occasionally the answer is correct, but it is surrounded by irrelevant or erroneous content. The model also never seems to stop streaming: most responses go on forever, often in loops (in the second example it gets stuck repeating "mix, mix, mix..."). Is this the expected performance of Dolly, or is something misconfigured? (See the Appendix below.)
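To make the looping easier to report on, here is a minimal sketch of how a captured response could be checked for a repeating tail. The helper names (`normalize`, `find_loop`) and the thresholds are my own and are not part of mlc-llm; it only looks at the end of the text.

```python
def normalize(text):
    """Lowercase the text and strip punctuation so "mix," and "mix" compare equal."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    return [w for w in words if w]


def find_loop(words, max_period=8, min_repeats=3):
    """Return the phrase the text ends on if it repeats at least min_repeats times, else None."""
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(words) < needed:
            break  # not enough words left to hold this many repeats
        tail = words[-needed:]
        phrase = tail[:period]
        # The tail is a loop if every word matches the phrase at the same offset.
        if all(tail[i] == phrase[i % period] for i in range(needed)):
            return phrase
    return None


if __name__ == "__main__":
    sample = "Add the flour and mix, mix, mix, mix"
    print(find_loop(normalize(sample)))
```

A check like this could also be used on the app side to cut a response short once a loop is detected, although that only hides the symptom.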
Appendix
Setup instructions below:
Download Model
$ git lfs install && git clone https://huggingface.co/databricks/dolly-v2-3b
Link Model
$ mkdir -p dist/models
$ ln -s /Users/me/Desktop/dolly-v2-3b dist/models/dolly-v2-3b
Build Model
$ python3 build.py --model dolly-v2-3b --dtype float16 --target iphone --quantization-mode int3 --quantization-sym --quantization-storage-nbit 16 --max-seq-len 768
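One thing I am unsure about is whether `--quantization-mode int3` is too aggressive for a 3B model. The sketch below is only a generic illustration of symmetric rounding, not mlc-llm's actual quantization scheme, and `quantize_symmetric` and `max_error` are hypothetical names. It shows that the worst-case rounding error grows as the bit width shrinks.

```python
def quantize_symmetric(values, nbit):
    """Round each value to the nearest of the signed levels available at nbit bits."""
    qmax = 2 ** (nbit - 1) - 1  # e.g. 3 for 3 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]


def max_error(values, nbit):
    """Largest absolute difference between a value and its quantized version."""
    quantized = quantize_symmetric(values, nbit)
    return max(abs(v - q) for v, q in zip(values, quantized))


if __name__ == "__main__":
    # Evenly spaced stand-in "weights" in [-1, 1]; real weights are not uniform.
    weights = [i / 100 for i in range(-100, 101)]
    for nbit in (3, 4, 8):
        print(nbit, max_error(weights, nbit))
```

If accuracy improves with a wider quantization mode, that would point at the build flags and not at Dolly itself.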
Mobile Integration
$ ./prepare_libs.sh
$ ./prepare_params.sh
I had to update both scripts to point to dolly-v2-3b (they were hardcoded to vicuna-v1-7b). I also had to update prepare_params.sh to copy Dolly's tokenizer.json (instead of the .model file) and update the tokenizer path in the mobile code base:
std::string tokenizer_path = bundle_path + "/dist/tokenizer.json";
I also noticed in LLMChat.mm that the model and conv_template properties are hardcoded to vircuna and vicuna_v1.1 respectively:
std::string model = "vircuna";
std::string conv_template = "vicuna_v1.1";
Do these matter in any way, or is it just a namespace thing? Should they be updated to point to Dolly?
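My suspicion is that conv_template controls how the prompt is wrapped before it reaches the model. If so, the Vicuna template would not match what Dolly was trained on. For comparison, this is the prompt format as I understand it from Databricks' own Dolly pipeline code. Treat the constants as my reading of that code; the helper names are mine, and whether mlc-llm applies or stops on anything like this is an assumption.

```python
# Prompt pieces as I understand them from Databricks' Dolly v2 pipeline code.
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"


def build_dolly_prompt(instruction):
    """Wrap a user instruction in the Dolly-style instruction/response format."""
    return f"{INTRO}\n\n{INSTRUCTION_KEY}\n{instruction}\n\n{RESPONSE_KEY}\n"


def truncate_at_end_key(generated):
    """Keep only the text before the end marker, if the model emitted one."""
    return generated.split(END_KEY, 1)[0].strip()


if __name__ == "__main__":
    print(build_dolly_prompt("What is the capital of France?"))
    print(truncate_at_end_key("Paris.\n\n### End\n### Instruction:\n..."))
```

If generation is not stopped at the end marker, the model would be expected to keep producing text after its answer, which matches what I am seeing.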