mlc-llm [iOS] - `dolly-v2-3b` latency + accuracy

Open terryworona opened this issue 2 years ago • 0 comments

Hey all,

TLDR

I managed to build and deploy the dolly-v2-3b model to an M1 iPad Pro (thanks to the helpful suggestions in #129 and #116). However, both the latency and the accuracy of the model are quite poor (see below).

Why Dolly? I want to use a model open to commercial use. If there are any alternatives beyond Dolly that would work with this library, I'm all ears :)

Observations

Latency is very high: it can take several minutes to stream a response. Is this typical for an M1 device?
Most output is inaccurate (see the attachment for a few simple prompts). Occasionally the answer is correct, but it is surrounded by irrelevant or erroneous content. The model also never seems to stop streaming: most prompts go on forever, often in loops (see the second example, where it gets stuck repeating "mix, mix, mix..."). Is this the expected performance of Dolly, or is something misconfigured? (See the Appendix below.)

[Attachment: image of a few sample prompts and the model's responses]

Appendix

Setup instructions below:

Download Model

$ git lfs install && git clone https://huggingface.co/databricks/dolly-v2-3b

Link Model

$ mkdir -p dist/models 
$ ln -s /Users/me/Desktop/dolly-v2-3b dist/models/dolly-v2-3b
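A broken symlink only shows up later as a build failure, so a quick check along these lines can confirm the link resolves before building. This is a minimal sketch: `check_model_link` is a hypothetical helper, not part of mlc-llm, and the file names in the example are my assumption about what the downloaded repo contains.

```shell
#!/bin/sh
# Hypothetical helper (not part of mlc-llm): verify that a model directory,
# possibly a symlink, resolves and contains the files named in the
# remaining arguments.
check_model_link() {
    model_dir=$1
    shift
    # -d follows symlinks, so a dangling link fails here.
    if [ ! -d "$model_dir" ]; then
        echo "missing or broken link: $model_dir" >&2
        return 1
    fi
    for f in "$@"; do
        if [ ! -e "$model_dir/$f" ]; then
            echo "missing file: $model_dir/$f" >&2
            return 1
        fi
    done
    echo "ok: $model_dir"
}

# Example (file names are assumptions about the downloaded repo):
# check_model_link dist/models/dolly-v2-3b config.json tokenizer.json
```

The helper returns non-zero on the first problem it finds, so it can gate the build step in a script.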

Build Model

$ python3 build.py --model dolly-v2-3b --dtype float16 --target iphone --quantization-mode int3 --quantization-sym --quantization-storage-nbit 16 --max-seq-len 768

Mobile Integration

$ ./prepare_libs.sh
$ ./prepare_params.sh

I had to update both scripts to point to dolly-v2-3b (previously hardcoded to vicuna-v1-7b). I also had to update prepare_params.sh to move Dolly's tokenizer.json (instead of tokenizer.model) and update the mobile code to match:

 std::string tokenizer_path = bundle_path + "/dist/tokenizer.json";
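For the script changes, the rename can be sketched roughly as follows. I'm assuming the only thing hardcoded is the literal string vicuna-v1-7b; `retarget_model` is a hypothetical helper I'm using for illustration, and the scripts may need further manual edits (such as the tokenizer file name) beyond a plain substitution.

```shell
#!/bin/sh
# Hypothetical helper: replace every occurrence of one model name with
# another in the given files. The names are used as sed patterns, so they
# should not contain regex metacharacters or slashes. Output goes through a
# temp file because `sed -i` syntax differs between macOS and GNU sed.
retarget_model() {
    old=$1
    new=$2
    shift 2
    for script in "$@"; do
        tmp=$(mktemp) || return 1
        # `cat > file` rewrites in place and keeps the executable bit.
        sed "s/$old/$new/g" "$script" > "$tmp" && cat "$tmp" > "$script" \
            || { rm -f "$tmp"; return 1; }
        rm -f "$tmp"
    done
}

# Example:
# retarget_model vicuna-v1-7b dolly-v2-3b prepare_libs.sh prepare_params.sh
```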

I also noticed in LLMChat.mm that the model and conv_template properties are hardcoded to vircuna and vicuna_v1.1 respectively:

std::string model = "vircuna";
std::string conv_template = "vicuna_v1.1";

Do these matter in any way, or is it just a namespace thing? Should they be updated to point to Dolly?

May 16 '23 19:05 terryworona
