
[Doc] List of SLM Supported Models

Open junrushao opened this issue 2 years ago • 8 comments

PR #1508 yesterday makes it possible to JIT-generate model libs on device, which means that from now on, only model weights strictly need to be downloaded to run a model. This PR further simplifies the workflow with automatic model weight downloading/caching, so that the entire "quick start" flow can be reduced to a two-liner Python script:

from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

cm = ChatModule(MODEL, device="auto")
cm.generate("What is the meaning of life?", progress_callback=StreamToStdout(callback_interval=2))

where MODEL can be any MLC-released model, such as "HF://junrushao/Llama-2-7b-chat-hf-q4f16_1-MLC".

To diversify the choice of MODEL, I have had a simple script running since yesterday, and it has uploaded 40-50 model prebuilts. The script also produces a markdown table, which I uploaded as MODEL_PREBUILTS.md.

The remaining problems on this front include:

  • Figure out the relationship between AOT and JIT in the documentation, and the relationship between the existing table and the new one.
  • Quantize all existing models - there aren't that many variants of model architecture, and for each supported architecture, the only thing we need to do is assemble a list, then find a personal laptop and leave it running.
  • Again, documentation refactoring - let's set the expectation that only advanced users need to understand words like "compilation", "quantization", "linking", "cmake", etc.

junrushao avatar Dec 29 '23 19:12 junrushao

CC: @tqchen @CharlieFRuan @davidpissarra

junrushao avatar Dec 29 '23 19:12 junrushao

There were a couple of failed cases throughout my experiments:

{
  "destination": "{username}/{model_id}-{quantization}-MLC",
  "default_quantization": ["q3f16_1", "q4f16_1", "q4f32_1"],
  "tasks": [
    {"model_id": "llama2_7b_chat_uncensored", "model": "https://huggingface.co/georgesung/llama2_7b_chat_uncensored", "context_window_size": 4096, "conv_template": "llama-default"},
    {"model_id": "open_llama_3b", "model": "https://huggingface.co/openlm-research/open_llama_3b", "context_window_size": 2048, "conv_template": "llama-default"},
    {"model_id": "open_llama_7b", "model": "https://huggingface.co/openlm-research/open_llama_7b", "context_window_size": 2048, "conv_template": "llama-default"},
    {"model_id": "open_llama_13b", "model": "https://huggingface.co/openlm-research/open_llama_13b", "context_window_size": 2048, "conv_template": "llama-default"},
    {"model_id": "stablecode-instruct-alpha-3b", "model": "https://huggingface.co/stabilityai/stablecode-instruct-alpha-3b", "context_window_size": 4096, "conv_template": "stablecode_instruct"},
    {"model_id": "starcoder", "model": "https://huggingface.co/bigcode/starcoder", "context_window_size": 8192, "conv_template": "LM"},
    {"model_id": "gpt_bigcode-santacoder", "model": "https://huggingface.co/bigcode/gpt_bigcode-santacoder", "context_window_size": 2048, "conv_template": "LM"}
  ]
}
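
For reference, here is a small sketch (not the actual batch script) of how the "destination" template above expands for each task/quantization pair; the filename tasks.json and the username are illustrative:

import json

with open("tasks.json") as f:  # the config shown above, saved locally
    cfg = json.load(f)

for task in cfg["tasks"]:
    for quant in cfg["default_quantization"]:
        # e.g. "junrushao/open_llama_3b-q4f16_1-MLC"
        print(cfg["destination"].format(
            username="junrushao",
            model_id=task["model_id"],
            quantization=quant,
        ))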

junrushao avatar Dec 29 '23 20:12 junrushao

Thank you for the work! Regarding the big code models, I think they should be fixed by https://github.com/mlc-ai/mlc-llm/pull/1515

CharlieFRuan avatar Dec 30 '23 12:12 CharlieFRuan

@LeshengJin has been working closely with me on this direction, and he found that:

I tested all models uploaded. Most of the models worked well, but the following models produced garbage, which means we need to debug the accuracy issue:

  • gpt2 (Runtime Error: shape mismatch)
  • gpt2-medium (Runtime Error: shape mismatch)
  • WizardLM-7B-V1.0
  • WizardLM-30B-V1.0
  • WizardCoder-15B-V1.0
  • dolly-v2-12b
  • pythia-1.4b

@CharlieFRuan would you like to look further into the GPT2 and WizardLM ones, given you guys have been contributing a lot on this front? Thanks a bunch!

Link to the rendered markdown table: https://github.com/mlc-ai/mlc-llm/blob/68bb30cbb0a5d551bafda13b5e9dd7818d11f6e5/MODEL_PREBUILTS.md

junrushao avatar Dec 30 '23 23:12 junrushao

For WizardLM-7B-V1.0 and WizardLM-30B-V1.0, their weights on HF are weight deltas, and I think they are somewhat obsolete already (they are pre-Llama-2). We can probably just support WizardLM-13B-V1.2 and WizardLM-70B-V1.0 (like our old-workflow prebuilts).

For WizardCoder-15B-V1.0, I recompiled the weights and models, but it runs into CUDA: out of memory despite there being plenty of memory (I was able to run other models):

/path/to/tvm-unity/src/runtime/memory/pooled_allocator.h:65: Warning: PooledAllocator got InternalError during allocation: InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory

~~@davidpissarra Could this be something related to https://github.com/mlc-ai/mlc-llm/pull/1515?~~ Setting a smaller context window size fixes this.

The GPT2s seem to work fine on my end (the bad output is probably due to the model itself; I used a GPT2-based model for a MusicLM and its output was fine).

CharlieFRuan avatar Dec 31 '23 08:12 CharlieFRuan

Thanks for getting back to me so quickly @CharlieFRuan!

For WizardLM-7B-V1.0 and WizardLM-30B-V1.0, their weights on HF are weight deltas (https://github.com/mlc-ai/mlc-llm/pull/489); and I think they are somewhat obsolete already (they are pre-Llama-2). We can probably just support WizardLM-13B-V1.2 and WizardLM-70B-V1.0 (like our old-workflow prebuilts).

Sounds fair! Would you mind helping me deprecate the Wizard V1.0s by removing them from https://docs.google.com/spreadsheets/d/1Jka-soYrRSGr2hQ5LPkaY8HeGIeUBnHaR72x_AKZ2V8/? Also, please fill in more models you know of, and I will just keep my workstation running.

For WizardCoder-15B-V1.0, I recompiled the weights and models, but runs into CUDA: out of memory despite there is plenty memory (was able run other models):

I'm not completely sure, but my guess is that context_window_size defaults to 8192, as read from mlc-chat-config.json, which may require too much VRAM in the prefill method. PR #1522 allows us to tweak context_window_size and prefill_chunk_size in ChatConfig to control the JIT behavior - you may want to try it out!
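
Something along these lines should work (a minimal sketch, assuming the ChatConfig fields added in PR #1522 and that ChatModule accepts a chat_config argument; the model id is illustrative):

from mlc_chat import ChatModule, ChatConfig
from mlc_chat.callback import StreamToStdout

# Cap both knobs well below the 8192 default to bound prefill VRAM usage.
cm = ChatModule(
    "HF://junrushao/WizardCoder-15B-V1.0-q4f16_1-MLC",  # illustrative model id
    device="auto",
    chat_config=ChatConfig(context_window_size=2048, prefill_chunk_size=2048),
)
cm.generate("What is the meaning of life?", progress_callback=StreamToStdout(callback_interval=2))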

GPT2s seem to work fine on my end (bad output probably due to the model

Got it! It might just be GPT2 being too small. Thanks for the confirmation!

There are a couple of follow-up items I'd love to bring up:

  • B1. I noticed that when I set context_window_size=1024 and prefill_chunk_size=4096, compilation still assumes prefill accepts sequence lengths up to 4096, according to the memory usage information it prints on screen during compilation (see the log below). This is probably a behavior we want to fix.
[2023-12-31 00:42:16] INFO estimate_memory_usage.py:55: [Memory usage] Function `_initialize_effect`: 0.00 MB
[2023-12-31 00:42:17] INFO estimate_memory_usage.py:55: [Memory usage] Function `decode`: 0.21 MB
[2023-12-31 00:42:17] INFO estimate_memory_usage.py:55: [Memory usage] Function `prefill`: 354.13 MB
[2023-12-31 00:42:17] INFO estimate_memory_usage.py:55: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
  • B2. It is actually possible to print out the estimated memory usage of an SLM-compiled model lib even before parameter loading, by inspecting its metadata. The total VRAM is "params + max temp memory across methods + KV cache" (see the sketch right below this list). It would be super helpful for end users to learn exactly how much memory a model needs, so they can tweak ChatConfig accordingly, given there have been tons of user reports about OOM.
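
A back-of-the-envelope sketch of that estimate, assuming we can read per-function temp memory from the metadata (the function structure and the weight/KV-cache numbers below are illustrative, not the actual metadata schema; the per-function values come from the log in B1):

def estimate_vram_mb(params_mb, temp_mb_per_func, kv_cache_mb):
    # Total VRAM = params + max temp memory across methods + KV cache.
    return params_mb + max(temp_mb_per_func.values()) + kv_cache_mb

print(estimate_vram_mb(
    params_mb=3615.13,  # hypothetical weight size
    temp_mb_per_func={"_initialize_effect": 0.00, "decode": 0.21, "prefill": 354.13},
    kv_cache_mb=512.0,  # hypothetical KV cache budget
))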

junrushao avatar Dec 31 '23 08:12 junrushao

@junrushao Ahh yes! WizardCoder is fixed by setting the context window size to something smaller -- I think it is working fine now! I added a WizardLM and two smaller WizardCoder-Python models to the spreadsheet.

For B1, yes, I think we can enforce the constraint prefill_chunk_size <= context_window_size or sliding_window_size during gen_config and compile; a minimal sketch of such a check follows below. Strictly speaking, it is okay for prefill_chunk_size to be larger than sliding_window_size, in which case we simply cache the last sliding_window_size tokens, but I think it is fine for us to disregard that possibility. cc @davidpissarra
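
Here is what such a check might look like (function and parameter names are illustrative, not the actual gen_config code):

def validate_prefill_chunk_size(prefill_chunk_size,
                                context_window_size=None,
                                sliding_window_size=None):
    # Reject configs where the prefill chunk exceeds the effective window.
    limit = context_window_size if context_window_size is not None else sliding_window_size
    if limit is not None and prefill_chunk_size > limit:
        raise ValueError(
            f"prefill_chunk_size ({prefill_chunk_size}) must not exceed "
            f"context_window_size/sliding_window_size ({limit})")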

I would love to help push on this front, but I am afraid that I do not have much bandwidth before mid-January... I am prioritizing some effort on the web-llm front. Apologies:((

CharlieFRuan avatar Dec 31 '23 09:12 CharlieFRuan

I would love to help push on this front, but I am afraid that I do not have much bandwidth before mid-January... I am prioritizing some effort on the web-llm front. Apologies:((

No worries! It's done in #1525.

junrushao avatar Jan 01 '24 04:01 junrushao

Closing, as the delivery flow has now landed.

tqchen avatar Jun 07 '24 13:06 tqchen