
[Bug] gorilla-openfunctions-v1-q4f16_1-MLC crashes on JIT lib build on cuda12.2

Open Sing-Li opened this issue 10 months ago • 7 comments

🐛 Bug

Trying to serve gorilla-openfunctions-v1 crashes during the initial JIT library build. The same happens with openfunctions v2, in both q4f16 and q4f32.

To Reproduce

Steps to reproduce the behavior:

  1. Install the CUDA 12.2 nightly build
  2. `mlc_llm serve HF://mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC`
  3. Crashes with the log attached below
Found device: cuda:0
[2024-04-10 00:12:41] INFO auto_device.py:85: Not found device: rocm:0
[2024-04-10 00:12:42] INFO auto_device.py:85: Not found device: metal:0
[2024-04-10 00:12:43] INFO auto_device.py:85: Not found device: vulkan:0
[2024-04-10 00:12:44] INFO auto_device.py:85: Not found device: opencl:0
[2024-04-10 00:12:44] INFO auto_device.py:33: Using device: cuda:0
[2024-04-10 00:12:44] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC
[2024-04-10 00:12:44] INFO download.py:40: [Git] Cloning https://huggingface.co/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC.git to /tmp/tmpgtd319_l/tmp
[2024-04-10 00:12:45] INFO download.py:76: [Git LFS] Downloading 1 files with Git LFS: ['tokenizer.model']
[2024-04-10 00:12:45] INFO download.py:79: [Git LFS] Downloading tokenizer.model
100%|██████████| 1/1 [00:00<00:00,  1.83it/s]
[2024-04-10 00:12:47] INFO download.py:152: Downloaded https://huggingface.co/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC/resolve/main/params_shard_1.bin to /tmp/tmpgtd319_l/tmp/params_shard_1.bin
...

[2024-04-10 00:13:29] INFO download.py:152: Downloaded https://huggingface.co/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC/resolve/main/params_shard_112.bin to /tmp/tmpgtd319_l/tmp/params_shard_112.bin
100%|██████████| 115/115 [00:43<00:00,  2.62it/s]
[2024-04-10 00:13:29] INFO download.py:153: Moving /tmp/tmpgtd319_l/tmp to /root/.cache/mlc_llm/model_weights/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC
Traceback (most recent call last):
  File "/usr/local/bin/mlc_llm", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/__main__.py", line 41, in main
    cli.main(sys.argv[2:])
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/cli/serve.py", line 75, in main
    serve(
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/serve.py", line 42, in serve
    engine = async_engine.AsyncThreadedEngine(
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/async_engine.py", line 274, in __init__
    ) = _process_model_args(models)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/engine.py", line 125, in _process_model_args
    model_args: List[Any] = sum(
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/engine.py", line 126, in <genexpr>
    (_convert_model_info(model) for model in models),
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/engine.py", line 101, in _convert_model_info
    assert isinstance(chat_config.conv_template, Conversation)
AssertionError

Expected behavior

Serving should work, as it does with Llama, Mistral, and Gemma.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA 12.2
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 20.04 LTS
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): RTX 3060 12 GB
  • How you installed MLC-LLM (conda, source): nightly
  • How you installed TVM-Unity (pip, source): nightly prebuilt
  • Python version (e.g. 3.10): 3.11
  • GPU driver version (if applicable):
  • CUDA/cuDNN version (if applicable):
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
  • Any other relevant information:

Additional context

Sing-Li avatar Apr 10 '24 00:04 Sing-Li

Thank you @Sing-Li for reporting! That is because the mlc-chat-config.json in the prebuilt weight repo was not updated.

I just updated the conv_template field https://huggingface.co/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC/commit/e83c4a2bbb4735c1ccde096dae0df635dd172310 and I think it should be good now. Would you mind trying again?
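For context, here is a simplified, hypothetical Python sketch (the class, registry, and function names below are illustrative stand-ins, not mlc_llm's real internals) of why a stale `conv_template` string in `mlc-chat-config.json` trips the `assert isinstance(chat_config.conv_template, Conversation)` check: the loader resolves the string against a set of known templates, and an unrecognized name never becomes a `Conversation` object.

```python
# Simplified, hypothetical sketch of the failing check: the engine
# expects chat_config.conv_template to already be a Conversation
# object. If mlc-chat-config.json carries a template name the
# installed package does not recognize, resolution leaves it as a
# plain string and the isinstance assertion fails.

class Conversation:
    """Stand-in for mlc_llm's Conversation (the real class is richer)."""
    def __init__(self, name: str):
        self.name = name

# Hypothetical registry of known template names.
CONV_TEMPLATE_REGISTRY = {"gorilla": Conversation("gorilla")}

def resolve_conv_template(value: str):
    # Known names resolve to Conversation objects; unknown names
    # pass through unchanged, which later trips the assertion.
    return CONV_TEMPLATE_REGISTRY.get(value, value)

ok = resolve_conv_template("gorilla")
stale = resolve_conv_template("some-outdated-name")
print(type(ok).__name__, type(stale).__name__)  # Conversation str
```

This is why updating the `conv_template` field in the prebuilt weight repo fixes the crash without any code change on the client side.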

MasterJH5574 avatar Apr 10 '24 14:04 MasterJH5574

Thank you @MasterJH5574 It works fine now. Closing the issue.

Sing-Li avatar Apr 10 '24 15:04 Sing-Li

Sorry, @MasterJH5574 Is it possible to update the configs for the other two gorilla function weights as well 🙏

https://huggingface.co/mlc-ai/gorilla-openfunctions-v2-q4f32_1-MLC

https://huggingface.co/mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC

Sing-Li avatar Apr 10 '24 16:04 Sing-Li

Hey @Sing-Li, sorry for the late reply. Just updated these two repositories. If I remember correctly, there might still be some output formatting issue for the function calling of gorilla v2. Could you try a bit at your convenience and see how it goes?

MasterJH5574 avatar Apr 15 '24 13:04 MasterJH5574

Thanks @MasterJH5574

Test results:
gorilla-openfunctions-v2-q4f32_1

  • chat - seems to work
  • serve - I only have 12GB VRAM and serve ran out of memory
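The OOM is plausibly a KV cache sizing issue. A rough back-of-envelope sketch, using the estimates the q4f16 serve log prints further down (13445-token `max_total_sequence_length` costing about 6479 MB of KV cache), gives the per-token cost and how much capping the context would save; this is simple arithmetic on the logged figures, not the engine's actual estimator:

```python
# Rough back-of-envelope from the serve log's printed estimates:
# KVCache: 6479.40 MB for max_total_sequence_length = 13445 tokens.
kv_cache_mb = 6479.40
max_tokens = 13445

mb_per_token = kv_cache_mb / max_tokens  # roughly 0.48 MB per token

# If the context were capped at 4096 tokens (hypothetical override),
# the KV cache would shrink proportionally:
capped_kv_mb = mb_per_token * 4096

print(f"{mb_per_token:.2f} MB/token, ~{capped_kv_mb:.0f} MB at 4096 tokens")
```

The log itself points at the relevant knobs: `prefill_chunk_size`, `context_window_size`, and `sliding_window_size`.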

gorilla-openfunctions-v2-q4f16_1

  • chat - crashes with the following dump
[2024-04-16 04:10:14] INFO estimate_memory_usage.py:57: [Memory usage] Function `sampler_take_probs`: 0.00 MB
[2024-04-16 04:10:14] INFO estimate_memory_usage.py:57: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-04-16 04:10:14] INFO pipeline.py:50: Compiling external modules
[2024-04-16 04:10:14] INFO pipeline.py:50: Compilation complete! Exporting to disk
[2024-04-16 04:10:31] INFO model_metadata.py:96: Total memory usage: 4169.98 MB (Parameters: 3707.35 MB. KVCache: 0.00 MB. Temporary buffer: 462.62 MB)
[2024-04-16 04:10:31] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-04-16 04:10:31] INFO compile.py:198: Generated: /tmp/tmphmrwlwhl/lib.so
[2024-04-16 04:10:31] INFO jit.py:98: Using compiled model lib: /root/.cache/mlc_llm/model_lib/5c413127c1217b4fc4779c7be427b220.so
[2024-04-16 04:10:32] INFO model_metadata.py:96: Total memory usage: 4169.98 MB (Parameters: 3707.35 MB. KVCache: 0.00 MB. Temporary buffer: 462.62 MB)
[2024-04-16 04:10:32] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
You can use the following special commands:
 /help               print the special commands
 /exit               quit the cli
 /stats              print out the latest stats (token/sec)
 /reset              restart a fresh chat
 /set [overrides]    override settings in the generation config. For example,
                     `/set temperature=0.5;max_gen_len=100;stop=end,stop`
                     Note: Separate stop words in the `stop` option with commas (,).
 Multi-line input: Use escape+enter to start a new line.

Traceback (most recent call last):
 File "/usr/local/bin/mlc_llm", line 8, in <module>
   sys.exit(main())
 File "/usr/local/lib/python3.10/dist-packages/mlc_llm/__main__.py", line 37, in main
   cli.main(sys.argv[2:])
 File "/usr/local/lib/python3.10/dist-packages/mlc_llm/cli/chat.py", line 41, in main
   chat(
 File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/chat.py", line 135, in chat
   cm._process_system_prompts()  # pylint: disable=protected-access
 File "/usr/local/lib/python3.10/dist-packages/mlc_llm/chat_module.py", line 1228, in _process_system_prompts
   self._process_system_prompts_func()
 File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
 File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
 File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
 File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
 File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
   raise py_err
tvm._ffi.base.TVMError: TVMError: Unsupported layout: 0

Running serve also crashes with the same error when a REST completion request comes in:

[2024-04-16 04:11:59] INFO auto_device.py:76: Found device: cuda:0
[2024-04-16 04:12:00] INFO auto_device.py:85: Not found device: rocm:0
[2024-04-16 04:12:01] INFO auto_device.py:85: Not found device: metal:0
[2024-04-16 04:12:02] INFO auto_device.py:85: Not found device: vulkan:0
[2024-04-16 04:12:03] INFO auto_device.py:85: Not found device: opencl:0
[2024-04-16 04:12:03] INFO auto_device.py:33: Using device: cuda:0
[2024-04-16 04:12:03] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC
[2024-04-16 04:12:03] INFO download.py:131: Weights already downloaded: /root/.cache/mlc_llm/model_weights/mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC
[2024-04-16 04:12:03] INFO jit.py:35: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-04-16 04:12:03] INFO jit.py:117: Using cached model lib: /root/.cache/mlc_llm/model_lib/5c413127c1217b4fc4779c7be427b220.so
[2024-04-16 04:12:05] INFO engine_base.py:241: Estimated KVCacheConfig "max_total_sequence_length": 13445.
[2024-04-16 04:12:05] INFO engine_base.py:246: Estimated total single GPU memory usage: 10839.99 MB (Parameters: 3707.35 MB. KVCache: 6479.40 MB. Temporary buffer: 653.24 MB)
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/engine_base.py", line 602, in _background_loop
    self._ffi["run_background_loop"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: TVMError: Unsupported layout: 0
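For reproducing, a minimal OpenAI-style payload of the following shape should be enough to trigger the background-loop crash above (assuming `mlc_llm serve` exposes the usual OpenAI-compatible `/v1/chat/completions` route; the prompt text and exact request are illustrative, and the actual network call is commented out so the snippet stands alone):

```python
import json
from urllib import request

# Hypothetical minimal completion request; at the time, any request
# reaching the engine triggered "Unsupported layout: 0".
payload = {
    "model": "HF://mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC",
    "messages": [{"role": "user", "content": "What's the weather in Boston?"}],
    "stream": False,
}
body = json.dumps(payload).encode("utf-8")

req = request.Request(
    "http://0.0.0.0:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# request.urlopen(req)  # uncomment with the server running
```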

Sing-Li avatar Apr 16 '24 04:04 Sing-Li

Thank you @Sing-Li for checking again. This issue https://github.com/mlc-ai/mlc-llm/issues/2121#issuecomment-2049258529 also reports a similar error. We will look into it.

MasterJH5574 avatar Apr 16 '24 04:04 MasterJH5574

Hi @Sing-Li @ollmer, we have fixed this issue in the latest pip package. Please update the packages and try again, thank you!

MasterJH5574 avatar Apr 19 '24 05:04 MasterJH5574