mlc-llm
[Bug] gorilla-openfunctions-v1-q4f16_1-MLC crashes on JIT lib build on cuda12.2
🐛 Bug
Trying to serve Gorilla OpenFunctions v1 crashes during the initial JIT library build. The same happens with OpenFunctions v2, in both f16 and f32 quantizations.
To Reproduce
Steps to reproduce the behavior:
- install the CUDA 12.2 nightly build
- run `mlc_llm serve HF://mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC`
- it crashes with the log attached below
Found device: cuda:0
[2024-04-10 00:12:41] INFO auto_device.py:85: Not found device: rocm:0
[2024-04-10 00:12:42] INFO auto_device.py:85: Not found device: metal:0
[2024-04-10 00:12:43] INFO auto_device.py:85: Not found device: vulkan:0
[2024-04-10 00:12:44] INFO auto_device.py:85: Not found device: opencl:0
[2024-04-10 00:12:44] INFO auto_device.py:33: Using device: cuda:0
[2024-04-10 00:12:44] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC
[2024-04-10 00:12:44] INFO download.py:40: [Git] Cloning https://huggingface.co/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC.git to /tmp/tmpgtd319_l/tmp
[2024-04-10 00:12:45] INFO download.py:76: [Git LFS] Downloading 1 files with Git LFS: ['tokenizer.model']
[2024-04-10 00:12:45] INFO download.py:79: [Git LFS] Downloading tokenizer.model
100%|██████████| 1/1 [00:00<00:00, 1.83it/s]
[2024-04-10 00:12:47] INFO download.py:152: Downloaded https://huggingface.co/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC/resolve/main/params_shard_1.bin to /tmp/tmpgtd319_l/tmp/params_shard_1.bin
...
[2024-04-10 00:13:29] INFO download.py:152: Downloaded https://huggingface.co/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC/resolve/main/params_shard_112.bin to /tmp/tmpgtd319_l/tmp/params_shard_112.bin
100%|██████████| 115/115 [00:43<00:00, 2.62it/s]
[2024-04-10 00:13:29] INFO download.py:153: Moving /tmp/tmpgtd319_l/tmp to /root/.cache/mlc_llm/model_weights/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC
Traceback (most recent call last):
File "/usr/local/bin/mlc_llm", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/__main__.py", line 41, in main
cli.main(sys.argv[2:])
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/cli/serve.py", line 75, in main
serve(
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/serve.py", line 42, in serve
engine = async_engine.AsyncThreadedEngine(
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/async_engine.py", line 274, in __init__
) = _process_model_args(models)
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/engine.py", line 125, in _process_model_args
model_args: List[Any] = sum(
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/engine.py", line 126, in <genexpr>
(_convert_model_info(model) for model in models),
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/engine.py", line 101, in _convert_model_info
assert isinstance(chat_config.conv_template, Conversation)
AssertionError
Expected behavior
It should work, as it does with Llama, Mistral, and Gemma.
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA 12.2
- Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 20.04 LTS
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): 3060 12 GB
- How you installed MLC-LLM (conda, source): nightly
- How you installed TVM-Unity (pip, source): nightly prebuilt
- Python version (e.g. 3.10): 3.11
- GPU driver version (if applicable):
- CUDA/cuDNN version (if applicable):
- TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models):
- Any other relevant information:
Additional context
Thank you @Sing-Li for reporting! That is because the mlc-chat-config.json in the prebuilt weight repo was not updated. I just updated the conv_template field (https://huggingface.co/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC/commit/e83c4a2bbb4735c1ccde096dae0df635dd172310), and I think it should be good now. Would you mind trying again?
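For anyone hitting this before the updated weights are re-downloaded, one way to check the cached copy is to look at the conv_template field of mlc-chat-config.json directly. A minimal sketch, assuming the cache path shown in the download logs above; the serialized shape of the fixed field is an assumption:

```python
# Hedged check: does the cached mlc-chat-config.json still carry the old
# string-valued conv_template (which trips the AssertionError above), or the
# updated structured conversation object? Path taken from the logs in this
# issue; the field layout after the fix is an assumption.
import json
from pathlib import Path

cfg_path = (
    Path("/root/.cache/mlc_llm/model_weights/mlc-ai")
    / "gorilla-openfunctions-v1-q4f16_1-MLC"
    / "mlc-chat-config.json"
)
cfg = json.loads(cfg_path.read_text())
conv = cfg.get("conv_template")

if isinstance(conv, str):
    print(f"conv_template is still the bare name {conv!r} -- re-download the weights")
else:
    print("conv_template looks like a structured conversation object")
```

If the old string-valued field is still there, deleting the cached weight directory and re-running the serve command forces a fresh download of the updated repo.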
Thank you @MasterJH5574! It works fine now. Closing the issue.
Sorry, @MasterJH5574, is it possible to update the configs for the other two Gorilla OpenFunctions weights as well? 🙏
https://huggingface.co/mlc-ai/gorilla-openfunctions-v2-q4f32_1-MLC
https://huggingface.co/mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC
Hey @Sing-Li, sorry for the late reply. Just updated these two repositories. If I remember correctly, there might still be some output formatting issue for the function calling of gorilla v2. Could you try a bit at your convenience and see how it goes?
Thanks @MasterJH5574
Test results:
gorilla-openfunctions-v2-q4f32_1
- chat - seems to work
- serve - I only have 12 GB VRAM and `serve` ran out of memory (a possible memory-saving tweak is sketched below)
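Regarding the q4f32_1 out-of-memory: the compiler logs below suggest tweaking `prefill_chunk_size` / `context_window_size`. A minimal, hedged sketch of that workaround, assuming the cached config path follows the same pattern as the logs above and that the serve engine's KV cache estimate respects the smaller window:

```python
# Hedged sketch: shrink the context window in the cached config so the KV
# cache estimate fits in 12 GB. Path is assumed from the download logs in
# this thread; the chosen value (4096) is arbitrary and the effect on the
# serve engine's memory estimate is an assumption based on the compiler's
# own hint about `context_window_size`.
import json
from pathlib import Path

cfg_path = (
    Path("/root/.cache/mlc_llm/model_weights/mlc-ai")
    / "gorilla-openfunctions-v2-q4f32_1-MLC"
    / "mlc-chat-config.json"
)
cfg = json.loads(cfg_path.read_text())
cfg["context_window_size"] = 4096  # down from the model default
cfg_path.write_text(json.dumps(cfg, indent=2))
# Re-running `mlc_llm serve` afterwards should trigger a fresh JIT build
# with the smaller window (MLC_JIT_POLICY defaults to ON).
```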
gorilla-openfunctions-v2-q4f16_1
- chat - crashes with the following dump
[2024-04-16 04:10:14] INFO estimate_memory_usage.py:57: [Memory usage] Function `sampler_take_probs`: 0.00 MB
[2024-04-16 04:10:14] INFO estimate_memory_usage.py:57: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-04-16 04:10:14] INFO pipeline.py:50: Compiling external modules
[2024-04-16 04:10:14] INFO pipeline.py:50: Compilation complete! Exporting to disk
[2024-04-16 04:10:31] INFO model_metadata.py:96: Total memory usage: 4169.98 MB (Parameters: 3707.35 MB. KVCache: 0.00 MB. Temporary buffer: 462.62 MB)
[2024-04-16 04:10:31] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-04-16 04:10:31] INFO compile.py:198: Generated: /tmp/tmphmrwlwhl/lib.so
[2024-04-16 04:10:31] INFO jit.py:98: Using compiled model lib: /root/.cache/mlc_llm/model_lib/5c413127c1217b4fc4779c7be427b220.so
[2024-04-16 04:10:32] INFO model_metadata.py:96: Total memory usage: 4169.98 MB (Parameters: 3707.35 MB. KVCache: 0.00 MB. Temporary buffer: 462.62 MB)
[2024-04-16 04:10:32] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out the latest stats (token/sec)
/reset restart a fresh chat
/set [overrides] override settings in the generation config. For example,
`/set temperature=0.5;max_gen_len=100;stop=end,stop`
Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.
Traceback (most recent call last):
File "/usr/local/bin/mlc_llm", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/__main__.py", line 37, in main
cli.main(sys.argv[2:])
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/cli/chat.py", line 41, in main
chat(
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/chat.py", line 135, in chat
cm._process_system_prompts() # pylint: disable=protected-access
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/chat_module.py", line 1228, in _process_system_prompts
self._process_system_prompts_func()
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: TVMError: Unsupported layout: 0
Running `serve` also crashes with the same error when a REST completion request comes in:
[2024-04-16 04:11:59] INFO auto_device.py:76: Found device: cuda:0
[2024-04-16 04:12:00] INFO auto_device.py:85: Not found device: rocm:0
[2024-04-16 04:12:01] INFO auto_device.py:85: Not found device: metal:0
[2024-04-16 04:12:02] INFO auto_device.py:85: Not found device: vulkan:0
[2024-04-16 04:12:03] INFO auto_device.py:85: Not found device: opencl:0
[2024-04-16 04:12:03] INFO auto_device.py:33: Using device: cuda:0
[2024-04-16 04:12:03] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC
[2024-04-16 04:12:03] INFO download.py:131: Weights already downloaded: /root/.cache/mlc_llm/model_weights/mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC
[2024-04-16 04:12:03] INFO jit.py:35: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-04-16 04:12:03] INFO jit.py:117: Using cached model lib: /root/.cache/mlc_llm/model_lib/5c413127c1217b4fc4779c7be427b220.so
[2024-04-16 04:12:05] INFO engine_base.py:241: Estimated KVCacheConfig "max_total_sequence_length": 13445.
[2024-04-16 04:12:05] INFO engine_base.py:246: Estimated total single GPU memory usage: 10839.99 MB (Parameters: 3707.35 MB. KVCache: 6479.40 MB. Temporary buffer: 653.24 MB)
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/engine_base.py", line 602, in _background_loop
self._ffi["run_background_loop"]()
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: TVMError: Unsupported layout: 0
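For reference, the crash is triggered as soon as any completion request reaches the server. Something along these lines reproduces it against the OpenAI-compatible endpoint that `mlc_llm serve` exposes; the payload is only illustrative:

```python
# Illustrative client request against the server started above
# (http://0.0.0.0:8000). The /v1/chat/completions route is the
# OpenAI-compatible API served by `mlc_llm serve`; the model id and
# message content here are just example values.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "HF://mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC",
        "messages": [{"role": "user", "content": "What's the weather in Boston?"}],
    },
    timeout=60,
)
print(resp.status_code, resp.text)
```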
Thank you @Sing-Li for checking again. This issue https://github.com/mlc-ai/mlc-llm/issues/2121#issuecomment-2049258529 also reports a similar error. We will look into it.
Hi @Sing-Li @ollmer, we have fixed this issue in the latest pip package. Please update the packages and try again, thank you!