[Bug] WSL2 Ubuntu RTX 3060 CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU
🐛 Bug
When attempting to run Llama 2 7B locally, I receive the following error: CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU
To Reproduce
Steps to reproduce the behavior:
- Use WSL2 with Ubuntu
- Set up CUDA for the RTX 3060 using the CUDA 12.2 toolkit
- Follow the installation instructions for CUDA 12.2 on Linux: https://llm.mlc.ai/docs/install/mlc_llm.html
- Follow the getting started guide: https://llm.mlc.ai/docs/
- Run the sample code provided in the getting started guide (see the sketch below)
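For reference, a minimal sketch of what initial-run-local.py does, reconstructed from the getting started guide and the traceback below; the model directory and library path are taken from the log and may differ on other setups:

```python
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

# Paths as reported in the log below; adjust to the local checkout.
cm = ChatModule(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    model_lib_path="dist/prebuilt_libs/Llama-2-7b-chat-hf/Llama-2-7b-chat-hf-q4f16_1-cuda.so",
)

# The CUDA error is raised inside this call, during the prefill step.
cm.generate(
    prompt="What is the meaning of life?",
    progress_callback=StreamToStdout(callback_interval=2),
)
```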
Full error output:
(mlc-chat-env) (base) akagi@BLD:~/mlc-llm-testing$ /home/akagi/miniconda3/envs/mlc-chat-env/bin/python /home/akagi/mlc-llm-testing/initial-run-local.py
[2024-02-29 12:34:11] INFO auto_device.py:76: Found device: cuda:0
[2024-02-29 12:34:12] INFO auto_device.py:85: Not found device: rocm:0
[2024-02-29 12:34:13] INFO auto_device.py:85: Not found device: metal:0
[2024-02-29 12:34:15] INFO auto_device.py:85: Not found device: vulkan:0
[2024-02-29 12:34:16] INFO auto_device.py:85: Not found device: opencl:0
[2024-02-29 12:34:16] INFO auto_device.py:33: Using device: cuda:0
[2024-02-29 12:34:16] INFO chat_module.py:373: Using model folder: /home/akagi/mlc-llm-testing/dist/Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-02-29 12:34:16] INFO chat_module.py:374: Using mlc chat config: /home/akagi/mlc-llm-testing/dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json
[2024-02-29 12:34:16] INFO chat_module.py:516: Using library model: dist/prebuilt_libs/Llama-2-7b-chat-hf/Llama-2-7b-chat-hf-q4f16_1-cuda.so
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 182, in <module>
main()
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 176, in main
_report_memory_usage(metadata, cfg)
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 93, in _report_memory_usage
params_bytes, temp_func_bytes, kv_cache_bytes = _compute_memory_usage(metadata, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 87, in _compute_memory_usage
kv_cache_bytes = metadata["kv_cache_bytes"]
~~~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'kv_cache_bytes'
Traceback (most recent call last):
File "/home/akagi/mlc-llm-testing/initial-run-local.py", line 11, in <module>
cm.generate(prompt="What is the meaning of life?", progress_callback=StreamToStdout(callback_interval=2))
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/mlc_chat/chat_module.py", line 850, in generate
self._prefill(prompt, generation_config=generation_config)
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/mlc_chat/chat_module.py", line 1072, in _prefill
self._prefill_func(
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
3: _ZN3tvm7runtime13PackedFuncObj9ExtractorINS0_16PackedFuncSubObjIZNS0_6detail17PackFuncVoidAddr_ILi8ENS0_15CUDAWrappedFuncEEENS0_10PackedFuncET0_RKSt6vectorINS4_1
2: tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const [clone .isra.0]
1: tvm::runtime::CUDAModuleNode::GetFunc(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
0: _ZN3tvm7runtime6deta
File "/workspace/tvm/src/runtime/cuda/cuda_module.cc", line 110
CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU
Expected behavior
I expect the model to run on my GPU locally within WSL.
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
- Operating system (e.g. Ubuntu/Windows/MacOS/...): WSL - Ubuntu
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): RTX 3060
- How you installed MLC-LLM (conda, source): python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu122 mlc-ai-nightly-cu122
- How you installed TVM-Unity (pip, source): N/A - using prebuilt model
- Python version (e.g. 3.10): 3.11.8
- GPU driver version (if applicable): 551.61
- CUDA/cuDNN version (if applicable): 12.4
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): N/A
- Any other relevant information: CUDA toolkit is 12.2:
(mlc-chat-env) akagi@BLD:~/mlc-llm-testing$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Additional context
- I ran one of the CUDA samples to make sure this wasn't a CUDA configuration issue; it passed a simple test (a TVM-level check is sketched after this list):
(mlc-chat-venv) akagi@BLD:/usr/local/cuda-12.2/cuda-samples/Samples/0_Introduction/vectorAdd$ ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
- I am testing with Llama-2-7b-chat-hf-q4f16_1-MLC, as defined in the docs
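As a further check (a sketch, not part of the original report), the TVM runtime bundled with the mlc-ai-nightly-cu122 wheel can report whether it sees the GPU and which compute capability it has; CUDA_ERROR_NO_BINARY_FOR_GPU typically means the loaded library contains no binary for that architecture (the RTX 3060 is sm_86):

```python
import tvm

# Diagnostic sketch: confirm the bundled TVM runtime sees the CUDA device
# and print its compute capability (expected "8.6" for an RTX 3060).
dev = tvm.cuda(0)
print("device exists:", dev.exist)
print("compute capability:", dev.compute_version)
```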
Hi, I have run into the same issue. Does anyone have any idea about this?
Please try out the latest command in https://llm.mlc.ai/docs/get_started/quick_start.html
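For reference, the quick start linked above has since moved to the mlc_llm package and its OpenAI-style engine API. A rough sketch of that flow follows; the model identifier and exact field names are assumptions based on that page and may have changed:

```python
from mlc_llm import MLCEngine

# Model identifier is an assumption; check the quick start page for the
# currently recommended prebuilt model.
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# OpenAI-style chat completion, streamed chunk by chunk.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print()

engine.terminate()
```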