[Bug] WSL2 Ubuntu RTX 3060 CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU
🐛 Bug
When attempting to run Llama 2 7B locally, I receive the following error: CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU
To Reproduce
Steps to reproduce the behavior:
- Use WSL2 with Ubuntu
- Set up CUDA for the RTX 3060 using the CUDA 12.2 toolkit
- Follow the installation instructions for CUDA 12.2 on Linux: https://llm.mlc.ai/docs/install/mlc_llm.html
- Follow the getting started guide: https://llm.mlc.ai/docs/
- Run the sample code provided in the getting started guide (see the sketch below)
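For reference, a minimal sketch of what initial-run-local.py does, reconstructed from the getting started guide and the traceback below; the model directory and library path are taken from the log and may differ on other setups:

```python
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

# Paths as reported in the log below; adjust to the local checkout.
cm = ChatModule(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    model_lib_path="dist/prebuilt_libs/Llama-2-7b-chat-hf/Llama-2-7b-chat-hf-q4f16_1-cuda.so",
)

# The CUDA error is raised inside this call, during the prefill step.
cm.generate(
    prompt="What is the meaning of life?",
    progress_callback=StreamToStdout(callback_interval=2),
)
```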
Full error output:
(mlc-chat-env) (base) akagi@BLD:~/mlc-llm-testing$ /home/akagi/miniconda3/envs/mlc-chat-env/bin/python /home/akagi/mlc-llm-testing/initial-run-local.py
[2024-02-29 12:34:11] INFO auto_device.py:76: Found device: cuda:0
[2024-02-29 12:34:12] INFO auto_device.py:85: Not found device: rocm:0
[2024-02-29 12:34:13] INFO auto_device.py:85: Not found device: metal:0
[2024-02-29 12:34:15] INFO auto_device.py:85: Not found device: vulkan:0
[2024-02-29 12:34:16] INFO auto_device.py:85: Not found device: opencl:0
[2024-02-29 12:34:16] INFO auto_device.py:33: Using device: cuda:0
[2024-02-29 12:34:16] INFO chat_module.py:373: Using model folder: /home/akagi/mlc-llm-testing/dist/Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-02-29 12:34:16] INFO chat_module.py:374: Using mlc chat config: /home/akagi/mlc-llm-testing/dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json
[2024-02-29 12:34:16] INFO chat_module.py:516: Using library model: dist/prebuilt_libs/Llama-2-7b-chat-hf/Llama-2-7b-chat-hf-q4f16_1-cuda.so
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 182, in <module>
main()
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 176, in main
_report_memory_usage(metadata, cfg)
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 93, in _report_memory_usage
params_bytes, temp_func_bytes, kv_cache_bytes = _compute_memory_usage(metadata, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/mlc_chat/cli/model_metadata.py", line 87, in _compute_memory_usage
kv_cache_bytes = metadata["kv_cache_bytes"]
~~~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'kv_cache_bytes'
Traceback (most recent call last):
File "/home/akagi/mlc-llm-testing/initial-run-local.py", line 11, in <module>
cm.generate(prompt="What is the meaning of life?", progress_callback=StreamToStdout(callback_interval=2))
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/mlc_chat/chat_module.py", line 850, in generate
self._prefill(prompt, generation_config=generation_config)
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/mlc_chat/chat_module.py", line 1072, in _prefill
self._prefill_func(
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/home/akagi/miniconda3/envs/mlc-chat-env/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
3: _ZN3tvm7runtime13PackedFuncObj9ExtractorINS0_16PackedFuncSubObjIZNS0_6detail17PackFuncVoidAddr_ILi8ENS0_15CUDAWrappedFuncEEENS0_10PackedFuncET0_RKSt6vectorINS4_1
2: tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const [clone .isra.0]
1: tvm::runtime::CUDAModuleNode::GetFunc(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
0: _ZN3tvm7runtime6deta
File "/workspace/tvm/src/runtime/cuda/cuda_module.cc", line 110
CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU
Expected behavior
I expect the model to run on my GPU locally within WSL.
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
- Operating system (e.g. Ubuntu/Windows/MacOS/...): WSL - Ubuntu
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): RTX 3060
- How you installed MLC-LLM (conda, source): python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu122 mlc-ai-nightly-cu122
- How you installed TVM-Unity (pip, source): N/A - using prebuilt model
- Python version (e.g. 3.10): 3.11.8
- GPU driver version (if applicable): 551.61
- CUDA/cuDNN version (if applicable): 12.4
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): N/A
- Any other relevant information: CUDA toolkit is 12.2:
(mlc-chat-env) akagi@BLD:~/mlc-llm-testing$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Additional context
- I ran one of the CUDA samples to make sure this wasn't a CUDA configuration issue; it passed a simple test (a TVM-level check is sketched after this list):
(mlc-chat-venv) akagi@BLD:/usr/local/cuda-12.2/cuda-samples/Samples/0_Introduction/vectorAdd$ ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
- I am testing with Llama-2-7b-chat-hf-q4f16_1-MLC, as defined in the docs
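As a further check (a sketch, not part of the original report), the TVM runtime bundled with the mlc-ai-nightly-cu122 wheel can report whether it sees the GPU and which compute capability it has; CUDA_ERROR_NO_BINARY_FOR_GPU typically means the loaded library contains no binary for that architecture (the RTX 3060 is sm_86):

```python
import tvm

# Diagnostic sketch: confirm the bundled TVM runtime sees the CUDA device
# and print its compute capability (expected "8.6" for an RTX 3060).
dev = tvm.cuda(0)
print("device exists:", dev.exist)
print("compute capability:", dev.compute_version)
```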
Hi, I have run into the same issue. Does anyone have any idea about this?
Please try out the latest command in https://llm.mlc.ai/docs/get_started/quick_start.html
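For reference, the quick start linked above has since moved to the mlc_llm package and its OpenAI-style engine API. A rough sketch of that flow follows; the model identifier and exact field names are assumptions based on that page and may have changed:

```python
from mlc_llm import MLCEngine

# Model identifier is an assumption; check the quick start page for the
# currently recommended prebuilt model.
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# OpenAI-style chat completion, streamed chunk by chunk.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print()

engine.terminate()
```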