# Llama-3-8B inference fails with InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: storage allocation failure, attempted to allocate 513024 at offset 0 in region that is 163840 bytes

## 🐛 Bug
I am using the MLC container from jetson-containers with the Meta-Llama-3-8B-Instruct model. I ran the following build:

```
python3 -m mlc_llm.build \
    --model Meta-Llama-3-8B-Instruct-hf \
    --quantization q4f16_ft \
    --target cuda \
    --use-cuda-graph \
    --use-flash-attn-mqa \
    --sep-embed \
    --max-seq-len 8192 \
    --artifact-path /data/models/mlc/dist \
    --use-safetensors
```

The quantization completed without reporting any errors.
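As a first sanity check, it may help to confirm what the build actually baked into the artifact. A minimal sketch, assuming the params directory produced above contains the usual mlc-chat-config.json (field names can differ between mlc-llm versions, and the directory may carry a -ctx8192 suffix):

```python
# Sketch: inspect the config emitted next to the quantized weights.
# The path is assumed from the --artifact-path and model name above;
# adjust it if your layout differs.
import json

cfg_path = ("/data/models/mlc/dist/"
            "Meta-Llama-3-8B-Instruct-hf-q4f16_ft/params/mlc-chat-config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

# Fields relevant to this failure; fall back gracefully because key
# names vary across mlc-llm releases.
for key in ("model_lib", "max_window_size", "vocab_size", "conv_template"):
    print(f"{key} = {cfg.get(key, '<not present>')}")
```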
However, an error occurs when I run inference:

```
python3 /opt/mlc-llm/benchmark.py \
    --model /data/models/mlc/dist/Meta-Llama-3-8B-Instruct-hf-ctx8192/Meta-Llama-3-8B-Instruct-hf-q4f16_ft/params \
    --prompt "Can you tell me a joke about llamas?" \
    --max-new-tokens 128
```
The following error occurred:
```
Namespace(chat=False, max_new_tokens=128, max_num_prompts=None, model='/data/models/mlc/dist/Meta-Llama-3-8B-Instruct-hf-q4f16_ft/params', model_lib_path=None, prompt=['Can you tell me a joke about llamas?'], save='', streaming=False)
-- loading /data/models/mlc/dist/Meta-Llama-3-8B-Instruct-hf-q4f16_ft/params
PROMPT: Can you tell me a joke about llamas?
Traceback (most recent call last):
  File "/opt/mlc-llm/benchmark.py", line 135, in <module>
    print(cm.benchmark_generate(prompt=prompt, generate_length=args.max_new_tokens).strip())
  File "/usr/local/lib/python3.8/dist-packages/mlc_chat/chat_module.py", line 910, in benchmark_generate
    self._prefill(prompt)
  File "/usr/local/lib/python3.8/dist-packages/mlc_chat/chat_module.py", line 997, in _prefill
    self._prefill_func(
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.8/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  [bt] (8) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)+0x230) [0xffff6c51f6c8]
  [bt] (7) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()+0x210) [0xffff6c51dd58]
  [bt] (6) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)+0x5e4) [0xffff6c51e5bc]
  [bt] (5) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x7c) [0xffff6c51c9fc]
  [bt] (4) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::NDArray (tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)>::AssignTypedLambda<tvm::runtime::Registry::set_body_method<tvm::runtime::memory::Storage, tvm::runtime::memory::StorageObj, tvm::runtime::NDArray, long, tvm::runtime::ShapeTuple, DLDataType, void>(tvm::runtime::NDArray (tvm::runtime::memory::StorageObj::*)(long, tvm::runtime::ShapeTuple, DLDataType))::{lambda(tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)#1}>(tvm::runtime::Registry::set_body_method<tvm::runtime::memory::Storage, tvm::runtime::memory::StorageObj, tvm::runtime::NDArray, long, tvm::runtime::ShapeTuple, DLDataType, void>(tvm::runtime::NDArray (tvm::runtime::memory::StorageObj::*)(long, tvm::runtime::ShapeTuple, DLDataType))::{lambda(tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)+0x10) [0xffff6c4ea638]
  [bt] (3) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::TypedPackedFunc<tvm::runtime::NDArray (tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)>::AssignTypedLambda<tvm::runtime::Registry::set_body_method<tvm::runtime::memory::Storage, tvm::runtime::memory::StorageObj, tvm::runtime::NDArray, long, tvm::runtime::ShapeTuple, DLDataType, void>(tvm::runtime::NDArray (tvm::runtime::memory::StorageObj::*)(long, tvm::runtime::ShapeTuple, DLDataType))::{lambda(tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)#1}>(tvm::runtime::Registry::set_body_method<tvm::runtime::memory::Storage, tvm::runtime::memory::StorageObj, tvm::runtime::NDArray, long, tvm::runtime::ShapeTuple, DLDataType, void>(tvm::runtime::NDArray (tvm::runtime::memory::StorageObj::*)(long, tvm::runtime::ShapeTuple, DLDataType))::{lambda(tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const, tvm::runtime::TVMRetValue) const+0x27c) [0xffff6c4ea374]
  [bt] (2) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::memory::StorageObj::AllocNDArray(long, tvm::runtime::ShapeTuple, DLDataType)+0x3a8) [0xffff6c4998c8]
  [bt] (1) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x78) [0xffff6a0edf58]
  [bt] (0) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff6c4966f0]
  File "/opt/mlc-llm/3rdparty/tvm/src/runtime/memory/memory_manager.cc", line 108
InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: storage allocation failure, attempted to allocate 513024 at offset 0 in region that is 163840bytes
```
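Decoding the two sizes in the message may help localize the bug (my own arithmetic, not something from the logs): 513024 bytes is exactly 128256 × 4, i.e. one float32 row over the Llama-3 vocabulary, while the pre-planned storage region holds only 163840 bytes, which suggests the static memory plan compiled into the model library was sized for a much smaller buffer:

```python
# Back-of-envelope check on the numbers in the InternalError.
LLAMA3_VOCAB = 128256               # Llama-3 tokenizer vocabulary size
requested = LLAMA3_VOCAB * 4        # one fp32 logits row
print(requested)                    # 513024 -> the "attempted to allocate" size
print(163840 / requested)           # ~0.32 -> the region is roughly 3x too small
```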
I tried different values of --max-seq-len; it fails the same way. It's worth mentioning that when I quantize Meta-Llama-2-7b and then run inference, there are no errors. Only Meta-Llama-3-8B and Meta-Llama-3-8B-Instruct produce the error above.
What should I do?
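Since the Namespace above shows model_lib_path=None, one way to narrow it down is to bypass benchmark.py and pin the compiled library explicitly. A sketch against the ChatModule API visible in the traceback; the .so filename is my assumption based on the usual artifact layout, so check the dist directory for the real name:

```python
# Minimal repro outside benchmark.py, with the model library pinned
# explicitly instead of letting ChatModule resolve model_lib_path=None.
from mlc_chat import ChatModule

ARTIFACT = ("/data/models/mlc/dist/Meta-Llama-3-8B-Instruct-hf-ctx8192/"
            "Meta-Llama-3-8B-Instruct-hf-q4f16_ft")

cm = ChatModule(
    model=f"{ARTIFACT}/params",
    # Assumed filename -- verify against the actual build output.
    model_lib_path=f"{ARTIFACT}/Meta-Llama-3-8B-Instruct-hf-q4f16_ft-cuda.so",
)
print(cm.benchmark_generate(
    prompt="Can you tell me a joke about llamas?",
    generate_length=128,
).strip())
```

If this fails identically with the library pinned, the problem is in the compiled artifact itself rather than in how benchmark.py resolves it.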
Thanks.
## Environment
- Platform: Jetson Orin