# Llama-3-8B inference fails with InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: storage allocation failure, attempted to allocate 513024 at offset 0 in region that is 163840 bytes

## 🐛 Bug
I am using the MLC container from jetson-containers with the Meta-Llama-3-8B-Instruct model. I ran the following build:

```
python3 -m mlc_llm.build \
    --model Meta-Llama-3-8B-Instruct-hf \
    --quantization q4f16_ft \
    --target cuda \
    --use-cuda-graph \
    --use-flash-attn-mqa \
    --sep-embed \
    --max-seq-len 8192 \
    --artifact-path /data/models/mlc/dist \
    --use-safetensors
```

The quantization completed without reporting any errors.
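As a first sanity check, it may help to confirm what the build actually baked into the artifact. A minimal sketch, assuming the params directory produced above contains the usual mlc-chat-config.json (field names can differ between mlc-llm versions, and the directory may carry a -ctx8192 suffix):

```python
# Sketch: inspect the config emitted next to the quantized weights.
# The path is assumed from the --artifact-path and model name above;
# adjust it if your layout differs.
import json

cfg_path = ("/data/models/mlc/dist/"
            "Meta-Llama-3-8B-Instruct-hf-q4f16_ft/params/mlc-chat-config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

# Fields relevant to this failure; fall back gracefully because key
# names vary across mlc-llm releases.
for key in ("model_lib", "max_window_size", "vocab_size", "conv_template"):
    print(f"{key} = {cfg.get(key, '<not present>')}")
```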
However, an error occurs when I run inference:

```
python3 /opt/mlc-llm/benchmark.py \
    --model /data/models/mlc/dist/Meta-Llama-3-8B-Instruct-hf-ctx8192/Meta-Llama-3-8B-Instruct-hf-q4f16_ft/params \
    --prompt "Can you tell me a joke about llamas?" \
    --max-new-tokens 128
```
The following error occurred:
```
Namespace(chat=False, max_new_tokens=128, max_num_prompts=None, model='/data/models/mlc/dist/Meta-Llama-3-8B-Instruct-hf-q4f16_ft/params', model_lib_path=None, prompt=['Can you tell me a joke about llamas?'], save='', streaming=False)
-- loading /data/models/mlc/dist/Meta-Llama-3-8B-Instruct-hf-q4f16_ft/params
PROMPT: Can you tell me a joke about llamas?
Traceback (most recent call last):
  File "/opt/mlc-llm/benchmark.py", line 135, in <module>
    print(cm.benchmark_generate(prompt=prompt, generate_length=args.max_new_tokens).strip())
  File "/usr/local/lib/python3.8/dist-packages/mlc_chat/chat_module.py", line 910, in benchmark_generate
    self._prefill(prompt)
  File "/usr/local/lib/python3.8/dist-packages/mlc_chat/chat_module.py", line 997, in _prefill
    self._prefill_func(
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.8/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  [bt] (8) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)+0x230) [0xffff6c51f6c8]
  [bt] (7) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()+0x210) [0xffff6c51dd58]
  [bt] (6) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)+0x5e4) [0xffff6c51e5bc]
  [bt] (5) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x7c) [0xffff6c51c9fc]
  [bt] (4) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::NDArray (tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)>::AssignTypedLambda<tvm::runtime::Registry::set_body_method<tvm::runtime::memory::Storage, tvm::runtime::memory::StorageObj, tvm::runtime::NDArray, long, tvm::runtime::ShapeTuple, DLDataType, void>(tvm::runtime::NDArray (tvm::runtime::memory::StorageObj::*)(long, tvm::runtime::ShapeTuple, DLDataType))::{lambda(tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)#1}>(tvm::runtime::Registry::set_body_method<tvm::runtime::memory::Storage, tvm::runtime::memory::StorageObj, tvm::runtime::NDArray, long, tvm::runtime::ShapeTuple, DLDataType, void>(tvm::runtime::NDArray (tvm::runtime::memory::StorageObj::*)(long, tvm::runtime::ShapeTuple, DLDataType))::{lambda(tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)+0x10) [0xffff6c4ea638]
  [bt] (3) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::TypedPackedFunc<tvm::runtime::NDArray (tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)>::AssignTypedLambda<tvm::runtime::Registry::set_body_method<tvm::runtime::memory::Storage, tvm::runtime::memory::StorageObj, tvm::runtime::NDArray, long, tvm::runtime::ShapeTuple, DLDataType, void>(tvm::runtime::NDArray (tvm::runtime::memory::StorageObj::*)(long, tvm::runtime::ShapeTuple, DLDataType))::{lambda(tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)#1}>(tvm::runtime::Registry::set_body_method<tvm::runtime::memory::Storage, tvm::runtime::memory::StorageObj, tvm::runtime::NDArray, long, tvm::runtime::ShapeTuple, DLDataType, void>(tvm::runtime::NDArray (tvm::runtime::memory::StorageObj::*)(long, tvm::runtime::ShapeTuple, DLDataType))::{lambda(tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const, tvm::runtime::TVMRetValue) const+0x27c) [0xffff6c4ea374]
  [bt] (2) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::memory::StorageObj::AllocNDArray(long, tvm::runtime::ShapeTuple, DLDataType)+0x3a8) [0xffff6c4998c8]
  [bt] (1) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x78) [0xffff6a0edf58]
  [bt] (0) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff6c4966f0]
  File "/opt/mlc-llm/3rdparty/tvm/src/runtime/memory/memory_manager.cc", line 108
InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: storage allocation failure, attempted to allocate 513024 at offset 0 in region that is 163840bytes
```
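Decoding the two sizes in the message may help localize the bug (my own arithmetic, not something from the logs): 513024 bytes is exactly 128256 × 4, i.e. one float32 row over the Llama-3 vocabulary, while the pre-planned storage region holds only 163840 bytes, which suggests the static memory plan compiled into the model library was sized for a much smaller buffer:

```python
# Back-of-envelope check on the numbers in the InternalError.
LLAMA3_VOCAB = 128256               # Llama-3 tokenizer vocabulary size
requested = LLAMA3_VOCAB * 4        # one fp32 logits row
print(requested)                    # 513024 -> the "attempted to allocate" size
print(163840 / requested)           # ~0.32 -> the region is roughly 3x too small
```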
I tried different values of --max-seq-len; it fails the same way. It's worth mentioning that when I quantize Meta-Llama-2-7b and then run inference, there are no errors. Only Meta-Llama-3-8B and Meta-Llama-3-8B-Instruct produce the error above.
What should I do?
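Since the Namespace above shows model_lib_path=None, one way to narrow it down is to bypass benchmark.py and pin the compiled library explicitly. A sketch against the ChatModule API visible in the traceback; the .so filename is my assumption based on the usual artifact layout, so check the dist directory for the real name:

```python
# Minimal repro outside benchmark.py, with the model library pinned
# explicitly instead of letting ChatModule resolve model_lib_path=None.
from mlc_chat import ChatModule

ARTIFACT = ("/data/models/mlc/dist/Meta-Llama-3-8B-Instruct-hf-ctx8192/"
            "Meta-Llama-3-8B-Instruct-hf-q4f16_ft")

cm = ChatModule(
    model=f"{ARTIFACT}/params",
    # Assumed filename -- verify against the actual build output.
    model_lib_path=f"{ARTIFACT}/Meta-Llama-3-8B-Instruct-hf-q4f16_ft-cuda.so",
)
print(cm.benchmark_generate(
    prompt="Can you tell me a joke about llamas?",
    generate_length=128,
).strip())
```

If this fails identically with the library pinned, the problem is in the compiled artifact itself rather than in how benchmark.py resolves it.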
Thanks.
## Environment
- Platform: Jetson Orin