mlc-llm
dolly-v2-12b 3-bit: CUDA out of memory on my 3070 laptop card under WSL
mlc_chat_cli --model dolly-v2-12b_int3 --dtype float32
Use lib /root/mlcai/dist/dolly-v2-12b_int3/float32/dolly-v2-12b_int3_cuda_float32.so
Initializing the chat module...
Finish loading
You can use the following special commands:
  /help    print the special commands
  /exit    quit the cli
  /stats   print out the latest stats (token/sec)
  /reset   restart a fresh chat
Instruction: hello
Response: [13:55:51] /root/mlcai/relax/src/runtime/relax_vm/pooled_allocator.h:64: Warning: PooledAllocator got InternalError during allocation:
An error occurred during the execution of TVM. For more information, please see: https://tvm.apache.org/docs/errors.html
Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory
[13:55:51] /root/mlcai/relax/src/runtime/relax_vm/pooled_allocator.h:65: Warning: Trying to release all unused memory and reallocate...
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [13:55:51] /root/mlcai/relax/include/tvm/runtime/device_api.h:291: unknown type =0
Stack trace:
0: _ZN3tvm7runtime8relax_vm13MemoryManager
1: _ZN3tvm7runtime18SimpleObjAllocator7HandlerIN
2: tvm::runtime::relax_vm::VMAllocStorage(void*, tvm::runtime::ShapeTuple, long, DLDataType) [clone .cold.318]
3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::relax_vm::Storage (void*, tvm::runtime::ShapeTuple, long, DLDataType)>::AssignTypedLambda<tvm::runtime::relax_vm::Storage (*)(void*, tvm::runtime::ShapeTuple, long, DLDataType)>(tvm::runtime::relax_vm::Storage (*)(void*, tvm::runtime::ShapeTuple, long, DLDataType), std::__cxx11::basic_string<char, std::char_traits
My GPU has 8 GB of memory, which I think is enough to run this model. Here is the VRAM usage after loading; this runs on WSL2.

My build command is:

python build.py --model dolly-v2-12b --dtype float32 --target cuda --quantization-mode int3 --quantization-sym --quantization-storage-nbit 32 --max-seq-len 2048

together with this modification in the code to fit my card's architecture.
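For context, here is a rough back-of-envelope sketch of the VRAM footprint. The layer count and hidden size are assumptions based on dolly-v2-12b's GPT-NeoX (pythia-12b) config, and all numbers are illustrative rather than measured, but they hint at why 8 GB can be tight even with 3-bit weights when the rest of the pipeline stays float32:

```python
# Back-of-envelope VRAM estimate for dolly-v2-12b with int3 weights.
# Shapes are assumed from the GPT-NeoX/pythia-12b config, not measured.
params = 12e9                               # ~12B parameters
weights_gb = params * 3 / 8 / 1e9           # 3-bit quantized weights
print(f"int3 weights:     ~{weights_gb:.1f} GB")   # ~4.5 GB

layers, hidden, seq_len = 36, 5120, 2048    # --max-seq-len 2048
# K and V caches, 4 bytes per element at float32
kv_gb = 2 * layers * seq_len * hidden * 4 / 1e9
print(f"float32 KV cache: ~{kv_gb:.1f} GB")        # ~3.0 GB

print(f"subtotal (before activations and CUDA/WSL overhead): "
      f"~{weights_gb + kv_gb:.1f} GB of 8 GB")
```

On top of that subtotal come float32 activations, temporary buffers from the pooled allocator, and the CUDA context itself, which would plausibly push past 8 GB.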
Let me preface this by saying I have no idea what I’m talking about 😂
BUT… could it be because you’re using float32 instead of float16?
https://huggingface.co/databricks/dolly-v2-12b/discussions/18
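If that is the culprit, a rebuild with half precision might be worth a try. This is an untested guess that just swaps the dtype in the build command quoted above, keeping every other flag the same:

python build.py --model dolly-v2-12b --dtype float16 --target cuda --quantization-mode int3 --quantization-sym --quantization-storage-nbit 32 --max-seq-len 2048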
Quantization plays an important role in memory reduction if you want to run a larger model on consumer-class GPUs, so please turn it on :-)
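To put illustrative numbers on that, here is weight-only arithmetic for a ~12B-parameter model at different precisions (this ignores activations and the KV cache, so it is a lower bound, not a measurement):

```python
# Weight-only footprint of a ~12B-parameter model at various precisions
# (illustrative arithmetic only).
params = 12e9
for name, bits in [("float32", 32), ("float16", 16), ("int3", 3)]:
    print(f"{name:>7}: {params * bits / 8 / 1e9:5.1f} GB")
# float32:  48.0 GB
# float16:  24.0 GB
#    int3:   4.5 GB
```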