mlc-llm
dolly-v2-12b 3-bit: CUDA out of memory on my 3070 laptop card under WSL
mlc_chat_cli --model dolly-v2-12b_int3 --dtype float32
Use lib /root/mlcai/dist/dolly-v2-12b_int3/float32/dolly-v2-12b_int3_cuda_float32.so
Initializing the chat module...
Finish loading
You can use the following special commands:
  /help    print the special commands
  /exit    quit the cli
  /stats   print out the latest stats (token/sec)
  /reset   restart a fresh chat
Instruction: hello
Response: [13:55:51] /root/mlcai/relax/src/runtime/relax_vm/pooled_allocator.h:64: Warning: PooledAllocator got InternalError during allocation:
An error occurred during the execution of TVM. For more information, please see: https://tvm.apache.org/docs/errors.html
Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory
[13:55:51] /root/mlcai/relax/src/runtime/relax_vm/pooled_allocator.h:65: Warning: Trying to release all unused memory and reallocate...
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [13:55:51] /root/mlcai/relax/include/tvm/runtime/device_api.h:291: unknown type =0
Stack trace:
0: _ZN3tvm7runtime8relax_vm13MemoryManager
1: _ZN3tvm7runtime18SimpleObjAllocator7HandlerIN
2: tvm::runtime::relax_vm::VMAllocStorage(void*, tvm::runtime::ShapeTuple, long, DLDataType) [clone .cold.318]
3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::relax_vm::Storage (void*, tvm::runtime::ShapeTuple, long, DLDataType)>::AssignTypedLambda<tvm::runtime::relax_vm::Storage (*)(void*, tvm::runtime::ShapeTuple, long, DLDataType)>(tvm::runtime::relax_vm::Storage (*)(void*, tvm::runtime::ShapeTuple, long, DLDataType), std::__cxx11::basic_string<char, std::char_traits
My GPU has 8 GB of memory, which I think is enough to run this model. Here is the VRAM usage after loading; this runs on WSL2.

My build command is:

python build.py --model dolly-v2-12b --dtype float32 --target cuda --quantization-mode int3 --quantization-sym --quantization-storage-nbit 32 --max-seq-len 2048

together with this modification in the code to fit my card's architecture.
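For context, here is a rough back-of-envelope sketch of the VRAM footprint. The layer count and hidden size are assumptions based on dolly-v2-12b's GPT-NeoX (pythia-12b) config, and all numbers are illustrative rather than measured, but they hint at why 8 GB can be tight even with 3-bit weights when the rest of the pipeline stays float32:

```python
# Back-of-envelope VRAM estimate for dolly-v2-12b with int3 weights.
# Shapes are assumed from the GPT-NeoX/pythia-12b config, not measured.
params = 12e9                               # ~12B parameters
weights_gb = params * 3 / 8 / 1e9           # 3-bit quantized weights
print(f"int3 weights:     ~{weights_gb:.1f} GB")   # ~4.5 GB

layers, hidden, seq_len = 36, 5120, 2048    # --max-seq-len 2048
# K and V caches, 4 bytes per element at float32
kv_gb = 2 * layers * seq_len * hidden * 4 / 1e9
print(f"float32 KV cache: ~{kv_gb:.1f} GB")        # ~3.0 GB

print(f"subtotal (before activations and CUDA/WSL overhead): "
      f"~{weights_gb + kv_gb:.1f} GB of 8 GB")
```

On top of that subtotal come float32 activations, temporary buffers from the pooled allocator, and the CUDA context itself, which would plausibly push past 8 GB.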
Let me preface this by saying I have no idea what I’m talking about 😂
BUT… could it be because you’re using float32 instead of float16?
https://huggingface.co/databricks/dolly-v2-12b/discussions/18
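If that is the culprit, a rebuild with half precision might be worth a try. This is an untested guess that just swaps the dtype in the build command quoted above, keeping every other flag the same:

python build.py --model dolly-v2-12b --dtype float16 --target cuda --quantization-mode int3 --quantization-sym --quantization-storage-nbit 32 --max-seq-len 2048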
Quantization plays an important role in memory reduction if you want to run a larger model on consumer-class GPUs, so please turn it on :-)
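To put illustrative numbers on that, here is weight-only arithmetic for a ~12B-parameter model at different precisions (this ignores activations and the KV cache, so it is a lower bound, not a measurement):

```python
# Weight-only footprint of a ~12B-parameter model at various precisions
# (illustrative arithmetic only).
params = 12e9
for name, bits in [("float32", 32), ("float16", 16), ("int3", 3)]:
    print(f"{name:>7}: {params * bits / 8 / 1e9:5.1f} GB")
# float32:  48.0 GB
# float16:  24.0 GB
#    int3:   4.5 GB
```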