
GPU memory utilization error: I have seen the same error on 12 GB and 8 GB RAM devices for Llama 3.1 and Gemma 9B with q4f16_1 quantization

Open Vinaysukhesh98 opened this issue 1 year ago • 3 comments

mlc-llm/cpp/serve/threaded_engine.cc:283: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 4762.535 MB, which is less than the sum of model weight size (4958.468 MB) and temporary buffer size (609.312 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing --tensor-parallel-shards $NGPU to mlc_llm gen_config or use quantization.
3. If the temporary buffer size is too large, please use a smaller --prefill-chunk-size in mlc_llm gen_config.

2024-08-05 18:08:52.206 12017-12060 AndroidRuntime ai.mlc.mlcchat E FATAL EXCEPTION: Thread-8
Process: ai.mlc.mlcchat, PID: 12017
org.apache.tvm.Base$TVMError: TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 4762.535 MB, which is less than the sum of model weight size (4958.468 MB) and temporary buffer size (609.312 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing --tensor-parallel-shards $NGPU to mlc_llm gen_config or use quantization.
3. If the temporary buffer size is too large, please use a smaller --prefill-chunk-size in mlc_llm gen_config.
Stack trace:
File "/Downloads/mlc-llm/cpp/serve/threaded_engine.cc", line 283
    at org.apache.tvm.Base.checkCall(Base.java:173)
    at org.apache.tvm.Function.invoke(Function.java:130)
    at ai.mlc.mlcllm.JSONFFIEngine.runBackgroundLoop(JSONFFIEngine.java:64)
    at ai.mlc.mlcllm.MLCEngine$backgroundWorker$1.invoke(MLCEngine.kt:42)
    at ai.mlc.mlcllm.MLCEngine$backgroundWorker$1.invoke(MLCEngine.kt:40)
    at ai.mlc.mlcllm.BackgroundWorker$start$1.invoke(MLCEngine.kt:19)
    at ai.mlc.mlcllm.BackgroundWorker$start$1.invoke(MLCEngine.kt:18)
    at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)
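For context, the check that fails in threaded_engine.cc can be sketched in Python using the numbers from the log above. This is a simplified illustration, not the engine's actual C++ code, and the function name `fits_in_gpu` is hypothetical:

```python
def fits_in_gpu(available_mb: float, weight_mb: float, temp_buffer_mb: float) -> bool:
    """Return True if model weights plus temporary buffers fit in available GPU memory.

    Simplified sketch of the memory check performed by mlc-llm's threaded engine.
    """
    return available_mb >= weight_mb + temp_buffer_mb

# Numbers from the crash log: 4762.535 MB available vs
# 4958.468 MB (weights) + 609.312 MB (temporary buffers) required.
print(fits_in_gpu(4762.535, 4958.468, 609.312))  # False: insufficient memory
```

The weights alone (4958.468 MB) already exceed the 4762.535 MB the OS exposes to the GPU here, so raising "gpu_memory_utilization" cannot help on its own; the weight footprint itself has to shrink.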

Vinaysukhesh98 avatar Aug 05 '24 12:08 Vinaysukhesh98

Not sure, but some Android vendor OSes only expose a limited amount of memory to the GPU rather than all of the DRAM.

Hzfengsy avatar Aug 05 '24 14:08 Hzfengsy

@Hzfengsy If enough GPU memory is not available, can we offload part of the load to the CPU?

Vinaysukhesh98 avatar Aug 06 '24 04:08 Vinaysukhesh98

On Android devices, the GPU memory available to OpenCL is usually not all of the phone's DRAM. You can try to split the model and compile it in pieces. I tried the CPU runtime but I couldn't optimize it well.

shifeiwen avatar Aug 06 '24 05:08 shifeiwen
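Following suggestion 3 in the error message, the temporary buffer can be shrunk by regenerating the model config with a smaller --prefill-chunk-size before repackaging the app. A rough sketch, assuming a Gemma 2 9B checkpoint; the model path, conversation template name, chunk size, and output directory are placeholders, while the flag names come from the error message itself:

```shell
# Sketch: regenerate the config with a smaller prefill chunk size to reduce
# the temporary buffer (paths and template name below are assumptions).
mlc_llm gen_config ./dist/models/gemma-2-9b-it \
    --quantization q4f16_1 \
    --conv-template gemma_instruction \
    --prefill-chunk-size 256 \
    -o ./dist/gemma-2-9b-it-q4f16_1-MLC
```

Note this only reduces the 609.312 MB temporary buffer; since the q4f16_1 weights (4958.468 MB) alone exceed the available GPU memory in this log, a smaller model or more aggressive quantization is likely also needed on these devices.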