GPU memory utilization error: I have seen the same error on 12 GB and 8 GB RAM devices for Llama 3.1 and Gemma 9B with q4f16_1 quantization.
mlc-llm/cpp/serve/threaded_engine.cc:283: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 4762.535 MB, which is less than the sum of model weight size (4958.468 MB) and temporary buffer size (609.312 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing --tensor-parallel-shards $NGPU to mlc_llm gen_config or use quantization.
3. If the temporary buffer size is too large, please use a smaller --prefill-chunk-size in mlc_llm gen_config.
2024-08-05 18:08:52.206 12017-12060 AndroidRuntime ai.mlc.mlcchat E FATAL EXCEPTION: Thread-8
Process: ai.mlc.mlcchat, PID: 12017
org.apache.tvm.Base$TVMError: TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 4762.535 MB, which is less than the sum of model weight size (4958.468 MB) and temporary buffer size (609.312 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing --tensor-parallel-shards $NGPU to mlc_llm gen_config or use quantization.
3. If the temporary buffer size is too large, please use a smaller --prefill-chunk-size in mlc_llm gen_config.
Stack trace:
File "/Downloads/mlc-llm/cpp/serve/threaded_engine.cc", line 283
at org.apache.tvm.Base.checkCall(Base.java:173)
at org.apache.tvm.Function.invoke(Function.java:130)
at ai.mlc.mlcllm.JSONFFIEngine.runBackgroundLoop(JSONFFIEngine.java:64)
at ai.mlc.mlcllm.MLCEngine$backgroundWorker$1.invoke(MLCEngine.kt:42)
at ai.mlc.mlcllm.MLCEngine$backgroundWorker$1.invoke(MLCEngine.kt:40)
at ai.mlc.mlcllm.BackgroundWorker$start$1.invoke(MLCEngine.kt:19)
at ai.mlc.mlcllm.BackgroundWorker$start$1.invoke(MLCEngine.kt:18)
at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)
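The numbers in the log already show which of the three suggested fixes can help here: the model weights alone (4958.468 MB) exceed the available single-GPU memory (4762.535 MB), so shrinking the temporary buffer with a smaller `--prefill-chunk-size` cannot make the model fit; only a smaller model or a more aggressive quantization can. A minimal sketch of the check the engine performs (values taken from the log above; the function name is mine, not MLC-LLM's):

```python
def fits_in_gpu(available_mb, weights_mb, temp_buffer_mb):
    """Mirrors the check in threaded_engine.cc: the model fits only if
    weights plus temporary buffers do not exceed available GPU memory."""
    return available_mb >= weights_mb + temp_buffer_mb

available = 4762.535   # single-GPU memory reported in the log
weights = 4958.468     # Llama 3.1 q4f16_1 weight size from the log
temp_buffer = 609.312  # temporary buffer at the current prefill chunk size

# Fails even with a zero-sized temporary buffer: the weights alone are too big.
print(fits_in_gpu(available, weights, temp_buffer))  # False
print(fits_in_gpu(available, weights, 0.0))          # False
```

So on these devices the practical options are a smaller model or a quantization with fewer bits per weight, not a prefill-chunk-size tweak.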
Not sure, but some Android vendor OSes only expose a limited amount of memory to the GPU rather than all of the DRAM.
@Hzfengsy If enough GPU memory is not available, can we offload part of the load to the CPU?
On Android devices, the GPU memory available to OpenCL is usually not all of the phone's DRAM. You can try splitting the model and compiling it in pieces. I tried the CPU runtime, but I couldn't optimize it well.
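Given that limit, a rough way to pick a model that fits in the ~4.7 GB the driver exposes is to estimate the q4f16_1 weight size: 4 bits per weight plus one fp16 scale per group of weights, roughly 0.56 bytes per parameter at a group size of 32. A back-of-the-envelope sketch (the per-parameter figure and group size are my approximations, not official numbers; real sizes run higher, e.g. the log reports 4958 MB for Llama 3.1 8B, so leave a few hundred MB of headroom):

```python
def q4f16_1_weight_mb(num_params, group_size=32):
    """Approximate q4f16_1 weight size in MB: 4 bits per weight plus one
    fp16 (2-byte) scale per group of `group_size` weights (approximation)."""
    bytes_per_param = 0.5 + 2.0 / group_size  # 0.5625 bytes at group size 32
    return num_params * bytes_per_param / (1024 ** 2)

# Hypothetical parameter counts for illustration.
for name, params in [("Llama 3.1 8B", 8.0e9),
                     ("Gemma 2 9B", 9.2e9),
                     ("Phi-3 mini 3.8B", 3.8e9)]:
    print(f"{name}: ~{q4f16_1_weight_mb(params):.0f} MB")
```

By this estimate an 8B–9B model sits right at or above the ~4.7 GB limit once real overheads are added, while a ~4B-parameter model leaves comfortable room for the temporary buffers.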