
GPU Usage dropping before completion ends


I have been using the new Command-R+ model in 4-bit mode and consistently observe a drop in GPU utilization immediately after prompt evaluation, right as generation begins. This leads to significantly reduced performance.

During evaluation: [screenshot of GPU usage]

During generation, the drop occurs right before the first token is predicted (i.e. "<PAD>"): [screenshot of GPU usage]

Here's my setup:

- Machine: Apple M2 Ultra (8E + 16P CPU cores, 60-core GPU), 192 GB RAM
- macOS 14.3 (build 23D56)

I have tried with and without raising the wired memory limit:

```
sudo sysctl iogpu.wired_lwm_mb=150000
```

I have tried with and without disabling the Metal buffer cache:

```python
mx.metal.set_cache_limit(0)
```
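For context, this is roughly what I'm running (a minimal sketch; the model repo name and prompt are placeholders, and it assumes the standard `mlx_lm` `load`/`generate` API):

```python
import time

import mlx.core as mx
from mlx_lm import load, generate

# Optionally disable the Metal buffer cache, as mentioned above.
mx.metal.set_cache_limit(0)

# Placeholder repo name for the 4-bit Command-R+ conversion.
model, tokenizer = load("mlx-community/c4ai-command-r-plus-4bit")

prompt = "Explain the difference between prompt evaluation and generation."

start = time.time()
# verbose=True reports prompt-eval and generation tokens/sec separately,
# which is where the slowdown after the first token shows up.
text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
print(f"total: {time.time() - start:.1f}s")
```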

Any help would be welcome, because at the moment I am only able to use the llama.cpp implementation of Command-R+, which works without any issues.

jeanromainroy · Apr 09 '24 17:04