[BUG] Neural network got ~15% slower since MLX v0.23.1
Describe the bug We have a neural network that we want to release for inference. The latest version of MLX is around 15% slower than previous versions on our whole model. https://github.com/ml-explore/mlx/pull/1950 was a first fix, but we still see a slowdown. I tried again to reduce the network to track down the root cause, and arrived at a minimal reproducible example that shows the slowdown starting in 0.23.1.
To Reproduce
Include code snippet
# /// script
# requires-python = "==3.12.9"
# dependencies = []
#
# ///
import time

import mlx.core as mx
import mlx.nn as nn


class LmGen(nn.Module):
    def __init__(self):
        super().__init__()
        self.gen_sequence = mx.full(
            shape=(1, 1),
            vals=-2,
            dtype=mx.int32
        )
        self.text_emb = nn.Embedding(32001, 4096)
        self.layers = [
            nn.Sequential(
                nn.Linear(4096, 512, bias=False),
                nn.ReLU(),
                nn.Linear(512, 4096, bias=False),
            )
            for _ in range(32)
        ]

    def step(self):
        xs = self.text_emb(self.gen_sequence)
        for layer in self.layers:
            xs = layer(xs)
        return xs


def main():
    WARMUP = 5
    TOTAL_STEPS = 100
    gen = LmGen()
    gen.set_dtype(mx.bfloat16)
    nn.quantize(gen, bits=4, group_size=32)
    sum_times = 0
    for i in range(TOTAL_STEPS):
        # Unrelated arrays evaluated before each step; they are part of the repro.
        data = mx.arange(8, dtype=mx.uint32)
        uploaded_image_embeddings = mx.arange(1152000, dtype=mx.bfloat16)
        mx.eval((data, uploaded_image_embeddings))
        # Time only the forward step.
        t1 = time.time()
        mx.eval(gen.step())
        t2 = time.time()
        # Skip the warmup iterations when accumulating.
        if i >= WARMUP:
            sum_times += t2 - t1
    print(f"average time per step: {(sum_times / (TOTAL_STEPS - WARMUP)) * 1000:.6f} ms")


main()
$ uv run --with mlx==0.22.1 something.py
average time per step: 1.492164 ms
$ uv run --with mlx==0.23.1 something.py
average time per step: 2.658959 ms
$ CMAKE_BUILD_PARALLEL_LEVEL=16 uv run --with git+https://github.com/ml-explore/mlx#2770a1024082eb10cce6bc0ac589ad089e7be611 something.py
average time per step: 2.870063 ms
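To put the three timings above in relative terms, the slowdown against the 0.22.1 baseline can be computed directly. This is a minimal sketch; the helper name `slowdown_pct` is hypothetical, and the inputs are just the numbers printed above.

```python
def slowdown_pct(baseline_ms: float, current_ms: float) -> float:
    """Return how much slower current_ms is than baseline_ms, in percent."""
    return (current_ms / baseline_ms - 1.0) * 100.0


baseline_ms = 1.492164  # mlx==0.22.1
timings_ms = {
    "mlx==0.23.1": 2.658959,
    "git 2770a10": 2.870063,
}

for label, ms in timings_ms.items():
    print(f"{label}: {slowdown_pct(baseline_ms, ms):.1f}% slower than 0.22.1")
```

On this small example the per-step time grows by roughly 78% and 92% respectively, which is much larger than the slowdown seen on the whole model.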
Expected behavior Performance should stay the same or similar from one MLX version to the next.
Desktop (please complete the following information):
ProductName: macOS
ProductVersion: 15.3.1
BuildVersion: 24D70
Model Name: MacBook Air
Model Identifier: Mac15,12
Model Number: MXCV3FN/A
Chip: Apple M3
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 16 GB
System Firmware Version: 11881.81.4
OS Loader Version: 11881.81.4
Serial Number (system): MW9GK71RY5
Hardware UUID: 810DA0DC-BEF2-5453-848C-AE07236C3260
Provisioning UDID: 00008122-001A089C2129001C
Activation Lock Status: Disabled
Can you try running on 0.23.1 or higher with these two env variables set:
MLX_MAX_OPS_PER_BUFFER=8 MLX_MAX_MB_PER_BUFFER=1000000 uv run --with mlx==0.23.1 something.py
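If it is easier to try several values from a script than from the shell, the same two variables can be passed to a child process. This is only a sketch: the helper `run_with_buffer_limits` is hypothetical, and the child command here is a stand-in that echoes the variables back, so the example runs without MLX installed.

```python
import os
import subprocess
import sys


def run_with_buffer_limits(cmd, max_ops=8, max_mb=1000000):
    """Run cmd with the two MLX buffer variables set in its environment."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["MLX_MAX_OPS_PER_BUFFER"] = str(max_ops)
    env["MLX_MAX_MB_PER_BUFFER"] = str(max_mb)
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in child process that prints the values it received.
echo_cmd = [
    sys.executable,
    "-c",
    "import os; print(os.environ['MLX_MAX_OPS_PER_BUFFER'], "
    "os.environ['MLX_MAX_MB_PER_BUFFER'])",
]
result = run_with_buffer_limits(echo_cmd)
print(result.stdout.strip())
```

To benchmark for real, replace `echo_cmd` with `["uv", "run", "--with", "mlx==0.23.1", "something.py"]` and read the timing from `result.stdout`.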
Thank you, adding the two environment variables fixes the performance issue, both for the small example and for the whole model. Sadly I didn't find anything in the docs related to those variables. Is there something I need to know on the subject to choose the right values?
Is there something I need to know on the subject to choose the right values?
Ideally not. We want to set these so they work reasonably well for the given platform. They are tuned for different GPU sizes.
Can you share the output of this?
python -c "import mlx.core as mx; print(mx.metal.device_info())"
Here it is, I hope it helps:
{'architecture': 'applegpu_g15g', 'max_buffer_length': 8589934592, 'max_recommended_working_set_size': 11453251584, 'memory_size': 17179869184, 'resource_limit': 499000}
Thanks for all the support!
Would you mind sharing one more thing:
ioreg -l | grep gpu-core-count
We may need more fine-grained settings for those variables based on the GPU core count / memory size. The Air and Pro are both g15g, but I think the values are better for the Pro and too high for the Air.
Here it is:
| | | | "gpu-core-count" = 10
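For anyone collecting this value across several machines, the core count can be pulled out of the `ioreg` line with a small parser. This is a sketch under the assumption that the output always looks like the line above; `parse_gpu_core_count` is a hypothetical helper, not part of MLX.

```python
import re


def parse_gpu_core_count(ioreg_output: str):
    """Return the first "gpu-core-count" value found in the text, or None."""
    match = re.search(r'"gpu-core-count"\s*=\s*(\d+)', ioreg_output)
    return int(match.group(1)) if match else None


sample = '| | | | "gpu-core-count" = 10'
print(parse_gpu_core_count(sample))  # prints 10
```

On a Mac, the input text could come from `subprocess.run(["ioreg", "-l"], capture_output=True, text=True).stdout`.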