Review forward and training benchmarking: small torch network has a 3-5x faster forward pass?
To reproduce, convert this gist to an ipynb and run it: https://gist.github.com/wjessup/ce49625cb551af8663059b93cbfab209
Takeaways:
- Smaller network: torch is 3-5x faster. Forward: 45.5 µs vs 235 µs; training: 337 µs vs. 1.22 ms.
- 640 x 4 layers ("large" network): similar performance.
- 6400 x 4 layers ("xlarge" network): MLX wins by ~2x. Forward: 3.85 ms vs 16.8 ms; training: 35.6 ms vs. 72 ms.
- 12000 x 4 layers: Forward: 15.2 ms (MLX) vs 53.8 ms (torch); training: 133 ms (MLX) vs 292 ms (torch).
MLX gains the most in the forward pass but loses much of that advantage in the rest of the training loop: the forward pass is ~5x faster, while the full training loop is only ~2x faster.
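For context, a rough sketch of the kind of comparison being made. This is illustrative only: the layer width, depth, batch size, activation, and function names here are placeholders, not the exact code from the gist.

```python
# Illustrative sketch only -- the exact widths, batch size, and activations
# are defined in the gist linked above; these values are placeholders.
import mlx.core as mx
import mlx.nn as nn
import torch

WIDTH, DEPTH, BATCH = 640, 4, 256  # "large" config; 6400 = "xlarge", 12000 = largest

class MLXNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(WIDTH, WIDTH) for _ in range(DEPTH)]

    def __call__(self, x):
        for layer in self.layers:
            x = nn.relu(layer(x))
        return x

mlx_net = MLXNet()
torch_net = torch.nn.Sequential(
    *[m for _ in range(DEPTH) for m in (torch.nn.Linear(WIDTH, WIDTH), torch.nn.ReLU())]
)

x_mlx = mx.random.normal((BATCH, WIDTH))
x_torch = torch.randn(BATCH, WIDTH)

def run_mlx_network():
    # mx.eval forces MLX's lazy graph to execute so %timeit measures real work
    mx.eval(mlx_net(x_mlx))

def run_torch_network():
    with torch.no_grad():
        torch_net(x_torch)
```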
If my benchmarking is accurate, would it make sense to create a "use cases" or "when to use" section? Even in the current documentation I noticed that you are using NumPy to create permutation batches (line 64 of https://github.com/ml-explore/mlx-examples/blob/main/transformer_lm/main.py).
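The pattern I mean is something like the following (a generic sketch, not the exact code from that file): the permutation and indexing happen in NumPy, and only the selected batch is handed to MLX.

```python
import numpy as np
import mlx.core as mx

def iterate_batches(X, y, batch_size):
    # X, y are NumPy arrays; the shuffle and indexing stay in NumPy and only
    # the selected rows are converted to mx.array for the MLX model.
    perm = np.random.permutation(len(X))
    for s in range(0, len(X) - batch_size + 1, batch_size):
        ids = perm[s : s + batch_size]
        yield mx.array(X[ids]), mx.array(y[ids])
```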
Detailed outputs:
MLX network:
print("internal steps") %timeit -n 100 -r 10 run_mlx_grads_and_loss() %timeit -n 100 -r 10 run_mlx_optimization() %timeit -n 100 -r 10 update_mlx_net_weights() print("run network") %timeit -n 100 -r 10 run_mlx_network() %timeit -n 30 -r 10 run_mlx_network_large() %timeit -n 30 -r 10 run_mlx_network_xlarge() print("full training") %timeit -n 30 -r 4 mlx_training_loop() %timeit -n 30 -r 4 mlx_training_loop_large() %timeit -n 30 -r 4 mlx_training_loop_xlarge() Gives these results:
```
internal steps
985 µs ± 125 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
813 µs ± 42.6 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
12.7 µs ± 229 ns per loop (mean ± std. dev. of 10 runs, 100 loops each)
run network
235 µs ± 21 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
365 µs ± 25.7 µs per loop (mean ± std. dev. of 10 runs, 30 loops each)
3.85 ms ± 291 µs per loop (mean ± std. dev. of 10 runs, 30 loops each)
full training
1.22 ms ± 155 µs per loop (mean ± std. dev. of 4 runs, 30 loops each)
1.58 ms ± 554 µs per loop (mean ± std. dev. of 4 runs, 30 loops each)
35.6 ms ± 1.37 ms per loop (mean ± std. dev. of 4 runs, 30 loops each)
```

Torch network:
```
%timeit -n 100 -r 10 run_torch_network()
%timeit -n 30 -r 4 run_torch_large_network()
%timeit -n 30 -r 4 run_torch_xlarge_network()
print()
%timeit -n 100 -r 10 run_torch_optimizer_step()
%timeit -n 10 -r 3 run_torch_large_optimizer_step()
%timeit -n 10 -r 3 run_torch_xlarge_optimizer_step()
```

Results:
```
45.5 µs ± 2.81 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
433 µs ± 29.3 µs per loop (mean ± std. dev. of 4 runs, 30 loops each)
16.8 ms ± 173 µs per loop (mean ± std. dev. of 4 runs, 30 loops each)

337 µs ± 8.8 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
1.51 ms ± 49 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
72 ms ± 2.91 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```
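For reference, the MLX "internal steps" timed above (grads/loss, optimizer step, weight update) follow the standard MLX training pattern. A minimal self-contained sketch is below; the model, loss, sizes, and data are placeholders rather than the gist's exact code.

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

WIDTH, BATCH = 640, 256  # placeholder sizes
model = nn.Sequential(nn.Linear(WIDTH, WIDTH), nn.ReLU(), nn.Linear(WIDTH, WIDTH))
optimizer = optim.SGD(learning_rate=1e-2)

x = mx.random.normal((BATCH, WIDTH))
y = mx.random.normal((BATCH, WIDTH))

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y)

# value_and_grad returns both the loss and the gradients w.r.t. the model parameters
loss_and_grad_fn = nn.value_and_grad(model, loss_fn)

def run_mlx_grads_and_loss():
    loss, grads = loss_and_grad_fn(model, x, y)
    mx.eval(loss, grads)  # force the lazy computation so the timing is real
    return grads

grads = run_mlx_grads_and_loss()

def run_mlx_optimization():
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)
```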
Friendly bump. Is there a better way to share the benchmark w/ you?
Hey @wjessup, we are working hard on perf right now. Sorry for not adding commentary to this benchmark. A lot of the work we are doing is likely to improve your benchmark though (and it likely has already improved since v0.0.10). Have you run it since then?
Also @wjessup I notice that you are benchmarking torch on the CPU with MLX on the GPU. You should compare the same device for both. For small ops the CPU will be much faster, but for large ops the GPU will be much faster. What we want is that, conditioned on the selected device, MLX is faster 😄. I don't think we are there yet for all the cases in your benchmarks, but we should be getting closer and should be faster in some already.
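Concretely, a like-for-like GPU comparison would look roughly like this (illustrative sketch only; the network, sizes, and function names are placeholders):

```python
import torch
import mlx.core as mx

# Placeholder torch network and input, standing in for the gist's definitions
torch_net = torch.nn.Linear(640, 640)
x_torch = torch.randn(256, 640)

# Put torch on the Apple GPU via MPS (falls back to CPU if MPS is unavailable)
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
torch_net = torch_net.to(device)
x_torch = x_torch.to(device)

def run_torch_network_gpu():
    with torch.no_grad():
        torch_net(x_torch)
    if device.type == "mps":
        torch.mps.synchronize()  # wait for queued GPU work before the timer stops

# MLX side: pick the matching device (use mx.cpu for the CPU-vs-CPU comparison)
mx.set_default_device(mx.gpu)
```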