SIMD CG transform and related improvements
Hi!
I am trying to write a software 3D rasterizer and want to use nalgebra with SIMD as a math library. Here I tried to add AoSoA SIMD support for Matrix4::transform_point() and friends. I also benchmarked my modifications to make sure that I didn't regress anything, and to make sure that SIMD support makes sense here at all.
Benchmarks were performed on various CPUs that I have. Here are the results:
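To illustrate what AoSoA means here, below is a minimal sketch in plain Rust with no dependencies. It is not the code from this PR and not nalgebra's API: `Lanes`, `Point3x4`, `affine_row` and `transform_point_x4` are hypothetical names, the `[f32; 4]` array stands in for a SIMD type such as simba's 4-lane wide float, and the sketch assumes one scalar matrix shared by all lanes. It also always divides by w and does not handle w == 0.

```rust
/// Four f32 lanes; a plain-array stand-in for a real SIMD type.
type Lanes = [f32; 4];

/// Four 3D points in AoSoA layout: each coordinate stores one lane per point.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point3x4 {
    x: Lanes,
    y: Lanes,
    z: Lanes,
}

/// Computes `row[0]*x + row[1]*y + row[2]*z + row[3]` independently per lane,
/// i.e. one matrix row applied to four points (x, y, z, 1) at once.
fn affine_row(row: [f32; 4], p: &Point3x4) -> Lanes {
    let mut out = [0.0; 4];
    for i in 0..4 {
        out[i] = row[0] * p.x[i] + row[1] * p.y[i] + row[2] * p.z[i] + row[3];
    }
    out
}

/// Transforms four points by a row-major 4x4 matrix in homogeneous
/// coordinates, then divides each lane by its own w.
/// Assumption of this sketch: w is never zero.
fn transform_point_x4(m: &[[f32; 4]; 4], p: &Point3x4) -> Point3x4 {
    let w = affine_row(m[3], p);
    let mut out = Point3x4 {
        x: affine_row(m[0], p),
        y: affine_row(m[1], p),
        z: affine_row(m[2], p),
    };
    for i in 0..4 {
        out.x[i] /= w[i];
        out.y[i] /= w[i];
        out.z[i] /= w[i];
    }
    out
}
```

With a real SIMD type each inner loop collapses into a single vector instruction, which is why one wide call can be cheaper than four scalar calls.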
Benchmark results on AMD Ryzen 9 5950X, Linux, compared to previous one:
$ RUSTFLAGS="-C target-cpu=x86-64-v2 -C target-feature=+avx" bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2 time: [317.64 ps 317.78 ps 318.02 ps]
change: [+0.4783% +0.5229% +0.5801%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
mat4_transform_vector3 time: [441.89 ps 442.50 ps 443.11 ps]
change: [+0.2644% +0.3559% +0.4672%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
mat3_transform_point2 time: [316.78 ps 317.11 ps 317.45 ps]
change: [+0.3692% +0.4728% +0.5758%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
mat4_transform_point3 time: [443.89 ps 443.96 ps 444.04 ps]
change: [+0.9537% +1.0151% +1.0630%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
mat3_transform_vector2_x4wide
time: [425.44 ps 425.70 ps 425.89 ps]
mat4_transform_vector3_x4wide
time: [646.75 ps 646.85 ps 647.00 ps]
Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) low severe
1 (1.00%) low mild
1 (1.00%) high mild
3 (3.00%) high severe
mat3_transform_point2_x4wide
time: [422.59 ps 422.71 ps 422.87 ps]
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) high mild
7 (7.00%) high severe
mat4_transform_point3_x4wide
time: [636.52 ps 636.61 ps 636.70 ps]
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
mat4_transform_vector3_no_division
time: [443.83 ps 443.97 ps 444.13 ps]
change: [-0.1875% -0.1171% -0.0495%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low severe
1 (1.00%) low mild
7 (7.00%) high mild
2 (2.00%) high severe
$ bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2 time: [316.72 ps 316.99 ps 317.22 ps]
change: [-0.2793% -0.1789% -0.0874%] (p = 0.00 < 0.05)
Change within noise threshold.
mat4_transform_vector3 time: [441.41 ps 441.49 ps 441.58 ps]
change: [+0.1228% +0.5653% +0.8103%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
3 (3.00%) low severe
3 (3.00%) high mild
6 (6.00%) high severe
mat3_transform_point2 time: [317.29 ps 317.52 ps 317.74 ps]
change: [+0.1923% +0.2853% +0.3731%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
6 (6.00%) low severe
1 (1.00%) high mild
6 (6.00%) high severe
mat4_transform_point3 time: [441.69 ps 441.79 ps 441.91 ps]
change: [-0.0066% +0.1238% +0.2511%] (p = 0.06 > 0.05)
No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low severe
1 (1.00%) low mild
2 (2.00%) high mild
6 (6.00%) high severe
mat3_transform_vector2_x4wide
time: [430.98 ps 431.02 ps 431.06 ps]
change: [+1.5248% +1.6230% +1.7210%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
mat4_transform_vector3_x4wide
time: [646.64 ps 646.72 ps 646.81 ps]
change: [-0.0589% -0.0249% +0.0012%] (p = 0.11 > 0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
mat3_transform_point2_x4wide
time: [431.82 ps 431.89 ps 431.96 ps]
change: [+2.0913% +2.1413% +2.1815%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low severe
2 (2.00%) low mild
3 (3.00%) high severe
mat4_transform_point3_x4wide
time: [646.37 ps 646.41 ps 646.46 ps]
change: [+1.5411% +1.5729% +1.6038%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low mild
1 (1.00%) high mild
5 (5.00%) high severe
mat4_transform_vector3_no_division
time: [443.87 ps 444.10 ps 444.31 ps]
change: [-0.6030% -0.4238% -0.2411%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
Benchmark results on Intel i7-8565U, Linux, compared to previous one:
$ RUSTFLAGS="-C target-cpu=x86-64-v2 -C target-feature=+avx" bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2 time: [293.12 ps 295.95 ps 298.98 ps]
change: [+1.4817% +2.1154% +2.6804%] (p = 0.00 < 0.05)
Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
15 (15.00%) low severe
1 (1.00%) low mild
1 (1.00%) high severe
mat4_transform_vector3 time: [540.54 ps 540.82 ps 541.10 ps]
change: [-8.7714% -8.5103% -8.2102%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) low severe
1 (1.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
mat3_transform_point2 time: [305.57 ps 305.91 ps 306.43 ps]
change: [+4.2420% +4.6490% +4.9390%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low severe
1 (1.00%) low mild
3 (3.00%) high mild
3 (3.00%) high severe
mat4_transform_point3 time: [546.64 ps 546.97 ps 547.39 ps]
change: [-7.5737% -7.2913% -7.0061%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
2 (2.00%) low severe
3 (3.00%) low mild
3 (3.00%) high mild
5 (5.00%) high severe
mat3_transform_vector2_x4wide
time: [499.51 ps 499.85 ps 500.30 ps]
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
1 (1.00%) high mild
8 (8.00%) high severe
mat4_transform_vector3_x4wide
time: [774.40 ps 775.65 ps 777.02 ps]
Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) low mild
2 (2.00%) high mild
3 (3.00%) high severe
mat3_transform_point2_x4wide
time: [526.26 ps 529.71 ps 535.84 ps]
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low mild
3 (3.00%) high mild
7 (7.00%) high severe
mat4_transform_point3_x4wide
time: [796.16 ps 796.74 ps 797.42 ps]
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low severe
2 (2.00%) low mild
2 (2.00%) high mild
3 (3.00%) high severe
mat4_transform_vector3_no_division
time: [529.04 ps 530.22 ps 531.70 ps]
change: [-15.420% -13.707% -12.683%] (p = 0.00 < 0.05)
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low severe
7 (7.00%) high mild
3 (3.00%) high severe
$ bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2 time: [307.16 ps 307.34 ps 307.55 ps]
change: [+4.7516% +5.1517% +5.5680%] (p = 0.00 < 0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low severe
2 (2.00%) low mild
2 (2.00%) high severe
mat4_transform_vector3 time: [575.44 ps 576.15 ps 576.97 ps]
change: [+5.2866% +5.6903% +6.0084%] (p = 0.00 < 0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low severe
4 (4.00%) high mild
mat3_transform_point2 time: [301.50 ps 301.79 ps 302.15 ps]
change: [+2.2124% +2.5777% +2.9744%] (p = 0.00 < 0.05)
Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
2 (2.00%) low severe
4 (4.00%) high mild
6 (6.00%) high severe
mat4_transform_point3 time: [584.46 ps 584.88 ps 585.45 ps]
change: [-4.1562% -3.8892% -3.5962%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
1 (1.00%) low severe
2 (2.00%) high mild
9 (9.00%) high severe
mat3_transform_vector2_x4wide
time: [530.59 ps 536.69 ps 544.38 ps]
change: [+6.3284% +6.9464% +7.6364%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) low severe
1 (1.00%) low mild
1 (1.00%) high mild
4 (4.00%) high severe
mat4_transform_vector3_x4wide
time: [813.42 ps 814.31 ps 815.26 ps]
change: [+4.3840% +4.9097% +5.3254%] (p = 0.00 < 0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low severe
4 (4.00%) high mild
mat3_transform_point2_x4wide
time: [523.32 ps 523.71 ps 524.08 ps]
change: [-5.0939% -2.1361% -0.3320%] (p = 0.08 > 0.05)
No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe
mat4_transform_point3_x4wide
time: [842.23 ps 842.86 ps 843.56 ps]
change: [+5.5553% +5.9587% +6.2648%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low severe
1 (1.00%) high severe
mat4_transform_vector3_no_division
time: [566.75 ps 567.19 ps 567.60 ps]
change: [-1.3344% -0.9935% -0.4433%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
3 (3.00%) high mild
2 (2.00%) high severe
Benchmark results on Apple M2 Max, Linux, compared to previous one:
$ cargo bench --all-features --bench nalgebra_bench -- _transform_
mat3_transform_vector2 time: [315.56 ps 316.70 ps 317.92 ps]
change: [-0.6503% -0.0844% +0.4491%] (p = 0.76 > 0.05)
No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
mat4_transform_vector3 time: [383.68 ps 383.82 ps 384.07 ps]
change: [-0.0143% +0.1640% +0.4628%] (p = 0.24 > 0.05)
No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
6 (6.00%) high mild
7 (7.00%) high severe
mat3_transform_point2 time: [318.32 ps 318.98 ps 319.63 ps]
change: [+0.9643% +1.4702% +2.0348%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high severe
mat4_transform_point3 time: [384.04 ps 384.21 ps 384.43 ps]
change: [+0.1676% +0.2195% +0.2713%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
mat3_transform_vector2_x4wide
time: [309.13 ps 309.56 ps 310.00 ps]
Found 7 outliers among 100 measurements (7.00%)
5 (5.00%) high mild
2 (2.00%) high severe
mat4_transform_vector3_x4wide
time: [460.39 ps 460.46 ps 460.56 ps]
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
mat3_transform_point2_x4wide
time: [308.68 ps 309.09 ps 309.52 ps]
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
mat4_transform_point3_x4wide
time: [460.38 ps 460.42 ps 460.46 ps]
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
mat4_transform_vector3_no_division
time: [383.69 ps 383.73 ps 383.78 ps]
change: [-0.0192% +0.0117% +0.0459%] (p = 0.51 > 0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
Benchmark results on Broadcom BCM2711 (Raspberry Pi 4), Linux, compared to previous one:
mat3_transform_vector2 time: [1.1179 ns 1.1185 ns 1.1191 ns]
change: [+0.1194% +0.2209% +0.3215%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
7 (7.00%) high mild
4 (4.00%) high severe
mat4_transform_vector3 time: [1.6815 ns 1.6817 ns 1.6819 ns]
change: [-0.0434% -0.0159% +0.0102%] (p = 0.26 > 0.05)
No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
7 (7.00%) high mild
2 (2.00%) high severe
mat3_transform_point2 time: [1.1170 ns 1.1174 ns 1.1179 ns]
change: [-0.0609% +0.0024% +0.0661%] (p = 0.94 > 0.05)
No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) high mild
7 (7.00%) high severe
mat4_transform_point3 time: [1.6817 ns 1.6819 ns 1.6821 ns]
change: [-0.0088% +0.0325% +0.0857%] (p = 0.20 > 0.05)
No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe
mat3_transform_vector2_x4wide
time: [2.2614 ns 2.2618 ns 2.2622 ns]
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) high mild
4 (4.00%) high severe
mat4_transform_vector3_x4wide
time: [3.3940 ns 3.3960 ns 3.3993 ns]
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) low mild
3 (3.00%) high mild
5 (5.00%) high severe
mat3_transform_point2_x4wide
time: [2.2614 ns 2.2617 ns 2.2619 ns]
Found 8 outliers among 100 measurements (8.00%)
5 (5.00%) high mild
3 (3.00%) high severe
mat4_transform_point3_x4wide
time: [3.3941 ns 3.3954 ns 3.3970 ns]
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) low mild
2 (2.00%) high mild
6 (6.00%) high severe
mat4_transform_vector3_no_division
time: [1.9712 ns 1.9720 ns 1.9729 ns]
change: [-0.0164% +0.0278% +0.0815%] (p = 0.26 > 0.05)
No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) high mild
6 (6.00%) high severe
The results are mostly expected: there are no significant regressions in non-SIMD benchmarks; SIMD makes sense even on the potato CPU of Raspberry Pi 4. The only exception is the older Intel CPU (i7-8565U), where transform_vector() and transform_point() slightly regressed for a 2D case, but improved for a 3D case.
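For reference, this is how the scalar and x4wide numbers above can be compared per element. It is only arithmetic on the reported medians, taken at face value, and it assumes that each x4wide iteration transforms four elements while each scalar iteration transforms one. The helper names are made up for illustration.

```rust
/// Time per transformed element, given a benchmark median and the number of
/// elements processed per iteration (assumed: 1 for scalar, 4 for x4wide).
fn per_element_ps(median_ps: f64, lanes: u32) -> f64 {
    median_ps / f64::from(lanes)
}

/// Per-element speedup of the wide variant over the scalar one.
fn speedup(scalar_median_ps: f64, wide_median_ps: f64, lanes: u32) -> f64 {
    per_element_ps(scalar_median_ps, 1) / per_element_ps(wide_median_ps, lanes)
}
```

With the Ryzen 9 5950X AVX medians for mat4_transform_point3 (443.96 ps scalar, 636.61 ps x4wide) this gives roughly 2.8x per point. With the Raspberry Pi 4 medians (1.6819 ns scalar, 3.3954 ns x4wide) it gives roughly 2.0x.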
It is also possible to leave existing functions as is, and add new SIMD-specific functions instead.
Other semi-related changes included in this PR:
- Perspective3::project_vector() fixed so that result matches the result of matrix multiplication.
- Added tests for {Orthographic3, Perspective3}::project_vector() (mainly to understand what exactly Perspective3 projection does for a vector).
- Added codegen-units = 1 for benchmarks, as otherwise results are less consistent and change after unrelated code changes.
- Improved documentation for Matrix*::transform*() and Perspective3::project_*().
- Added tests for Matrix*::transform_*().
P.S.: Added SIMD benchmarks require simba with this PR merged: https://github.com/dimforge/simba/pull/76
If you decide to merge this PR, please merge it without squashing into a single commit. A single commit would make benchmarking before/after changes much more difficult.
I asked my friends to run the same benchmarks on other CPUs (preferably Intel) and got some interesting results. It turned out that some benchmarks regressed and some improved on all tested Intel CPUs. The regression was more or less stable across runs, but improvements were seemingly random. Then we tried to disable the Intel Turbo Boost and this "fixed" both regressions and improvements.
My benchmarking routine is basically the following:
- cpupower frequency-set --governor performance
- echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
- git checkout $BEFORE_CHANGES
- RUSTFLAGS="-C target-cpu=x86-64-v2 -C target-feature=+avx" cargo bench --all-features --bench nalgebra_bench -- --save-baseline base _transform_
- git checkout $AFTER_CHANGES
- RUSTFLAGS="-C target-cpu=x86-64-v2 -C target-feature=+avx" cargo bench --all-features --bench nalgebra_bench -- --baseline-lenient base _transform_
Intel i9-9900K:
mat3_transform_vector2 time: [331.96 ps 335.43 ps 338.46 ps]
change: [-1.5766% -0.5575% +0.4401%] (p = 0.29 > 0.05)
No change in performance detected.
Found 20 outliers among 100 measurements (20.00%)
15 (15.00%) low severe
5 (5.00%) low mild
mat4_transform_vector3 time: [633.70 ps 634.00 ps 634.22 ps]
change: [-0.1789% +0.1501% +0.4816%] (p = 0.44 > 0.05)
No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
9 (9.00%) low severe
4 (4.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe
mat3_transform_point2 time: [333.65 ps 334.05 ps 334.41 ps]
change: [-0.6090% -0.2228% +0.1225%] (p = 0.25 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low severe
2 (2.00%) low mild
mat4_transform_point3 time: [632.13 ps 632.38 ps 632.60 ps]
change: [-0.1452% +0.1086% +0.4231%] (p = 0.54 > 0.05)
No change in performance detected.
Found 19 outliers among 100 measurements (19.00%)
9 (9.00%) low severe
2 (2.00%) low mild
4 (4.00%) high mild
4 (4.00%) high severe
mat3_transform_vector2_x4wide
time: [623.93 ps 624.03 ps 624.11 ps]
Found 8 outliers among 100 measurements (8.00%)
4 (4.00%) low severe
2 (2.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
mat4_transform_vector3_x4wide
time: [921.04 ps 921.18 ps 921.30 ps]
Found 16 outliers among 100 measurements (16.00%)
7 (7.00%) low severe
4 (4.00%) low mild
4 (4.00%) high mild
1 (1.00%) high severe
mat3_transform_point2_x4wide
time: [622.56 ps 622.84 ps 623.10 ps]
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low severe
3 (3.00%) low mild
mat4_transform_point3_x4wide
time: [921.65 ps 921.76 ps 921.87 ps]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) low severe
mat4_transform_vector3_no_division
time: [658.94 ps 658.98 ps 659.03 ps]
change: [-0.4661% -0.0747% +0.3524%] (p = 0.77 > 0.05)
No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) low severe
2 (2.00%) low mild
2 (2.00%) high mild
$ rustc -vV
rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc41)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: x86_64-unknown-linux-gnu
release: 1.89.0
LLVM version: 19.1.7
Intel i7-8565U:
mat3_transform_vector2 time: [670.49 ps 670.56 ps 670.65 ps]
change: [-1.0910% -0.3353% +0.2519%] (p = 0.41 > 0.05)
No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
3 (3.00%) low severe
3 (3.00%) high mild
8 (8.00%) high severe
mat4_transform_vector3 time: [1.2667 ns 1.2670 ns 1.2672 ns]
change: [-0.3486% -0.0169% +0.2990%] (p = 0.86 > 0.05)
No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
2 (2.00%) low severe
5 (5.00%) low mild
4 (4.00%) high mild
2 (2.00%) high severe
mat3_transform_point2 time: [670.51 ps 670.60 ps 670.67 ps]
change: [-0.6555% -0.1881% +0.2037%] (p = 0.43 > 0.05)
No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) low severe
1 (1.00%) low mild
1 (1.00%) high mild
2 (2.00%) high severe
mat4_transform_point3 time: [1.2670 ns 1.2671 ns 1.2673 ns]
change: [-0.3295% -0.0358% +0.2261%] (p = 0.83 > 0.05)
No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
1 (1.00%) low severe
3 (3.00%) low mild
7 (7.00%) high mild
3 (3.00%) high severe
mat3_transform_vector2_x4wide
time: [1.2511 ns 1.2520 ns 1.2534 ns]
Found 11 outliers among 100 measurements (11.00%)
2 (2.00%) low severe
1 (1.00%) low mild
5 (5.00%) high mild
3 (3.00%) high severe
mat4_transform_vector3_x4wide
time: [1.8466 ns 1.8758 ns 1.9305 ns]
Found 12 outliers among 100 measurements (12.00%)
3 (3.00%) low severe
1 (1.00%) low mild
2 (2.00%) high mild
6 (6.00%) high severe
mat3_transform_point2_x4wide
time: [1.2511 ns 1.2514 ns 1.2519 ns]
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) low severe
2 (2.00%) low mild
5 (5.00%) high mild
1 (1.00%) high severe
mat4_transform_point3_x4wide
time: [1.8463 ns 1.8467 ns 1.8471 ns]
Found 11 outliers among 100 measurements (11.00%)
4 (4.00%) low severe
4 (4.00%) high mild
3 (3.00%) high severe
mat4_transform_vector3_no_division
time: [1.3203 ns 1.3211 ns 1.3227 ns]
change: [-0.1767% +0.3608% +1.0624%] (p = 0.34 > 0.05)
No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
4 (4.00%) low severe
1 (1.00%) low mild
2 (2.00%) high mild
4 (4.00%) high severe
$ rustc -vV
rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc42)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: x86_64-unknown-linux-gnu
release: 1.89.0
LLVM version: 20.1.8
Qualcomm Snapdragon SC8280XP:
mat3_transform_vector2 time: [336.41 ps 336.44 ps 336.47 ps]
change: [-0.1865% -0.1207% -0.0586%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) low severe
3 (3.00%) high mild
3 (3.00%) high severe
mat4_transform_vector3 time: [672.13 ps 672.51 ps 672.94 ps]
change: [+19.334% +19.508% +19.680%] (p = 0.00 < 0.05)
Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
2 (2.00%) low severe
1 (1.00%) low mild
2 (2.00%) high mild
12 (12.00%) high severe
mat3_transform_point2 time: [368.87 ps 370.03 ps 371.06 ps]
change: [-1.7333% -1.2399% -0.7379%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) low mild
2 (2.00%) high mild
mat4_transform_point3 time: [672.53 ps 672.63 ps 672.74 ps]
change: [+19.161% +19.290% +19.392%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
5 (5.00%) low severe
1 (1.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
mat3_transform_vector2_x4wide
time: [1.1753 ns 1.1758 ns 1.1763 ns]
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) low mild
5 (5.00%) high mild
3 (3.00%) high severe
mat4_transform_vector3_x4wide
time: [1.2727 ns 1.2732 ns 1.2739 ns]
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) low mild
1 (1.00%) high mild
5 (5.00%) high severe
mat3_transform_point2_x4wide
time: [1.1764 ns 1.1765 ns 1.1766 ns]
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) low severe
2 (2.00%) high mild
2 (2.00%) high severe
mat4_transform_point3_x4wide
time: [1.2721 ns 1.2724 ns 1.2728 ns]
Found 13 outliers among 100 measurements (13.00%)
3 (3.00%) low severe
5 (5.00%) low mild
5 (5.00%) high severe
mat4_transform_vector3_no_division
time: [670.60 ps 670.65 ps 670.70 ps]
change: [+19.198% +19.363% +19.482%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) low severe
1 (1.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
$ rustc -vV
rustc 1.89.0 (29483883e 2025-08-04) (Fedora 1.89.0-2.fc42)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: aarch64-unknown-linux-gnu
release: 1.89.0
LLVM version: 20.1.8
Intel i3-1115G4:
# Before
mat3_transform_vector2 time: [269.47 ps 269.73 ps 269.97 ps]
Found 31 outliers among 100 measurements (31.00%)
19 (19.00%) low severe
1 (1.00%) low mild
2 (2.00%) high mild
9 (9.00%) high severe
mat4_transform_vector3 time: [715.00 ps 715.42 ps 715.80 ps]
Found 34 outliers among 100 measurements (34.00%)
24 (24.00%) low severe
2 (2.00%) high mild
8 (8.00%) high severe
mat3_transform_point2 time: [269.38 ps 269.67 ps 269.94 ps]
Found 30 outliers among 100 measurements (30.00%)
21 (21.00%) low severe
1 (1.00%) high mild
8 (8.00%) high severe
mat4_transform_point3 time: [714.56 ps 715.05 ps 715.52 ps]
Found 9 outliers among 100 measurements (9.00%)
6 (6.00%) low severe
1 (1.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
mat4_transform_vector3_no_division
time: [740.11 ps 740.61 ps 741.10 ps]
Found 13 outliers among 100 measurements (13.00%)
9 (9.00%) low severe
4 (4.00%) low mild
# After
mat3_transform_vector2 time: [269.35 ps 269.56 ps 269.79 ps]
Found 32 outliers among 100 measurements (32.00%)
24 (24.00%) low severe
3 (3.00%) high mild
5 (5.00%) high severe
mat4_transform_vector3 time: [714.96 ps 715.42 ps 715.83 ps]
Found 22 outliers among 100 measurements (22.00%)
11 (11.00%) low severe
4 (4.00%) low mild
5 (5.00%) high mild
2 (2.00%) high severe
mat3_transform_point2 time: [270.14 ps 273.05 ps 276.49 ps]
Found 22 outliers among 100 measurements (22.00%)
5 (5.00%) low severe
7 (7.00%) low mild
10 (10.00%) high severe
mat4_transform_point3 time: [715.12 ps 715.77 ps 716.56 ps]
Found 13 outliers among 100 measurements (13.00%)
2 (2.00%) low severe
3 (3.00%) low mild
8 (8.00%) high severe
mat3_transform_vector2_x4wide
time: [783.04 ps 783.23 ps 783.41 ps]
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) low severe
2 (2.00%) high mild
2 (2.00%) high severe
mat4_transform_vector3_x4wide
time: [867.52 ps 879.03 ps 892.57 ps]
Found 23 outliers among 100 measurements (23.00%)
5 (5.00%) low severe
5 (5.00%) low mild
1 (1.00%) high mild
12 (12.00%) high severe
mat3_transform_point2_x4wide
time: [783.32 ps 784.07 ps 785.11 ps]
Found 12 outliers among 100 measurements (12.00%)
3 (3.00%) low severe
3 (3.00%) high mild
6 (6.00%) high severe
mat4_transform_point3_x4wide
time: [871.11 ps 872.93 ps 874.44 ps]
Found 20 outliers among 100 measurements (20.00%)
11 (11.00%) low severe
7 (7.00%) low mild
2 (2.00%) high severe
mat4_transform_vector3_no_division
time: [740.93 ps 746.02 ps 752.11 ps]
Found 27 outliers among 100 measurements (27.00%)
9 (9.00%) low severe
5 (5.00%) low mild
13 (13.00%) high severe
fr0@calculate ~/nalgebra $ rustc -vV
rustc 1.88.0 (6b00bc388 2025-06-23) (gentoo)
binary: rustc
commit-hash: 6b00bc3880198600130e1cf62b8f8a93494488cc
commit-date: 2025-06-23
host: x86_64-unknown-linux-gnu
release: 1.88.0
LLVM version: 20.1.7
Intel Core Ultra 5 135H:
mat3_transform_vector2 time: [229.52 ps 239.71 ps 250.38 ps]
change: [+12.450% +18.380% +24.916%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
mat4_transform_vector3 time: [304.63 ps 316.86 ps 330.10 ps]
change: [+0.7303% +7.5166% +15.112%] (p = 0.04 < 0.05)
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
mat3_transform_point2 time: [202.05 ps 208.41 ps 215.52 ps]
change: [-1.5674% +5.0112% +12.211%] (p = 0.14 > 0.05)
No change in performance detected.
mat4_transform_point3 time: [298.05 ps 309.35 ps 320.75 ps]
change: [-4.2077% +1.6736% +7.5483%] (p = 0.57 > 0.05)
No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
mat3_transform_vector2_x4wide
time: [535.45 ps 546.86 ps 559.21 ps]
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
mat4_transform_vector3_x4wide
time: [569.78 ps 585.57 ps 601.68 ps]
mat3_transform_point2_x4wide
time: [540.24 ps 554.80 ps 570.01 ps]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
mat4_transform_point3_x4wide
time: [547.72 ps 560.44 ps 573.66 ps]
Found 10 outliers among 100 measurements (10.00%)
7 (7.00%) high mild
3 (3.00%) high severe
mat4_transform_vector3_no_division
time: [553.34 ps 569.96 ps 587.30 ps]
change: [+65.222% +75.055% +85.760%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
rustc 1.89.0 (29483883e 2025-08-04)
binary: rustc
commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
commit-date: 2025-08-04
host: x86_64-pc-windows-msvc
release: 1.89.0
LLVM version: 20.1.7
Given the Turbo Boost-related weirdness and a significant consistent regression on Qualcomm SC8280XP, I think it will be better to just add separate SIMD-supporting functions. Will update this PR soon.
Done! Still requires Simba with https://github.com/dimforge/simba/pull/76
FYI: I filed a Rust issue about performance regressions with fat LTO and default codegen-units: https://github.com/rust-lang/rust/issues/146497
All the benchmarks that I did are invalid because of this: https://github.com/dimforge/nalgebra/issues/1547 😕