Optimising `cobra_apply`

Open mfreeborn opened this issue 3 weeks ago • 9 comments

See #46

mfreeborn avatar Nov 15 '25 15:11 mfreeborn

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 99.26%. Comparing base (2e67b5c) to head (4f87d2e).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
+ Coverage   99.16%   99.26%   +0.09%     
==========================================
  Files          12       12              
  Lines        2167     2165       -2     
==========================================
  Hits         2149     2149              
+ Misses         18       16       -2     


codecov-commenter avatar Nov 15 '25 15:11 codecov-commenter

On my Zen 4 CPU this is a consistent regression in the default configuration and makes little difference with -C target-cpu=native:

cargo bench --bench=bit_reversal
     Running benches/bit_reversal.rs (target/release/deps/bit_reversal-7310e7572d98d06c)
cobra_apply/cobra/15    time:   [53.719 µs 53.849 µs 54.000 µs]
                        change: [+20.898% +21.193% +21.543%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
cobra_apply/cobra/16    time:   [105.59 µs 105.76 µs 105.92 µs]
                        change: [+24.041% +24.295% +24.599%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
cobra_apply/cobra/17    time:   [210.84 µs 211.11 µs 211.40 µs]
                        change: [+24.459% +24.613% +24.777%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
cobra_apply/cobra/18    time:   [417.87 µs 418.34 µs 418.83 µs]
                        change: [+24.279% +24.441% +24.611%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
cobra_apply/cobra/19    time:   [839.32 µs 839.56 µs 839.83 µs]
                        change: [+25.252% +25.328% +25.406%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
RUSTFLAGS='-C target-cpu=native' cargo bench --bench=bit_reversal
cobra_apply/cobra/15    time:   [40.607 µs 40.631 µs 40.656 µs]
                        change: [−1.5313% −1.4557% −1.3814%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
cobra_apply/cobra/16    time:   [79.011 µs 79.043 µs 79.082 µs]
                        change: [+0.5886% +0.6283% +0.6723%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
cobra_apply/cobra/17    time:   [158.08 µs 158.18 µs 158.29 µs]
                        change: [+1.4714% +1.5514% +1.6366%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe
cobra_apply/cobra/18    time:   [315.23 µs 315.38 µs 315.53 µs]
                        change: [+3.6858% +3.7494% +3.8149%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
cobra_apply/cobra/19    time:   [634.41 µs 634.94 µs 635.44 µs]
                        change: [+3.3992% +3.4810% +3.5690%] (p = 0.00 < 0.05)
                        Performance has regressed.

On what hardware did you measure it?

Shnatsel avatar Nov 15 '25 15:11 Shnatsel

Interesting!

CPU is AMD Ryzen™ 5 5625U with Radeon™ Graphics × 12.

That said, I didn't set the target-cpu...

mfreeborn avatar Nov 15 '25 15:11 mfreeborn

Hmm. Rust version? Mine is rustc 1.91.1 (ed61e7d7e 2025-11-07)

Shnatsel avatar Nov 15 '25 15:11 Shnatsel

rust 1.91.0

The +/- percentages might be a bit messed up because of the order in which I ran the benches, but the absolute numbers show a stark benefit from the LUT!

With LUT, target-cpu=native
Benchmarking cobra_apply/cobra/15: Collecting 100 samples in estimated 5.1372 s (81k i
cobra_apply/cobra/15    time:   [64.174 µs 64.578 µs 65.008 µs]
                        change: [−21.340% −20.823% −20.346%] (p = 0.00

Without LUT, target-cpu=native
Benchmarking cobra_apply/cobra/15: Collecting 100 samples in estimated 5.2045 s (50k i
cobra_apply/cobra/15    time:   [103.51 µs 103.85 µs 104.23 µs]
                        change: [+60.409% +61.184% +62.011%] (p = 0.00

With LUT, no RUSTFLAGS
cobra_apply/cobra/15    time:   [76.037 µs 76.244 µs 76.491 µs]
                        change: [−28.284% −27.892% −27.445%] (p = 0.00

Without LUT, no RUSTFLAGS
Benchmarking cobra_apply/cobra/15: Collecting 100 samples in estimated 5.3685 s (50k i
cobra_apply/cobra/15    time:   [106.26 µs 106.56 µs 106.90 µs]
                        change: [+2.3883% +2.8431% +3.2761%] (p = 0.00
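
For readers unfamiliar with the term, here is a rough sketch of what a lookup-table (LUT) bit reversal can look like compared with computing the reversal directly. This is not the code from this PR, which the thread does not show. The names `reverse_direct`, `build_byte_lut` and `reverse_lut` are made up for illustration, and the LUT variant assumes indices of at most 16 bits.

```rust
/// Reverse the low `bits` bits of `x` using the built-in `reverse_bits`.
fn reverse_direct(x: usize, bits: u32) -> usize {
    if bits == 0 {
        return 0;
    }
    x.reverse_bits() >> (usize::BITS - bits)
}

/// Build a 256-entry table mapping every byte to its bit-reversed value.
fn build_byte_lut() -> [u8; 256] {
    let mut lut = [0u8; 256];
    for (i, slot) in lut.iter_mut().enumerate() {
        *slot = (i as u8).reverse_bits();
    }
    lut
}

/// Reverse the low `bits` bits of `x` (1 <= bits <= 16) with two table lookups.
fn reverse_lut(x: usize, bits: u32, lut: &[u8; 256]) -> usize {
    debug_assert!((1..=16).contains(&bits));
    let lo = lut[x & 0xFF] as usize; // reversed low byte
    let hi = lut[(x >> 8) & 0xFF] as usize; // reversed high byte
    // In the reversed 16-bit word the old low byte becomes the high byte;
    // shifting right then keeps only the `bits` bits we care about.
    ((lo << 8) | hi) >> (16 - bits)
}

fn main() {
    let lut = build_byte_lut();
    let bits = 15; // same size as the cobra/15 benchmark above
    for x in 0..(1usize << bits) {
        assert_eq!(reverse_lut(x, bits, &lut), reverse_direct(x, bits));
    }
    println!("LUT and direct reversal agree for all {bits}-bit indices");
}
```

Which of the two is faster depends on the CPU and on compiler flags, which is consistent with the conflicting measurements in this thread.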

mfreeborn avatar Nov 15 '25 16:11 mfreeborn

Tip: to make percentages make sense, you can run

`cargo bench --bench=bit_reversal -- --save-baseline=main` followed by `cargo bench --bench=bit_reversal -- --baseline=main`, and it will calculate percentages relative to the baseline saved by the first command.

Shnatsel avatar Nov 15 '25 16:11 Shnatsel

Ah, that's useful. Criterion is one of those tools which I severely underuse. If I ever read the docs, I could probably figure out how to group the with- and without-LUT variants into a single benchmark for much easier direct comparison.

mfreeborn avatar Nov 15 '25 16:11 mfreeborn

I don't think there's a one-size-fits-all solution. If we want to reap these gains, we'll need to copy FFTW's design and measure the performance of various implementations at runtime, then select the fastest one.

Shnatsel avatar Nov 23 '25 11:11 Shnatsel

I've looked into COBRA some more and it's highly hardware-dependent: https://github.com/QuState/PhastFT/issues/49

We really do just need to start going down the FFTW route: measure the different variants in the planner and pick the best one for the hardware we're running on.

It would be great to have your LUT-based version as one of the options.
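
To make the idea concrete, here is a minimal sketch of a measuring planner, assuming each variant can be expressed as a plain function with the same signature. The `Kernel` type, the two placeholder kernels and the `plan` function are all hypothetical. They are not PhastFT's or FFTW's API, and the kernels are simple bit-reversal permutations rather than the real COBRA implementations.

```rust
use std::time::{Duration, Instant};

/// Signature shared by every candidate routine (hypothetical).
type Kernel = fn(&mut [f64], u32);

/// Variant A: compute each reversed index with the `reverse_bits` intrinsic.
fn bit_rev_intrinsic(buf: &mut [f64], log_n: u32) {
    if log_n == 0 {
        return;
    }
    for i in 0..buf.len() {
        let j = i.reverse_bits() >> (usize::BITS - log_n);
        if i < j {
            buf.swap(i, j);
        }
    }
}

/// Variant B: precompute a table of reversed indices, then swap.
fn bit_rev_table(buf: &mut [f64], log_n: u32) {
    let n = buf.len();
    let mut table = vec![0usize; n];
    for i in 1..n {
        table[i] = (table[i >> 1] >> 1) | ((i & 1) << (log_n - 1));
    }
    for i in 0..n {
        let j = table[i];
        if i < j {
            buf.swap(i, j);
        }
    }
}

/// Time `reps` runs of one kernel on a scratch buffer of 2^log_n elements.
fn time_kernel(kernel: Kernel, log_n: u32, reps: u32) -> Duration {
    let mut buf: Vec<f64> = (0..1usize << log_n).map(|i| i as f64).collect();
    let start = Instant::now();
    for _ in 0..reps {
        kernel(&mut buf, log_n);
        std::hint::black_box(&mut buf); // keep the work from being optimised away
    }
    start.elapsed()
}

/// Measure every candidate on this machine and return the fastest one.
fn plan(candidates: &[(&'static str, Kernel)], log_n: u32) -> (&'static str, Kernel) {
    candidates
        .iter()
        .copied()
        .min_by_key(|&(_, kernel)| time_kernel(kernel, log_n, 20))
        .expect("at least one candidate is required")
}

fn main() {
    let candidates = [
        ("intrinsic", bit_rev_intrinsic as Kernel),
        ("table", bit_rev_table as Kernel),
    ];
    let (name, kernel) = plan(&candidates, 15);
    let mut data: Vec<f64> = (0..1usize << 15).map(|i| i as f64).collect();
    kernel(&mut data, 15);
    println!("selected variant for n = 2^15: {name}");
}
```

A real planner would presumably cache the decision per size and use more careful timing than 20 back-to-back runs, but the shape stays the same: run every variant on the target machine and keep the fastest.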

Shnatsel avatar Nov 23 '25 14:11 Shnatsel