ckb-vm

perf: Optimize memory prefetch strategy by replacing prefetcht2 with prefetchnta

Open quake opened this issue 8 months ago • 3 comments

The prefetchnta instruction is better suited for our trace data access pattern because:

  • Trace data is accessed only once during asm execution
  • A non-temporal prefetch reduces cache pollution because it does not displace more frequently used data (e.g. instructions_cache)
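To make the difference between the two hints concrete, here is a minimal Rust sketch of a read-once loop that issues a non-temporal prefetch. It is not the ckb-vm assembly: the function name `sum_streaming` and the prefetch distance `PREFETCH_AHEAD` are hypothetical, and the loop prefetches on every iteration purely for simplicity. It only illustrates that `_MM_HINT_NTA` maps to `prefetchnta`, while `_MM_HINT_T2` would map to `prefetcht2`.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_NTA};

/// How many elements ahead of the current read to prefetch (illustrative value only).
const PREFETCH_AHEAD: usize = 16;

/// Sum a buffer that is read exactly once, hinting upcoming elements with a
/// non-temporal prefetch so they displace as little cached data as possible.
pub fn sum_streaming(data: &[u64]) -> u64 {
    let mut total = 0u64;
    for (i, value) in data.iter().enumerate() {
        #[cfg(target_arch = "x86_64")]
        {
            // wrapping_add keeps the pointer arithmetic defined even past the end
            // of the slice; a prefetch is only a hint and does not fault.
            let ahead = data.as_ptr().wrapping_add(i + PREFETCH_AHEAD) as *const i8;
            // Emits `prefetchnta`; using _MM_HINT_T2 here would emit `prefetcht2`.
            unsafe { _mm_prefetch::<_MM_HINT_NTA>(ahead) };
        }
        total = total.wrapping_add(*value);
    }
    total
}
```

On non-x86_64 targets the prefetch is compiled out and the function is a plain sum, so the hint never changes the result, only (possibly) the cache behaviour.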

Running the benchmark multiple times shows measurable improvements on two different x86 CPUs (low and medium spec):

interpret secp256k1_bench via assembly
                        time:   [4.6501 ms 4.6595 ms 4.6743 ms]
                        change: [-1.8326% -1.6108% -1.3108%] (p = 0.00 < 0.05)
                        Performance has improved.
interpret secp256k1_bench via assembly
                        time:   [3.4878 ms 3.4889 ms 3.4901 ms]
                        change: [-2.1179% -1.9138% -1.7849%] (p = 0.00 < 0.05)
                        Performance has improved.

quake, Mar 10 '25 11:03

On my Intel i9-14900K, develop branch:

$ rm Cargo.lock; cargo bench
     Running benches/bits_benchmark.rs (target/release/deps/bits_benchmark-2198b6531a9750c2)
Gnuplot not found, using plotters backend
roundup via remainder   time:   [0.0000 ps 0.0000 ps 0.0000 ps]
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

roundup via bit ops     time:   [0.0000 ps 0.0000 ps 0.0000 ps]
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

roundup via multication time:   [0.0000 ps 0.0000 ps 0.0000 ps]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

roundup via remainder #2
                        time:   [0.0000 ps 0.0000 ps 0.0000 ps]
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

roundup via bit ops #2  time:   [0.0000 ps 0.0000 ps 0.0000 ps]
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

roundup via multication #2
                        time:   [0.0000 ps 0.0000 ps 0.0000 ps]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

     Running benches/vm_benchmark.rs (target/release/deps/vm_benchmark-fbea187d4c738a4c)
Gnuplot not found, using plotters backend
interpret secp256k1_bench
                        time:   [6.0670 ms 6.0788 ms 6.0925 ms]
Found 20 outliers among 100 measurements (20.00%)
  8 (8.00%) high mild
  12 (12.00%) high severe

This PR:

     Running benches/bits_benchmark.rs (target/release/deps/bits_benchmark-2198b6531a9750c2)
Gnuplot not found, using plotters backend
roundup via remainder   time:   [0.0000 ps 0.0000 ps 0.0000 ps]
                        change: [-47.895% -3.1511% +79.757%] (p = 0.92 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

roundup via bit ops     time:   [0.0000 ps 0.0000 ps 0.0000 ps]
                        change: [-77.445% -53.415% +14.119%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

roundup via multication time:   [0.0000 ps 0.0000 ps 0.0000 ps]
                        change: [-46.332% +2.2388% +92.464%] (p = 0.95 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

roundup via remainder #2
                        time:   [0.0000 ps 0.0000 ps 0.0000 ps]
                        change: [-46.500% -0.8975% +87.882%] (p = 0.98 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

roundup via bit ops #2  time:   [0.0000 ps 0.0000 ps 0.0000 ps]
                        change: [-42.725% +14.514% +133.60%] (p = 0.75 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

roundup via multication #2
                        time:   [0.0000 ps 0.0000 ps 0.0000 ps]
                        change: [-48.268% +0.2155% +91.318%] (p = 0.99 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

     Running benches/vm_benchmark.rs (target/release/deps/vm_benchmark-fbea187d4c738a4c)
Gnuplot not found, using plotters backend
interpret secp256k1_bench
                        time:   [6.0170 ms 6.0249 ms 6.0342 ms]
                        change: [-1.1492% -0.8870% -0.6528%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) high mild
  13 (13.00%) high severe


eval-exec, Mar 11 '25 03:03

Executing rm Cargo.lock; cargo bench "interpret secp256k1_bench via assembly" --features asm

On develop:

     Running benches/bits_benchmark.rs (target/release/deps/bits_benchmark-d81b136bca03814f)
Gnuplot not found, using plotters backend
     Running benches/vm_benchmark.rs (target/release/deps/vm_benchmark-64854b411dd08e91)
Gnuplot not found, using plotters backend
Benchmarking interpret secp256k1_bench via assembly: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.2s, enable flat sampling, or reduce sample count to 50.
interpret secp256k1_bench via assembly
                        time:   [1.6080 ms 1.6105 ms 1.6133 ms]
                        change: [-0.1044% +0.1182% +0.3405%] (p = 0.29 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

Benchmarking interpret secp256k1_bench via assembly mop: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
interpret secp256k1_bench via assembly mop
                        time:   [1.5962 ms 1.5991 ms 1.6025 ms]
                        change: [+0.0404% +0.2952% +0.5713%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

Benchmarking interpret secp256k1_bench via assembly mop (memoized decoder): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.8s, enable flat sampling, or reduce sample count to 60.
Benchmarking interpret secp256k1_bench via assembly mop (memoized decoder): Collecting 100 samples in estimated 6.8175 s (
interpret secp256k1_bench via assembly mop (memoized decoder)
                        time:   [1.3446 ms 1.3472 ms 1.3502 ms]
                        change: [-0.8764% -0.2432% +0.3953%] (p = 0.49 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

Benchmarking interpret secp256k1_bench via assembly mop (memoized dynamic length decoder): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.5s, enable flat sampling, or reduce sample count to 60.
Benchmarking interpret secp256k1_bench via assembly mop (memoized dynamic length decoder): Collecting 100 samples in estim
interpret secp256k1_bench via assembly mop (memoized dynamic length decoder)
                        time:   [1.0916 ms 1.0938 ms 1.0966 ms]
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe


This PR:

     Running benches/bits_benchmark.rs (target/release/deps/bits_benchmark-d81b136bca03814f)
Gnuplot not found, using plotters backend
     Running benches/vm_benchmark.rs (target/release/deps/vm_benchmark-64854b411dd08e91)
Gnuplot not found, using plotters backend
Benchmarking interpret secp256k1_bench via assembly: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.0s, enable flat sampling, or reduce sample count to 50.
interpret secp256k1_bench via assembly
                        time:   [1.5726 ms 1.5750 ms 1.5777 ms]
                        change: [-2.4700% -2.2590% -2.0520%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high severe

Benchmarking interpret secp256k1_bench via assembly mop: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
interpret secp256k1_bench via assembly mop
                        time:   [1.5896 ms 1.5922 ms 1.5951 ms]
                        change: [-0.9286% -0.6165% -0.3233%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) high mild
  9 (9.00%) high severe

Benchmarking interpret secp256k1_bench via assembly mop (memoized decoder): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.8s, enable flat sampling, or reduce sample count to 60.
Benchmarking interpret secp256k1_bench via assembly mop (memoized decoder): Collecting 100 samples in estimated 6.7910 s (
interpret secp256k1_bench via assembly mop (memoized decoder)
                        time:   [1.3417 ms 1.3441 ms 1.3469 ms]
                        change: [-0.8756% -0.2123% +0.4776%] (p = 0.55 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

Benchmarking interpret secp256k1_bench via assembly mop (memoized dynamic length decoder): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.6s, enable flat sampling, or reduce sample count to 60.
Benchmarking interpret secp256k1_bench via assembly mop (memoized dynamic length decoder): Collecting 100 samples in estim
interpret secp256k1_bench via assembly mop (memoized dynamic length decoder)
                        time:   [1.1151 ms 1.1174 ms 1.1202 ms]
                        change: [+1.2550% +2.1229% +3.0103%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe
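As a sanity check on the figures above, the relative change can be recomputed directly from the point estimates (the middle value of each `time:` triple). This is a small sketch with a hypothetical helper name; Criterion derives its `change` interval from its own statistical comparison, so the numbers need not match exactly.

```rust
/// Relative change of `new` versus `base`, in percent (negative means faster).
pub fn relative_change_percent(base: f64, new: f64) -> f64 {
    (new - base) / base * 100.0
}
```

For "via assembly", `relative_change_percent(1.6105, 1.5750)` is about -2.20 %, and for the memoized dynamic length decoder, `relative_change_percent(1.0938, 1.1174)` is about +2.16 %, both close to the reported -2.2590 % and +2.1229 %.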

eval-exec, Mar 11 '25 03:03

I created a bash script to run cargo bench "interpret secp256k1_bench via assembly" --features asm 21 times:

#!/usr/bin/env bash
# Benchmark develop and quake/prefetchnta back to back, 21 rounds in total.
set -e

for i in {0..20}; do
    echo "git checkout to develop"
    git checkout develop
    cargo bench "interpret secp256k1_bench via assembly" --features asm
    echo "git checkout to quake/prefetchnta"
    git checkout quake/prefetchnta
    cargo bench "interpret secp256k1_bench via assembly" --features asm
done

The bench result log file: bench.log
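With 21 rounds per branch, the individual Criterion reports are easier to compare once averaged. The following is a rough Rust sketch of how such a log could be summarized. It assumes the log is the captured stdout of the script, so that the `git checkout to <branch>` echo lines and Criterion's `time:   [low mid high]` lines both appear in it. The function names are hypothetical and the actual bench.log has not been checked against this parser.

```rust
use std::collections::BTreeMap;

/// Parse the middle (point) estimate of a Criterion `time:` line, in milliseconds.
pub fn parse_point_estimate_ms(line: &str) -> Option<f64> {
    let rest = line.split("time:").nth(1)?;
    let inner = rest.trim().strip_prefix('[')?.strip_suffix(']')?;
    // Expected tokens: low, unit, mid, unit, high, unit.
    let tokens: Vec<&str> = inner.split_whitespace().collect();
    if tokens.len() != 6 {
        return None;
    }
    let value: f64 = tokens[2].parse().ok()?;
    let to_ms = match tokens[3] {
        "ps" => 1e-9,
        "ns" => 1e-6,
        "µs" | "us" => 1e-3,
        "ms" => 1.0,
        "s" => 1e3,
        _ => return None,
    };
    Some(value * to_ms)
}

/// Average the point estimates per (branch, benchmark name) pair.
/// The branch comes from the script's `git checkout to <branch>` echo lines.
pub fn mean_ms_by_branch_and_bench(log: &str) -> BTreeMap<(String, String), f64> {
    let mut branch = String::from("unknown");
    let mut prev_line = "";
    let mut sums: BTreeMap<(String, String), (f64, u32)> = BTreeMap::new();
    for line in log.lines() {
        if let Some(name) = line.strip_prefix("git checkout to ") {
            branch = name.trim().to_string();
        } else if let Some(ms) = parse_point_estimate_ms(line) {
            // Short names share the line with `time:`; long names sit on the line before.
            let inline = line.split("time:").next().unwrap_or("").trim();
            let bench = if inline.is_empty() { prev_line.trim() } else { inline };
            let entry = sums
                .entry((branch.clone(), bench.to_string()))
                .or_insert((0.0, 0));
            entry.0 += ms;
            entry.1 += 1;
        }
        prev_line = line;
    }
    sums.into_iter()
        .map(|(key, (sum, n))| (key, sum / f64::from(n)))
        .collect()
}
```

Reading the file with `std::fs::read_to_string` and passing the contents to `mean_ms_by_branch_and_bench` would give one mean per branch and benchmark, which can then be compared pairwise.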

eval-exec, Mar 11 '25 04:03