
added vectorization to generate_n

Open • Johan511 opened this pull request 1 year ago • 13 comments

hpx::generate_n calls std::generate_n; in parallel mode it splits the work into chunks and calls generate_n on each chunk. Previously no execution policy was specified for the inner std::generate_n call (so it defaulted to sequential execution). This PR changes that and selects seq or unseq based on the hpx::execution policy specified by the user.
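A rough sketch of the dispatch idea described above (the helper name and the bool parameter are illustrative only; the real change lives in HPX's detail/generate.hpp and uses HPX's own policy traits and loop machinery):

```cpp
// Illustrative sketch only, not the actual HPX code: the per-chunk kernel
// forwards to an unsequenced std::generate_n when the user asked for an
// unsequenced HPX policy, and to the plain sequential overload otherwise.
#include <algorithm>
#include <execution>
#include <utility>

template <bool IsUnsequenced, typename Iter, typename Size, typename F>
Iter chunk_generate_n(Iter first, Size count, F f)
{
    if constexpr (IsUnsequenced)
    {
        // user requested unseq/par_unseq: allow the inner loop to vectorize
        return std::generate_n(std::execution::unseq, first, count, std::move(f));
    }
    else
    {
        // previous behaviour: the sequential default
        return std::generate_n(first, count, std::move(f));
    }
}
```

HPX itself selects the branch from the execution policy type rather than a bool template parameter; the bool only keeps the sketch short.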

par_unseq: scale

par: add

Johan511 · Mar 29 '23

@Johan511 could you create graphs that use the same y-axis limits, please?

hkaiser · Mar 31 '23

The code you proposed touches on sequential operations only. Could you measure the sequential speedup as well?

hkaiser · Mar 31 '23

The change works for par_unseq too, since the parallel version of generate_n works by splitting the range into chunks and calling the sequential generate on each chunk. I will post speedups for unseq soon.
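As a conceptual illustration of that chunking, here is a standard-library stand-in (std::async with a fixed chunk size, which is not how HPX's partitioners actually work; they use HPX tasks, executors, and adaptive chunking):

```cpp
// Conceptual stand-in only: the point is that each chunk runs the
// sequential (or unsequenced) kernel independently.
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

template <typename RandomIt, typename F>
void generate_n_chunked(RandomIt first, std::size_t count, F f, std::size_t chunk_size)
{
    std::vector<std::future<void>> tasks;
    for (std::size_t pos = 0; pos < count; pos += chunk_size)
    {
        std::size_t len = std::min(chunk_size, count - pos);
        // each task gets its own copy of the generator (assumes f is
        // stateless or tolerant of being copied per chunk)
        tasks.push_back(std::async(std::launch::async,
            [=] { std::generate_n(first + pos, len, f); }));
    }
    for (auto& t : tasks)
        t.get();
}
```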

Johan511 · Apr 01 '23

Discarded runs that took more than 0.4 ms.

unseq: mean 0.15 (scale)

seq: 0.2 (add)

Johan511 · Apr 08 '23

Please note that the performance gains are actually not very significant. The reason is that std::generate_n already has very similar performance when compiled with the -O3 flag. Google Benchmark results are attached.

Without -O3 flag

Benchmark             Time          CPU           Iterations
BM_gen_n_par          6889328 ns    6888372 ns    76
BM_gen_n_par_unseq    2665548 ns    2665325 ns    257

With -O3 flag

Benchmark             Time          CPU           Iterations
BM_gen_n_par          124210 ns     124210 ns     4038
BM_gen_n_par_unseq    159027 ns     159020 ns     4949
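For context, a harness along these lines could produce numbers like the above; the benchmark name matches the output, but the problem size, value type, and generator here are assumptions, and the HPX runtime setup (e.g. hpx::start or hpx_main) is omitted:

```cpp
// Hedged reconstruction of the benchmark shape only; not the exact code
// whose numbers are shown above.  Requires an active HPX runtime.
#include <benchmark/benchmark.h>
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <vector>

static void BM_gen_n_par_unseq(benchmark::State& state)
{
    std::vector<double> v(1 << 20);    // problem size is a guess
    for (auto _ : state)
    {
        hpx::generate_n(hpx::execution::par_unseq, v.begin(), v.size(),
            [] { return 3.14; });
        benchmark::DoNotOptimize(v.data());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_gen_n_par_unseq);
```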

Johan511 · Apr 09 '23

> Please note that the performance gains are actually not very significant. The reason is that std::generate_n already has very similar performance when compiled with the -O3 flag. Google Benchmark results are attached.
>
> Without -O3 flag
>
> Benchmark             Time          CPU           Iterations
> BM_gen_n_par          6889328 ns    6888372 ns    76
> BM_gen_n_par_unseq    2665548 ns    2665325 ns    257
>
> With -O3 flag
>
> Benchmark             Time          CPU           Iterations
> BM_gen_n_par          124210 ns     124210 ns     4038
> BM_gen_n_par_unseq    159027 ns     159020 ns     4949

You should always enable all optimizations for performance measurements.

hkaiser · Apr 09 '23

The -O3 flag seems to try to vectorize most loops. Should I try compiling HPX with the -O2 flag and compare the performance of the vectorized vs. non-vectorized versions?

Often the gains from explicit vectorization seem to be minimal because -O3 already vectorizes the loops.

Johan511 · Apr 09 '23

@Johan511 could you please rebase this onto master, now that the release is out?

hkaiser · May 03 '23

Performance test report

HPX Performance

Comparison

BENCHMARK    FORK_JOIN_EXECUTOR    PARALLEL_EXECUTOR    SCHEDULER_EXECUTOR
For Each     -                     ??                   -

Info

Property        Before                                       After
HPX Datetime    2023-05-10T12:07:53+00:00                    2023-05-16T21:41:46+00:00
HPX Commit      dcb541576898d370113946ba15fb58c20c8325b2     3f932501103139cd7d5eded79ea744448062f6da
Clustername     rostam                                       rostam
Datetime        2023-05-10T14:50:18.616050-05:00             2023-05-16T17:00:01.775607-05:00
Compiler        /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1     /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1
Hostname        medusa08.rostam.cct.lsu.edu                  medusa08.rostam.cct.lsu.edu
Envfile

Comparison

BENCHMARK                                               NO-EXECUTOR
Future Overhead - Create Thread Hierarchical - Latch    -

Info

Property        Before                                       After
HPX Datetime    2023-05-10T12:07:53+00:00                    2023-05-16T21:41:46+00:00
HPX Commit      dcb541576898d370113946ba15fb58c20c8325b2     3f932501103139cd7d5eded79ea744448062f6da
Clustername     rostam                                       rostam
Datetime        2023-05-10T14:52:35.047119-05:00             2023-05-16T17:02:24.950778-05:00
Compiler        /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1     /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1
Hostname        medusa08.rostam.cct.lsu.edu                  medusa08.rostam.cct.lsu.edu
Envfile

Comparison

BENCHMARK                   FORK_JOIN_EXECUTOR_DEFAULT_FORK_JOIN_POLICY_ALLOCATOR    PARALLEL_EXECUTOR_DEFAULT_PARALLEL_POLICY_ALLOCATOR    SCHEDULER_EXECUTOR_DEFAULT_SCHEDULER_EXECUTOR_ALLOCATOR
Stream Benchmark - Add      (=)                                                      (=)                                                    (=)
Stream Benchmark - Scale    (=)                                                      (=)                                                    (=)
Stream Benchmark - Triad    (=)                                                      (=)                                                    (=)
Stream Benchmark - Copy     (=)                                                      (=)                                                    (=)

Info

Property        Before                                       After
HPX Datetime    2023-05-10T12:07:53+00:00                    2023-05-16T21:41:46+00:00
HPX Commit      dcb541576898d370113946ba15fb58c20c8325b2     3f932501103139cd7d5eded79ea744448062f6da
Clustername     rostam                                       rostam
Datetime        2023-05-10T14:52:52.237641-05:00             2023-05-16T17:02:44.582158-05:00
Compiler        /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1     /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1
Hostname        medusa08.rostam.cct.lsu.edu                  medusa08.rostam.cct.lsu.edu
Envfile

Explanation of Symbols

Symbol     Meaning
=          No performance change (confidence interval within ±1%)
(=)        Probably no performance change (confidence interval within ±2%)
(+)/(-)    Very small performance improvement/degradation (≤1%)
+/-        Small performance improvement/degradation (≤5%)
++/--      Large performance improvement/degradation (≤10%)
+++/---    Very large performance improvement/degradation (>10%)
?          Probably no change, but quite large uncertainty (confidence interval within ±5%)
??         Unclear result, very large uncertainty (±10%)
???        Something unexpected…

StellarBot · May 16 '23

inspect was reporting:

/libs/core/algorithms/include/hpx/parallel/algorithms/detail/generate.hpp

*I* missing #include (type_traits) for symbol std::true_type on line 57

Please rebase one more time to pull in all changes from master.

hkaiser · Jul 23 '23

@hkaiser I have rebased and added unit tests to ensure everything works with the generate_n algorithm under the unseq and par_unseq execution policies.
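To give an idea of the shape of such a test, here is a hedged sketch (the actual test files added in this PR are not reproduced here; it assumes it is invoked from an HPX test driver, e.g. via hpx_main, so the runtime is available):

```cpp
// Sketch only: fills a vector via generate_n with the unseq policy and
// verifies every element was assigned.  Uses a stateless generator, which
// is safe to invoke from an unsequenced loop.
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <hpx/modules/testing.hpp>

#include <algorithm>
#include <vector>

void test_generate_n_unseq()
{
    std::vector<int> v(10007, 0);

    hpx::generate_n(hpx::execution::unseq, v.begin(), v.size(), [] { return 42; });

    HPX_TEST(std::all_of(v.begin(), v.end(), [](int x) { return x == 42; }));
}
```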

As for performance, both seq and unseq generate almost the same assembly in Release mode (which uses -O3). Because the compiler is able to vectorize the loops in seq mode as well, there seem to be no extra gains from using unseq. However, compared with -O3 -fno-tree-vectorize, which disables auto-vectorization, there is a 3-5x speedup.

srinivasyadav18 · Oct 21 '23

retest lsu

hkaiser · Oct 23 '23

> As for performance, both seq and unseq generate almost the same assembly in Release mode (which uses -O3). Because the compiler is able to vectorize the loops in seq mode as well, there seem to be no extra gains from using unseq. However, compared with -O3 -fno-tree-vectorize, which disables auto-vectorization, there is a 3-5x speedup.

Can we construct test cases where the compiler is not able to vectorize things on its own?

hkaiser · Oct 23 '23