
added vectorization to generate_n

Open • Johan511 opened this pull request 1 year ago • 13 comments

hpx::generate_n calls std::generate_n; in parallel mode it splits the work into chunks and calls generate_n on each chunk. Previously no execution policy was specified for the inner std::generate_n call (so it defaulted to sequential execution). This PR changes that and selects seq or unseq based on the hpx::execution policy specified by the user.
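A rough sketch of the dispatch idea described above (the helper name and the bool parameter are illustrative only; the real change lives in HPX's detail/generate.hpp and uses HPX's own policy traits and loop machinery):

```cpp
// Illustrative sketch only, not the actual HPX code: the per-chunk kernel
// forwards to an unsequenced std::generate_n when the user asked for an
// unsequenced HPX policy, and to the plain sequential overload otherwise.
#include <algorithm>
#include <execution>
#include <utility>

template <bool IsUnsequenced, typename Iter, typename Size, typename F>
Iter chunk_generate_n(Iter first, Size count, F f)
{
    if constexpr (IsUnsequenced)
    {
        // user requested unseq/par_unseq: allow the inner loop to vectorize
        return std::generate_n(std::execution::unseq, first, count, std::move(f));
    }
    else
    {
        // previous behaviour: the sequential default
        return std::generate_n(first, count, std::move(f));
    }
}
```

HPX itself selects the branch from the execution policy type rather than a bool template parameter; the bool only keeps the sketch short.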

par_unseq: scale

par: add

Johan511 · Mar 29 '23

@Johan511 could you create graphs that use the same y-axis limits, please?

hkaiser · Mar 31 '23

The code you proposed touches on sequential operations only. Could you measure the sequential speedup as well?

hkaiser · Mar 31 '23

The change works for par_unseq too, since the parallel version of generate_n works by splitting the range into chunks and calling the sequential generate on each chunk. I will post speedups for unseq soon.
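As a conceptual illustration of that chunking, here is a standard-library stand-in (std::async with a fixed chunk size, which is not how HPX's partitioners actually work; they use HPX tasks, executors, and adaptive chunking):

```cpp
// Conceptual stand-in only: the point is that each chunk runs the
// sequential (or unsequenced) kernel independently.
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

template <typename RandomIt, typename F>
void generate_n_chunked(RandomIt first, std::size_t count, F f, std::size_t chunk_size)
{
    std::vector<std::future<void>> tasks;
    for (std::size_t pos = 0; pos < count; pos += chunk_size)
    {
        std::size_t len = std::min(chunk_size, count - pos);
        // each task gets its own copy of the generator (assumes f is
        // stateless or tolerant of being copied per chunk)
        tasks.push_back(std::async(std::launch::async,
            [=] { std::generate_n(first + pos, len, f); }));
    }
    for (auto& t : tasks)
        t.get();
}
```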

Johan511 · Apr 01 '23

Discarded runs that took more than 0.4 ms.

unseq: mean 0.15 (scale)

seq: 0.2 (add)

Johan511 · Apr 08 '23

Please note that the performance gains are actually not very significant. The reason is that std::generate_n already has very similar performance when compiled with the -O3 flag. Google Benchmark results are attached.

Without -O3 flag

Benchmark             Time          CPU           Iterations
BM_gen_n_par          6889328 ns    6888372 ns    76
BM_gen_n_par_unseq    2665548 ns    2665325 ns    257

With -O3 flag

Benchmark             Time          CPU           Iterations
BM_gen_n_par          124210 ns     124210 ns     4038
BM_gen_n_par_unseq    159027 ns     159020 ns     4949
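For context, a harness along these lines could produce numbers like the above; the benchmark name matches the output, but the problem size, value type, and generator here are assumptions, and the HPX runtime setup (e.g. hpx::start or hpx_main) is omitted:

```cpp
// Hedged reconstruction of the benchmark shape only; not the exact code
// whose numbers are shown above.  Requires an active HPX runtime.
#include <benchmark/benchmark.h>
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <vector>

static void BM_gen_n_par_unseq(benchmark::State& state)
{
    std::vector<double> v(1 << 20);    // problem size is a guess
    for (auto _ : state)
    {
        hpx::generate_n(hpx::execution::par_unseq, v.begin(), v.size(),
            [] { return 3.14; });
        benchmark::DoNotOptimize(v.data());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_gen_n_par_unseq);
```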

Johan511 · Apr 09 '23

> Please note that the performance gains are actually not very significant. The reason is that std::generate_n already has very similar performance when compiled with the -O3 flag. Google Benchmark results are attached.
>
> Without -O3 flag
>
> Benchmark             Time          CPU           Iterations
> BM_gen_n_par          6889328 ns    6888372 ns    76
> BM_gen_n_par_unseq    2665548 ns    2665325 ns    257
>
> With -O3 flag
>
> Benchmark             Time          CPU           Iterations
> BM_gen_n_par          124210 ns     124210 ns     4038
> BM_gen_n_par_unseq    159027 ns     159020 ns     4949

You should always enable all optimizations for performance measurements.

hkaiser · Apr 09 '23

The -O3 flag seems to try to vectorize most loops. Should I try compiling HPX with the -O2 flag and compare the performance of the vectorized vs. non-vectorized versions?

Often the gains from explicit vectorization seem to be minimal because -O3 already vectorizes the loops.

Johan511 · Apr 09 '23

@Johan511 could you please rebase this onto master, now that the release is out?

hkaiser · May 03 '23

Performance test report

HPX Performance

Comparison

BENCHMARK    FORK_JOIN_EXECUTOR    PARALLEL_EXECUTOR    SCHEDULER_EXECUTOR
For Each     -                     ??                   -

Info

Property        Before                                       After
HPX Datetime    2023-05-10T12:07:53+00:00                    2023-05-16T21:41:46+00:00
HPX Commit      dcb541576898d370113946ba15fb58c20c8325b2     3f932501103139cd7d5eded79ea744448062f6da
Clustername     rostam                                       rostam
Datetime        2023-05-10T14:50:18.616050-05:00             2023-05-16T17:00:01.775607-05:00
Compiler        /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1     /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1
Hostname        medusa08.rostam.cct.lsu.edu                  medusa08.rostam.cct.lsu.edu
Envfile

Comparison

BENCHMARK                                               NO-EXECUTOR
Future Overhead - Create Thread Hierarchical - Latch    -

Info

Property        Before                                       After
HPX Datetime    2023-05-10T12:07:53+00:00                    2023-05-16T21:41:46+00:00
HPX Commit      dcb541576898d370113946ba15fb58c20c8325b2     3f932501103139cd7d5eded79ea744448062f6da
Clustername     rostam                                       rostam
Datetime        2023-05-10T14:52:35.047119-05:00             2023-05-16T17:02:24.950778-05:00
Compiler        /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1     /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1
Hostname        medusa08.rostam.cct.lsu.edu                  medusa08.rostam.cct.lsu.edu
Envfile

Comparison

BENCHMARK                   FORK_JOIN_EXECUTOR_DEFAULT_FORK_JOIN_POLICY_ALLOCATOR    PARALLEL_EXECUTOR_DEFAULT_PARALLEL_POLICY_ALLOCATOR    SCHEDULER_EXECUTOR_DEFAULT_SCHEDULER_EXECUTOR_ALLOCATOR
Stream Benchmark - Add      (=)                                                      (=)                                                    (=)
Stream Benchmark - Scale    (=)                                                      (=)                                                    (=)
Stream Benchmark - Triad    (=)                                                      (=)                                                    (=)
Stream Benchmark - Copy     (=)                                                      (=)                                                    (=)

Info

Property        Before                                       After
HPX Datetime    2023-05-10T12:07:53+00:00                    2023-05-16T21:41:46+00:00
HPX Commit      dcb541576898d370113946ba15fb58c20c8325b2     3f932501103139cd7d5eded79ea744448062f6da
Clustername     rostam                                       rostam
Datetime        2023-05-10T14:52:52.237641-05:00             2023-05-16T17:02:44.582158-05:00
Compiler        /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1     /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1
Hostname        medusa08.rostam.cct.lsu.edu                  medusa08.rostam.cct.lsu.edu
Envfile

Explanation of Symbols

Symbol     Meaning
=          No performance change (confidence interval within ±1%)
(=)        Probably no performance change (confidence interval within ±2%)
(+)/(-)    Very small performance improvement/degradation (≤1%)
+/-        Small performance improvement/degradation (≤5%)
++/--      Large performance improvement/degradation (≤10%)
+++/---    Very large performance improvement/degradation (>10%)
?          Probably no change, but quite large uncertainty (confidence interval within ±5%)
??         Unclear result, very large uncertainty (±10%)
???        Something unexpected…

StellarBot · May 16 '23

inspect was reporting:

/libs/core/algorithms/include/hpx/parallel/algorithms/detail/generate.hpp

*I* missing #include (type_traits) for symbol std::true_type on line 57

Please rebase one more time to pull in all changes from master.

hkaiser · Jul 23 '23

@hkaiser I have rebased and added unit tests to ensure everything works with the generate_n algorithm under the unseq and par_unseq execution policies.
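To give an idea of the shape of such a test, here is a hedged sketch (the actual test files added in this PR are not reproduced here; it assumes it is invoked from an HPX test driver, e.g. via hpx_main, so the runtime is available):

```cpp
// Sketch only: fills a vector via generate_n with the unseq policy and
// verifies every element was assigned.  Uses a stateless generator, which
// is safe to invoke from an unsequenced loop.
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <hpx/modules/testing.hpp>

#include <algorithm>
#include <vector>

void test_generate_n_unseq()
{
    std::vector<int> v(10007, 0);

    hpx::generate_n(hpx::execution::unseq, v.begin(), v.size(), [] { return 42; });

    HPX_TEST(std::all_of(v.begin(), v.end(), [](int x) { return x == 42; }));
}
```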

As for performance, both seq and unseq generate almost the same assembly in Release mode (which uses -O3). Because the compiler is able to vectorize the loops in seq mode as well, there seem to be no extra gains from using unseq. However, compared with -O3 -fno-tree-vectorize, which disables auto-vectorization, there is a 3-5x speedup.

srinivasyadav18 · Oct 21 '23

retest lsu

hkaiser · Oct 23 '23

> As for performance, both seq and unseq generate almost the same assembly in Release mode (which uses -O3). Because the compiler is able to vectorize the loops in seq mode as well, there seem to be no extra gains from using unseq. However, compared with -O3 -fno-tree-vectorize, which disables auto-vectorization, there is a 3-5x speedup.

Can we construct test cases where the compiler is not able to vectorize things on its own?

hkaiser · Oct 23 '23