hpx
added vectorization to generate_n
hpx::generate_n calls std::generate_n; in parallel mode it splits the work into chunks and calls generate_n on each chunk. Previously no execution policy was passed to std::generate_n (so it defaulted to seq). This PR changes that: seq or unseq is now passed, depending on the hpx::execution policy specified by the user.
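The chunking and per-chunk dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the HPX implementation: `seq_tag`, `unseq_tag`, `generate_chunk` and `chunked_generate_n` are hypothetical names, the chunks are processed one after another instead of in parallel, and the vectorization pragmas are only hints that a compiler may ignore.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tags standing in for the sequential and unsequenced policies.
struct seq_tag {};
struct unseq_tag {};

// Sequential chunk: defer to std::generate_n without any hint.
template <typename Iter, typename F>
Iter generate_chunk(seq_tag, Iter first, std::size_t count, F f)
{
    return std::generate_n(first, count, f);
}

// Unsequenced chunk: a plain loop carrying a vectorization hint.
// Assumes a random-access iterator and a generator without side effects.
template <typename Iter, typename F>
Iter generate_chunk(unseq_tag, Iter first, std::size_t count, F f)
{
#if defined(__clang__)
#pragma clang loop vectorize(enable)
#elif defined(__GNUC__)
#pragma GCC ivdep
#endif
    for (std::size_t i = 0; i < count; ++i)
        first[i] = f();
    return first + count;
}

// Split [first, first + count) into chunks and fill each chunk using the
// loop flavour selected by the policy tag. A parallel version would hand
// each chunk to a worker; here the chunks run in order for simplicity.
template <typename Policy, typename Iter, typename F>
Iter chunked_generate_n(Policy policy, Iter first, std::size_t count,
    std::size_t chunk_size, F f)
{
    assert(chunk_size > 0);
    while (count > 0)
    {
        std::size_t const n = std::min(count, chunk_size);
        first = generate_chunk(policy, first, n, f);
        count -= n;
    }
    return first;
}
```

Calling `chunked_generate_n(unseq_tag{}, v.begin(), v.size(), 3, [] { return 7; })` fills `v` with 7 in chunks of at most three elements and returns `v.end()`.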
par_unseq:
par:
@Johan511 could you create graphs that use the same y-axis limits, please?
The code you proposed touches on sequential operations only. Could you measure the sequential speedup as well?
The change works for par_unseq too, since the parallel version of generate_n works by calling the sequential generate on chunks. Will post speedups for unseq soon.
Discarded runs that took more than 0.4 ms.
unseq :
mean : 0.15
seq : 0.2
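For reference, discarding slow runs before averaging could look like the sketch below. The name `mean_below_cutoff` and the sample values in the usage note are illustrative assumptions, not the script or data used for the numbers above.

```cpp
#include <cstddef>
#include <vector>

// Result of averaging the samples that survived the cutoff.
struct filtered_mean
{
    std::size_t kept;    // number of samples at or below the cutoff
    double mean;         // mean of the kept samples, 0.0 if none were kept
};

// Drop every sample above cutoff_ms and average the rest.
inline filtered_mean mean_below_cutoff(
    std::vector<double> const& samples_ms, double cutoff_ms)
{
    double sum = 0.0;
    std::size_t kept = 0;
    for (double s : samples_ms)
    {
        if (s <= cutoff_ms)
        {
            sum += s;
            ++kept;
        }
    }
    return filtered_mean{kept, kept == 0 ? 0.0 : sum / kept};
}
```

With the made-up samples `{0.25, 0.5, 0.125}` and a cutoff of 0.4, two samples are kept and their mean is 0.1875.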
Please note that the performance gains are not very significant. The gains are minimal because std::generate_n performs very similarly when compiled with the -O3 flag. Google Benchmark results are attached.
Without -O3 flag
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_gen_n_par | 6889328 ns | 6888372 ns | 76 |
BM_gen_n_par_unseq | 2665548 ns | 2665325 ns | 257 |
With -O3 flag
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_gen_n_par | 124210 ns | 124210 ns | 4038 |
BM_gen_n_par_unseq | 159027 ns | 159020 ns | 4949 |
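A comparison of this kind can be approximated without Google Benchmark using a small std::chrono harness such as the one below. It is a stand-in sketch, not the benchmark attached to this PR: `best_time_ns`, `fill_generate_n` and `fill_plain_loop` are hypothetical names, and the timings depend entirely on the compiler and flags (for example -O3 versus -O3 -fno-tree-vectorize).

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Run `fill` on `data` `repeats` times and return the fastest run in ns.
// Returns -1 if repeats is not positive.
template <typename Fill>
long long best_time_ns(Fill fill, std::vector<int>& data, int repeats)
{
    long long best = -1;
    for (int r = 0; r < repeats; ++r)
    {
        auto const start = std::chrono::steady_clock::now();
        fill(data);
        auto const stop = std::chrono::steady_clock::now();
        long long const ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
                .count();
        if (best < 0 || ns < best)
            best = ns;
    }
    return best;
}

// Variant A: std::generate_n, vectorized or not at the optimizer's discretion.
inline void fill_generate_n(std::vector<int>& v)
{
    std::generate_n(v.begin(), v.size(), [] { return 42; });
}

// Variant B: hand-written indexed loop over the same range.
inline void fill_plain_loop(std::vector<int>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = 42;
}
```

Taking the fastest of several runs is one way to reduce noise; a real benchmark framework additionally guards against the optimizer removing the measured work.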
You should always enable all optimizations for performance measurements.
The -O3 flag seems to try to vectorize most loops. Should I try compiling HPX with the -O2 flag and compare the performance of vectorized vs. non-vectorized code?
Often the performance gains from vectorization seem to be minimal, as -O3 already seems to vectorize the loops.
@Johan511 could you please rebase this onto master, now that the release is out?
Performance test report
HPX Performance
Comparison
BENCHMARK | FORK_JOIN_EXECUTOR | PARALLEL_EXECUTOR | SCHEDULER_EXECUTOR |
---|---|---|---|
For Each | - | ?? | - |
Info
Property | Before | After |
---|---|---|
HPX Datetime | 2023-05-10T12:07:53+00:00 | 2023-05-16T21:41:46+00:00 |
HPX Commit | dcb541576898d370113946ba15fb58c20c8325b2 | 3f932501103139cd7d5eded79ea744448062f6da |
Clustername | rostam | rostam |
Datetime | 2023-05-10T14:50:18.616050-05:00 | 2023-05-16T17:00:01.775607-05:00 |
Compiler | /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1 | /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1 |
Hostname | medusa08.rostam.cct.lsu.edu | medusa08.rostam.cct.lsu.edu |
Envfile |
Comparison
BENCHMARK | NO-EXECUTOR |
---|---|
Future Overhead - Create Thread Hierarchical - Latch | - |
Info
Property | Before | After |
---|---|---|
HPX Datetime | 2023-05-10T12:07:53+00:00 | 2023-05-16T21:41:46+00:00 |
HPX Commit | dcb541576898d370113946ba15fb58c20c8325b2 | 3f932501103139cd7d5eded79ea744448062f6da |
Clustername | rostam | rostam |
Datetime | 2023-05-10T14:52:35.047119-05:00 | 2023-05-16T17:02:24.950778-05:00 |
Compiler | /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1 | /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1 |
Hostname | medusa08.rostam.cct.lsu.edu | medusa08.rostam.cct.lsu.edu |
Envfile |
Comparison
BENCHMARK | FORK_JOIN_EXECUTOR_DEFAULT_FORK_JOIN_POLICY_ALLOCATOR | PARALLEL_EXECUTOR_DEFAULT_PARALLEL_POLICY_ALLOCATOR | SCHEDULER_EXECUTOR_DEFAULT_SCHEDULER_EXECUTOR_ALLOCATOR |
---|---|---|---|
Stream Benchmark - Add | (=) | (=) | (=) |
Stream Benchmark - Scale | (=) | (=) | (=) |
Stream Benchmark - Triad | (=) | (=) | (=) |
Stream Benchmark - Copy | (=) | (=) | (=) |
Info
Property | Before | After |
---|---|---|
HPX Datetime | 2023-05-10T12:07:53+00:00 | 2023-05-16T21:41:46+00:00 |
HPX Commit | dcb541576898d370113946ba15fb58c20c8325b2 | 3f932501103139cd7d5eded79ea744448062f6da |
Clustername | rostam | rostam |
Datetime | 2023-05-10T14:52:52.237641-05:00 | 2023-05-16T17:02:44.582158-05:00 |
Compiler | /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1 | /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1 |
Hostname | medusa08.rostam.cct.lsu.edu | medusa08.rostam.cct.lsu.edu |
Envfile |
Explanation of Symbols
Symbol | MEANING |
---|---|
= | No performance change (confidence interval within ±1%) |
(=) | Probably no performance change (confidence interval within ±2%) |
(+)/(-) | Very small performance improvement/degradation (≤1%) |
+/- | Small performance improvement/degradation (≤5%) |
++/-- | Large performance improvement/degradation (≤10%) |
+++/--- | Very large performance improvement/degradation (>10%) |
? | Probably no change, but quite large uncertainty (confidence interval within ±5%) |
?? | Unclear result, very large uncertainty (±10%) |
??? | Something unexpected… |
inspect was reporting:
/libs/core/algorithms/include/hpx/parallel/algorithms/detail/generate.hpp
*I* missing #include (type_traits) for symbol std::true_type on line 57
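For context, the kind of code that triggers this report is a tag dispatch on std::true_type / std::false_type without including the header that defines them. A minimal sketch with the include in place follows; `seq_policy`, `unseq_policy`, `is_unsequenced` and `loop_kind` are hypothetical names, not the contents of generate.hpp.

```cpp
#include <string>
#include <type_traits>    // defines std::true_type and std::false_type

// Hypothetical policy types used only for this illustration.
struct seq_policy {};
struct unseq_policy {};

// Trait: does the policy allow unsequenced (vectorized) execution?
template <typename Policy>
struct is_unsequenced : std::false_type {};

template <>
struct is_unsequenced<unseq_policy> : std::true_type {};

// Overloads selected at compile time by the trait's type.
inline std::string pick_loop(std::true_type) { return "unseq"; }
inline std::string pick_loop(std::false_type) { return "seq"; }

// Dispatch to the matching overload for the given policy.
template <typename Policy>
std::string loop_kind(Policy)
{
    return pick_loop(typename is_unsequenced<Policy>::type{});
}
```

The header may compile without the explicit include when another header happens to pull in `<type_traits>` transitively, which is why a tool check is useful here.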
Please rebase one more time to pull in all changes from master.
@hkaiser I have rebased and added unit tests to ensure that the generate_n algorithm works with the unseq and par_unseq execution policies.
As for performance, seq and unseq generate almost the same assembly in Release mode (which uses -O3). Because the compiler is able to vectorize the loops in seq mode as well, there seem to be no extra gains from using unseq.
However, when compared against a build with -fno-tree-vectorize -O3,
which disables auto-vectorization, there is a 3-5x speedup.
retest lsu
Can we construct test cases where the compiler is not able to vectorize things on its own?