misc: Add MPI0 logging level
This logging level reduces the output of MPI runs so that performance numbers are emitted only once per MPI rank. It mainly aims to cut down the redundant, verbose performance output.
e.g. instead of:

```
Allocating host memory for u(2, 52, 72, 72) [2 MB]
Allocating host memory for u(2, 52, 72, 72) [2 MB]
Operator `Kernel` generated in 1.72 s
  * lowering.IET: 0.71 s (41.5 %)
     * specializing.IET: 0.66 s (38.6 %)
        * make_mpi: 0.45 s (26.3 %)
  * lowering.Expressions: 0.56 s (32.8 %)
  * lowering.Clusters: 0.41 s (24.0 %)
Flops reduction after symbolic optimization: [209 --> 90]
Operator `Kernel` generated in 1.81 s
  * lowering.IET: 0.72 s (40.0 %)
     * specializing.IET: 0.66 s (36.6 %)
        * make_mpi: 0.45 s (25.0 %)
  * lowering.Expressions: 0.58 s (32.2 %)
  * lowering.Clusters: 0.48 s (26.7 %)
Flops reduction after symbolic optimization: [209 --> 90]
Operator `Kernel` fetched `/tmp/devito-jitcache-uid1000/562b2d8ee0a6d254108e1dfb5522a645f2d8d8e2.c` in 0.06 s from jit-cache
Operator `Kernel` fetched `/tmp/devito-jitcache-uid1000/562b2d8ee0a6d254108e1dfb5522a645f2d8d8e2.c` in 0.06 s from jit-cache
Operator `Kernel` ran in 0.01 s
Operator `Kernel` ran in 0.01 s
Global performance: [OI=2.81, 11.16 GFlops/s, 0.14 GPts/s]
Global performance: [OI=2.81, 11.16 GFlops/s, 0.14 GPts/s]
Local performance:
Local performance:
  * section0[rank0]<21,2,5,8,8,40> ran in 0.01 s [OI=2.81, 10.95 GFlops/s, 0.14 GPts/s]
  * section0[rank0]<21,2,5,8,8,40> ran in 0.01 s [OI=2.81, 10.95 GFlops/s, 0.14 GPts/s]
  * section0[rank1]<21,2,5,8,8,40> ran in 0.01 s [OI=2.81, 10.95 GFlops/s, 0.14 GPts/s]
Performance[mode=advanced] arguments: {'x0_blk0_size': 8, 'y0_blk0_size': 8}
  * section0[rank1]<21,2,5,8,8,40> ran in 0.01 s [OI=2.81, 10.95 GFlops/s, 0.14 GPts/s]
Performance[mode=advanced] arguments: {'x0_blk0_size': 8, 'y0_blk0_size': 8}
Norm u is 2325.5583
Norm u is 2325.5583
```
we get:

```
Allocating host memory for u(2, 52, 72, 72) [2 MB]
Allocating host memory for u(2, 52, 72, 72) [2 MB]
Operator `Kernel` generated in 1.81 s
  * lowering.IET: 0.73 s (40.6 %)
     * specializing.IET: 0.67 s (37.3 %)
        * make_mpi: 0.46 s (25.6 %)
  * lowering.Expressions: 0.54 s (30.0 %)
  * lowering.Clusters: 0.51 s (28.4 %)
     * specializing.Clusters: 0.39 s (21.7 %)
Flops reduction after symbolic optimization: [209 --> 90]
Operator `Kernel` generated in 1.80 s
  * lowering.IET: 0.75 s (41.8 %)
     * specializing.IET: 0.69 s (38.4 %)
        * make_mpi: 0.48 s (26.8 %)
  * lowering.Expressions: 0.56 s (31.2 %)
  * lowering.Clusters: 0.46 s (25.6 %)
Flops reduction after symbolic optimization: [209 --> 90]
Operator `Kernel` fetched `/tmp/devito-jitcache-uid1000/562b2d8ee0a6d254108e1dfb5522a645f2d8d8e2.c` in 0.06 s from jit-cache
Operator `Kernel` fetched `/tmp/devito-jitcache-uid1000/562b2d8ee0a6d254108e1dfb5522a645f2d8d8e2.c` in 0.06 s from jit-cache
Operator `Kernel` ran in 0.01 s
Global performance: [OI=2.81, 11.16 GFlops/s, 0.14 GPts/s]
Local performance:
  * section0[rank0]<21,2,5,8,8,40> ran in 0.01 s [OI=2.81, 11.65 GFlops/s, 0.15 GPts/s]
  * section0[rank1]<21,2,5,8,8,40> ran in 0.01 s [OI=2.81, 11.67 GFlops/s, 0.15 GPts/s]
Performance[mode=advanced] arguments: {'x0_blk0_size': 8, 'y0_blk0_size': 8}
Norm u is 2325.5583
Norm u is 2325.5583
```
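For intuition, here is a minimal, self-contained sketch of the underlying idea (illustrative only, not Devito's actual implementation): a standard `logging` filter that lets records through only on rank 0, so redundant per-rank messages are suppressed. The rank is passed in explicitly for simplicity; a real MPI run would obtain it from the communicator (e.g. `mpi4py`'s `COMM_WORLD.rank`).

```python
import logging


class Rank0Filter(logging.Filter):
    """Suppress log records on all MPI ranks except rank 0."""

    def __init__(self, rank):
        super().__init__()
        self.rank = rank

    def filter(self, record):
        # Returning False drops the record, so only rank 0 emits
        return self.rank == 0


def make_perf_logger(rank):
    """Build a logger whose handler stays silent on nonzero ranks."""
    logger = logging.getLogger(f"perf.rank{rank}")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.addFilter(Rank0Filter(rank))
    logger.addHandler(handler)
    return logger
```

With this in place, every rank can call `logger.info(...)` unconditionally and only one copy of each performance line reaches the terminal.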
Codecov Report
Attention: Patch coverage is 93.75% with 1 line in your changes missing coverage. Please review.
Project coverage is 79.46%. Comparing base (8729cd2) to head (e459d52).
| Files | Patch % | Lines |
|---|---|---|
| devito/operator/profiling.py | 0.00% | 0 Missing and 1 partial :warning: |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #2130      +/-   ##
==========================================
+ Coverage   79.44%   79.46%   +0.01%
==========================================
  Files         232      232
  Lines       43607    43618      +11
  Branches     8072     8073       +1
==========================================
+ Hits        34645    34660      +15
+ Misses       8207     8203       -4
  Partials      755      755
```
To add:

```diff
-Until Devito v3.5 included, domain decomposition occurs along the fastest axis. As of later versions, domain decomposition occurs along the slowest axis, for performance reasons. And yes, it is possible to control the domain decomposition in user code, but this is not neatly documented. Take a look at test_custom_topology in this file. In essence, Grid accepts the optional argument topology, which allows the user to pass a custom topology as an n-tuple, where n is the number of distributed dimensions. For example, for a two-dimensional grid, the topology (4, 1) will decompose the slowest axis into four partitions, one partition per MPI rank, while the fastest axis will be replicated over all MPI ranks.
+Until Devito v3.5 included, domain decomposition occurs along the fastest axis. As of later versions, domain decomposition occurs along the slowest axis, for performance reasons. And yes, it is possible to control the domain decomposition in user code, but this is not neatly documented. Take a look at
+class CustomTopology in distributed.py
+and test_custom_topology in this file. In essence, Grid accepts the optional argument topology, which allows the user to pass a custom topology as an n-tuple, where n is the number of distributed dimensions. For example, for a two-dimensional grid, the topology (4, 1) will decompose the slowest axis into four partitions, one partition per MPI rank, while the fastest axis will be replicated over all MPI ranks.
```
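To make the partitioning concrete, here is a standalone sketch of how a topology tuple splits a grid shape across ranks. This is illustrative only and not Devito's code (Devito's decomposition logic lives in `devito/mpi/distributed.py`); the helper name `decompose` is made up for this example.

```python
def decompose(shape, topology):
    """Split an n-D grid `shape` across an MPI `topology`.

    Returns, per dimension, the list of chunk sizes assigned to the
    ranks along that dimension. Illustrative sketch only.
    """
    result = []
    for extent, nparts in zip(shape, topology):
        base, rem = divmod(extent, nparts)
        # Spread any remainder over the first `rem` partitions
        result.append([base + (i < rem) for i in range(nparts)])
    return result


# A (4, 1) topology on a 100x100 grid: the slow axis is split into
# four partitions of 25 points each, while the fast axis is kept
# whole (replicated over the ranks along that dimension).
print(decompose((100, 100), (4, 1)))  # [[25, 25, 25, 25], [100]]
```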
Would this PR help, https://github.com/devitocodes/devito/pull/2175, to avoid changing the user-level setup and instead only apply a temporary `switchconfig`, as per the original plan?
Is this PR fine?
Converting to draft, as apparently `switchconfig` does not work.