P3 performance analysis
This issue documents some basic findings about P3's performance on the CPU and suggests action items for future performance work.
I was curious what the primary cost in the C++ P3 code is. It turns out to be
- https://github.com/E3SM-Project/scream/blob/46ff6b3cdabd0b8e86d1e05ce89f63f5e51ec53b/components/scream/src/physics/p3/p3_rain_sed_impl.hpp#L111
- the equivalent in ice sedimentation.
In particular, while one might guess that the upwind impl could be slow, it is not: `calc_first_order_upwind_step` is < 4% of the total P3 cost. In contrast, the rain and ice fall velocity calculations are very roughly 80% of it.
Possible action items:
- Profile using an Intel tool at the line level, starting with rain sedimentation. (1) Are there a few costly lines, e.g., a slow tgamma impl, or instead (2) is the cost per line fairly uniform over the whole velocity computation?
- If 2, then try a few different modifications to the Mask implementation: different integer sizes for the mask slots; different implementations (e.g. ternary op vs `if`) for the masked ops.
- If there is no big change, profile with pack size 1 to see if that reveals anything.
- Try a pack-free impl, using `scalarize` to produce 1D views of reals as inputs. This is a mask-intensive region of code, and the C++ compiler might not be able to handle it well.
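
To make the variants in the action items concrete, here is a minimal sketch of the three forms being compared: a masked op written with a ternary, the same op written with `if`, and a pack-free loop over a flat 1D array. Everything in it is an assumption for illustration: `Pack`, `Mask`, `kPackSize`, `toy_velocity`, and the `set_*` functions are hypothetical stand-ins, not the project's pack/mask types or its `scalarize` utility, and the toy kernel is not the actual fall velocity formula.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for a pack of reals and its mask (not the real API).
constexpr std::size_t kPackSize = 8;
struct Pack { std::array<double, kPackSize> d; };
struct Mask { std::array<bool, kPackSize> m; };

// Toy per-element kernel standing in for a fall velocity formula.
// It calls tgamma only because the issue names tgamma as a possible hot spot.
inline double toy_velocity(double q) {
  return std::tgamma(1.0 + q) * std::sqrt(q);
}

// Variant A: masked op written with a ternary per slot.
inline void set_ternary(Pack& out, const Mask& mask, const Pack& q) {
  for (std::size_t i = 0; i < kPackSize; ++i)
    out.d[i] = mask.m[i] ? toy_velocity(q.d[i]) : out.d[i];
}

// Variant B: the same masked op written with an if per slot.
inline void set_branch(Pack& out, const Mask& mask, const Pack& q) {
  for (std::size_t i = 0; i < kPackSize; ++i)
    if (mask.m[i]) out.d[i] = toy_velocity(q.d[i]);
}

// Variant C: pack-free loop over a flat 1D array of reals, with the
// mask condition (here a hypothetical threshold) evaluated inline.
inline void set_scalar(std::vector<double>& out,
                       const std::vector<double>& q,
                       double threshold) {
  for (std::size_t k = 0; k < q.size(); ++k)
    if (q[k] > threshold) out[k] = toy_velocity(q[k]);
}
```

All three compute the same values; the question for profiling is only which form the compiler turns into the fastest code for the real kernels.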
@hr203 - this might be a good topic for you!