EAMxx: slowdown when using high-frequency horiz_remapper
We observed a significant slowdown in RRMxx tests related to output writing (machine: dane). Two tests, named O34 and O38, differ only in their yaml files. The SYPD is 0.066 for O34 vs. 0.24 for O38, i.e. O34 is roughly 3.6x slower.
Figures 1 and 2 show the timers for the O34 and O38 tests, respectively. Note that the walltimes need to be scaled for comparison, since O34 is an 8-day test and O38 is a 5-day test. The yamls with notably high walltime are highlighted in red, and the summaries for "write_total", "run_output_streams", and "horiz_remap" are shown in bold. The slowest streams in O34:
- the "Betts" yamls include horiz_remapper (pg2 -> 1x1 highres domain) + high-frequency 2D/3D outputs + conditional sampling
- the "more" yaml includes horiz_remapper (pg2 -> 1x1 highres domain) + high-frequency 2D/3D outputs
- the "coarse" yaml includes horiz_remapper (pg2 -> ne30 global) + 2D/3D outputs
It seems that the combination of horiz_remapper and high-frequency outputs is likely the main reason for the slowdown. Conditional sampling could be another contributing factor, since 1hI more is much faster than 1hI Betts in O38, even though the former includes more 2D and 3D variables.
Also worth noting: the 1hA ("hourly average") Betts output in O34 has a wallclock comparable to the 1step Betts, and both are much slower than the 1hI ("hourly instant") Betts in O38. These Betts yamls share the same variable list and differ only in averaging_type and frequency.
The yaml files can be found here:
- O34: https://portal.nersc.gov/cfs/e3sm/zhang73/yaml/debug/2240x1_ndaysx8_E3SMv1SSP585-UVTQ2d-s20151001-O34Betts/data/
- O38: https://portal.nersc.gov/cfs/e3sm/zhang73/yaml/debug/2240x1_ndaysx5_E3SMv1SSP585-UVTQ2d-s20151001-O38-5minAsite-rad3/data/
@AaronDonahue @bartgol do you have any ideas regarding this behavior? It may not be a high-priority issue, since 1hI output is sufficient for current needs, but it may be worth noting.
=============
Fig 1: O34 timer
Fig 2: O38 timer
The PIOc_write_darray timer is quite large, so it would "seem" that most of the time for those streams is spent in writing. Maybe for the remapped output (1x1 stands for a single column in the output file, yes?) we should not use PIOc_write_darray, but put_var instead, like we do for non-decomposed variables? (A rough sketch of the two write paths is included after this comment.)
@jayeshkrishna may have some thoughts on this.
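For concreteness, here is a minimal sketch of the two write paths being compared, using SCORPIO's C interface. This is not actual EAMxx I/O code: the helper names are made up, and the usual file/variable/decomp setup is assumed to have happened elsewhere.

```c
#include <pio.h>

/* Current path: the remapped field is still treated as a decomposed
 * array, so every write goes through the darray machinery
 * (rearrangement + aggregation), even though only a handful of
 * values end up in the file. */
static int write_as_darray(int ncid, int varid, int ioid,
                           PIO_Offset my_len, double *my_vals)
{
    /* Collective; each rank passes its local chunk (possibly empty). */
    return PIOc_write_darray(ncid, varid, ioid, my_len, my_vals, NULL);
}

/* Suggested alternative: if the variable is effectively not decomposed
 * (e.g. a single column), write it like any other non-decomposed
 * variable and skip the darray path entirely. */
static int write_as_put_var(int ncid, int varid, const double *all_vals)
{
    /* Also a collective call in PIO; assume all_vals has already been
     * gathered/broadcast so it is valid wherever PIO reads it from. */
    return PIOc_put_var_double(ncid, varid, all_vals);
}
```

The potential win is avoiding the rearranger and darray aggregation for a variable that is only a few hundred doubles.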
If the variable is not decomposed across processes, using put_var might be worth trying out (but again, it depends on how large the variable is, etc.). @jsbamboo: can you try increasing the SCORPIO cache buffer size limit first and see if it helps (./xmlchange PIO_BUFFER_SIZE_LIMIT=134217728, i.e. setting the buffer size to 128 MB)?
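(For reference, a sketch of what that knob controls: in ParallelIO the env_run.xml setting corresponds, as far as I understand, to the buffer-size-limit call below, and I'm assuming SCORPIO keeps the same entry point.)

```c
#include <pio.h>

/* Illustrative only: raise the library-level limit on how much darray
 * data is buffered before being flushed to the file.
 * 134217728 bytes = 128 MiB; the call returns the previous limit. */
void bump_scorpio_buffer_limit(void)
{
    PIO_Offset old_limit = PIOc_set_buffer_size_limit(134217728);
    (void)old_limit;
}
```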
@jayeshkrishna If I understand correctly, the variable is relatively small. Probably a few hundred doubles' worth of data.
@jsbamboo btw, clicking on the links in your message takes me to the server page with all the yamls, but clicking on the yamls to download/open them gives me a "Forbidden You don't have permission to access this resource." error.
Thanks for the suggestions! @bartgol yes, 1x1's n_b is 1. Thanks for letting me know about the permission issue - could you please try it again?
@jayeshkrishna the new O34 test with PIO_BUFFER_SIZE_LIMIT=134217728 is a little slower than the original O34: 8.645 vs. 8.018 compute hours (the small difference in speed may be due to variations in the node conditions on dane).
Please see the timers and env_run.xml files for both tests at the NERSC gateway links below: O34 timer, O34 128M timer
@jayeshkrishna to clarify the situation a bit: here we probably have a case with a "decomp" where ALL array entries are on one rank. E.g., assume we output the column closest to a given lat/lon, which gives an array that lives exclusively on one rank. In this case, what do you recommend we do? Should we use PIOc_write_darray, PIOc_put_var, or something else?
Yeah, try put_var for this variable.
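A rough standalone sketch of what that could look like with the raw SCORPIO C calls (this is not the EAMxx output layer; the file name, variable name, and dimension sizes are placeholders, and error checking is omitted):

```c
#include <mpi.h>
#include <pio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One I/O task is plenty for a few hundred doubles. */
    int iosysid;
    PIOc_Init_Intracomm(MPI_COMM_WORLD, 1, 1, 0, PIO_REARR_SUBSET, &iosysid);

    int iotype = PIO_IOTYPE_PNETCDF;
    int ncid;
    PIOc_createfile(iosysid, &ncid, &iotype, "single_col_sketch.nc", PIO_CLOBBER);

    enum { NLEV = 128 };                 /* placeholder vertical size */
    int dim_ncol, dim_lev, dims[2], varid;
    PIOc_def_dim(ncid, "ncol", 1, &dim_ncol);
    PIOc_def_dim(ncid, "lev", NLEV, &dim_lev);
    dims[0] = dim_ncol;
    dims[1] = dim_lev;
    PIOc_def_var(ncid, "T_sampled_col", PIO_DOUBLE, 2, dims, &varid);
    PIOc_enddef(ncid);

    /* Pretend rank 0 owns the sampled column; broadcast it so the
     * collective put_var call sees valid data everywhere (cheap for
     * a few hundred doubles). */
    double col[NLEV];
    if (rank == 0)
        for (int k = 0; k < NLEV; ++k) col[k] = 300.0 - k;
    MPI_Bcast(col, NLEV, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Write the single column as a non-decomposed hyperslab,
     * bypassing PIOc_write_darray entirely. */
    PIO_Offset start[2] = {0, 0};
    PIO_Offset count[2] = {1, NLEV};
    PIOc_put_vara_double(ncid, varid, start, count, col);

    PIOc_closefile(ncid);
    PIOc_finalize(iosysid);
    MPI_Finalize();
    return 0;
}
```

With n_b = 1 the single column is the whole remapped field, so there is nothing left for the darray machinery to do, and the broadcast keeps the collective put_var call safe regardless of which rank PIO actually reads the data from.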