
EAMxx: refactor IO to more heavily rely on Field interfaces

Open bartgol opened this issue 7 months ago • 10 comments

Replaces manual parallel loops with use of Field and Field utility functions.

[BFB]


Summary of major mods:

  • add Max and Min as values for CombineMode enum
  • add shortcut min/max methods to Field (they call update, just like scale does)
  • change how we handle fillVal in combine operations: before, x+=fillVal gave x==fillVal. Now, fillVal on the rhs is always ignored, while the lhs should NOT contain fillVal (see the sketch right after this list)
  • simplified creation of diagnostics, ensuring that they can be evaluated in order
  • simplified handling of avg count: use masked operations to update the count, as well as to set FillValue where the count is too low
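To illustrate the new fillVal semantics, here is a rough scalar sketch; the names mirror CombineMode/combine, but the signature is illustrative, not the actual Field implementation:

```cpp
// Illustrative sketch only: scalar version of the new fillVal handling in a
// combine operation. Max/Min are the newly added CombineMode values; the real
// EAMxx implementation works on Fields/views, not scalars.
#include <algorithm>

enum class CombineMode { Update, Max, Min };

template<typename T>
void combine (CombineMode cm, T& lhs, const T& rhs, const T fill_val) {
  // New behavior: a fill value on the rhs is always ignored,
  // and the lhs is assumed to never hold the fill value.
  if (rhs == fill_val) {
    return;
  }
  switch (cm) {
    case CombineMode::Update: lhs = rhs;                 break;
    case CombineMode::Max:    lhs = std::max(lhs, rhs);  break;
    case CombineMode::Min:    lhs = std::min(lhs, rhs);  break;
  }
}
```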

For now, I marked the PR as BFB, although it's possible it won't be, since the avg count is now a field with IntType data type (instead of Real). But I still think it should be, since a) count is a small integer (so the double representation should have no truncation), and b) my_real / my_int makes my_int undergo int->real promotion before the calculation happens. Still, I'll keep an eye out for non-BFB results.
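As a quick standalone sanity check of point b) (not code from the PR): dividing by the int count and by the explicitly converted double count gives bit-for-bit identical results, since the int operand is promoted to double before the division happens.

```cpp
// Standalone sanity check (not part of the PR): a small integer count divides
// a real sum identically whether it is stored as int or as double, because the
// int operand is promoted to double before the division takes place.
#include <cassert>

int main () {
  double sum   = 7.5;
  int    count = 3;                             // small, exactly representable as double
  double avg_from_int  = sum / count;           // int -> double promotion, then divide
  double avg_from_real = sum / double(count);   // what a Real-typed count would do
  assert (avg_from_int == avg_from_real);       // bit-for-bit identical
  return 0;
}
```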

I've had this branch in the works for months, and decided it's time to get it going. Once this is in, I think it may become easier to handle fields with FillValue in a more agnostic way (that is, IO should not have to check if a field is a FieldAtPressureLevel diag, to toggle m_track_avg_cnt=true).

Fixes #7035 Closes #6965

bartgol avatar May 29 '25 21:05 bartgol

I ran ERS.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.eamxx-prod on master and on this branch, and perf is relatively similar. The branch ran maybe 2% faster, but we're in the noise. Of course, an ne1024 run is probably needed to ensure we're not messing up perf at large scale, but being on par at ne30 is a very good indicator, considering we are not changing any MPI-related code, only the on-device parts.

bartgol avatar May 30 '25 03:05 bartgol

Does this PR also intend to update the two unrelated submodules?

Good catch! I removed those submods from the commit, and force pushed. Should be clean now.

bartgol avatar May 30 '25 03:05 bartgol

The diffs in the baseline_cmp standalone tests appear to be only in the fill pattern of the output fields. I believe that may be expected, but I have to double check. The CIME tests, however, also exhibit some real diffs. I have to dig.

bartgol avatar Jun 03 '25 00:06 bartgol

Quick follow up. The change in fill pattern is caused by how we handle t=0 output:

  • master: for t=0 instant output, some fields may be invalid (e.g., computed fields not in the IC file). For these, we deep copy FillVal into them.
  • branch: we NEVER have IO copy values into a field

The reason for this mod was that IO was inadvertently changing values of fields in the FM (in case of no remap and no padding), which did not seem like good behavior to me. IO should blindly output fields, without caring about what values are inside. It should be the AD that inits the fields to FillValue when it allocates them, or maybe it should be done directly by Field::allocate_view. I think I will open an issue to address this (helpful also to catch uses of uninited fields, which right now are always 0-initialized, hiding this kind of error).

I'm still digging into what's wrong with the ERS test. I also ran a SMS eamxx-prod at ne30 on mappy, to make sure there was no perf drop. The branch was actually a smidge faster, but not by much (maybe 2-3%?). However, the branch job crashed on day 5, which points to something not being right. This PR does change a handful of things outside of the IO folder, which are the only places that could be responsible for the crash, so it should not be too hard to track down.

bartgol avatar Jun 03 '25 22:06 bartgol

@singhbalwinder I'm tagging you as well since, by removing the possibility for AtmosphereInput to read into views, this PR had to also touch quite a bit of code in the mam optics files. It should be a simple review of that stuff though...assuming I did not mess it up...

bartgol avatar Jun 04 '25 01:06 bartgol

@AaronDonahue @mahf708 @tcclevenger The ERS fail was due to the avg count field not being restarted. This also led to cleaning up the scorpio input files, since we were not allowing reading of non-Real fields (and avg count has data type int). I think the cleanup was worth it, and tests should pass now (except for the fill pattern diff, which I discussed above).

Once the tests finish, assuming they pass, would you mind taking another look?

bartgol avatar Jun 04 '25 01:06 bartgol

There are still a few things I want to triple check. E.g., in the CIME case that failed, there are fill pattern changes. That's expected for some vars (uninited ones) for "instant" output, but one of the vars is wind_speed with vert remap and "average" output. This hints at vremap working slightly differently, which may or may not be correct (there's always a chance that things were wrong before the PR). So I will dig a bit more on this.

bartgol avatar Jun 04 '25 04:06 bartgol

One side effect of this PR: on mappy, without any other job running (so no resource contention), the compile time of scorpio_output.cpp has gone up from ~15s to ~120s. That's a bit too much. I suspect we are not taking full advantage of the ETI-ed Field methods. Maybe we should ETI some utils (like the mask utils), so that we build the int/float/double versions separately, rather than in one compilation unit. I'll check.

Edit: it was more like 160s, and ETI-ing the compute_mask utilities only brings it down to ~140s. I am not sure what other template calls are causing the explosion of compile time...

Edit: ah, maybe the max/min utilities. Let me ETI those too.

Edit: ok, ETI-ing the max/min utils as well brings compilation down to ~50s, which is more reasonable. We may need to revise what combination of Field methods/utils we want to ETI, since we now have quite a few field-related ETI files to build...
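For context, the ETI pattern in question looks roughly like this (generic placeholder names, not the actual EAMxx utilities or signatures):

```cpp
// Generic illustration of explicit template instantiation (ETI); names and
// signatures are placeholders, not the actual EAMxx field utilities.

// field_utils.hpp: declare the template and tell other TUs not to instantiate it.
template<typename T>
void compute_mask (T* data, int n, T fill_val);

extern template void compute_mask<int>    (int*,    int, int);
extern template void compute_mask<float>  (float*,  int, float);
extern template void compute_mask<double> (double*, int, double);

// field_utils_eti.cpp: define and instantiate once, so that heavy users like
// scorpio_output.cpp only pay for a function call, not for template instantiation.
template<typename T>
void compute_mask (T* data, int n, T fill_val) {
  for (int i=0; i<n; ++i) {
    if (data[i]==fill_val) {
      data[i] = T(0);   // placeholder body
    }
  }
}

template void compute_mask<int>    (int*,    int, int);
template void compute_mask<float>  (float*,  int, float);
template void compute_mask<double> (double*, int, double);
```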

bartgol avatar Jun 05 '25 23:06 bartgol

Ok, I fixed the fill pattern diffs by correcting how the minimum avg count threshold is handled (master uses count>thresh, but the branch had count>=thresh; the branch now also uses count>thresh).
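In scalar form, the gating logic is roughly the following (illustrative names, not the actual IO code):

```cpp
// Illustrative only: the avg-count threshold gates whether a point gets the
// averaged value or the fill value. Master uses a strict '>' comparison, so the
// branch now does the same ('>=' produced the fill pattern diffs).
#include <cstddef>
#include <vector>

void finalize_avg (std::vector<double>& out, const std::vector<double>& sum,
                   const std::vector<int>& cnt, const int thresh, const double fill_value) {
  for (std::size_t i=0; i<out.size(); ++i) {
    out[i] = (cnt[i] > thresh) ? sum[i]/cnt[i]   // enough valid samples: emit the average
                               : fill_value;     // too few samples: emit the fill value
  }
}
```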

Unfortunately, there is one file in the eamxx-prod CI run where the fill patterns seem completely wrong. I am trying to investigate.

bartgol avatar Jun 06 '25 17:06 bartgol

Ok, tests now pass, except for the baseline comparison for INSTANT output containing fields that are not available at t=0. The diff is because:

  • in master: we copy FillValue into those fields from inside IO
  • in branch: we just output whatever is in the field. For diags, we still fill with FillValue, but fields coming from the AD are left untouched.

I think the new behavior is more correct: it should not be up to IO to set uncomputed fields to FillValue. Instead, it should be done during allocation: inside Field::allocate_view(), also call a deep copy to FillValue.
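A minimal mock of that idea (this is not the EAMxx Field class, and the fill value constant is a placeholder):

```cpp
// Minimal mock illustrating the proposal, NOT the EAMxx Field class: the field
// fills itself with the fill value as soon as its storage is allocated, so IO
// never has to write into uncomputed fields, and uses of uninitialized fields
// become easy to spot.
#include <cstddef>
#include <vector>

struct MockField {
  std::vector<double> data;
  void allocate_view (std::size_t n, double fill_value) {
    data.assign (n, fill_value);   // proposed: allocate AND initialize to the fill value
  }
};

int main () {
  constexpr double fill_value = 1e30;   // placeholder for the real FillValue constant
  MockField f;
  f.allocate_view (10, fill_value);     // an uncomputed field now reads back as fill_value
  return 0;
}
```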

After this PR, I think we should start tackling #6803 . I think using the "extra_data" feature in the FieldHeader may be the most uniform and robust approach.

bartgol avatar Jun 06 '25 21:06 bartgol

Launched ne256 runs for this branch and master on frontier to test perf impact. Will update and merge when I have results.

tcclevenger avatar Jul 09 '25 15:07 tcclevenger

Perf results for case SMS_Ld2.ne256pg2_ne256pg2.F2010-SCREAMv1.aurora_oneapi-ifxgpu.eamxx-prod (Aurora, ne256, 16 nodes): wallmax (s) of the largest EAMxx::IO::... entries in the timer file.

| Timer | Master | This branch |
| --- | --- | --- |
| a:EAMxx::IO::standard | 309.2 | 312.9 |
| a:EAMxx::IO::standard::get_new_file | 296.3 | 302.1 |
| a:EAMxx::IO::eamxx_output.decadal.4hourlyINST_native.yaml | 88.5 | 86.4 |
| a:EAMxx::IO::eamxx_output.decadal.1hourlyAVG_native.yaml | 87.6 | 89.2 |
| a:EAMxx::IO::eamxx_output.decadal.1hourlyMAX_native.yaml | 72.6 | 70.747 |

@bartgol Performance seems roughly equivalent, with the new impl being slightly slower overall. Is this sufficient, or do you think we still need to check ne1024?

tcclevenger avatar Jul 09 '25 21:07 tcclevenger

@tcclevenger Are those timers the result of an AVG over N runs, or are they a single run? If the latter, can we re-run each case a few times?

As for the slowdown, most of the slowdowns are minimal. The largest one, in get_new_file, is odd, since I don't think the branch dramatically changed that phase of IO. For INST output, it does involve a call to run, but the only INST stream you timed seems to be no slower than master. Can you maybe also paste the timers from all the other streams? I wonder if certain output specs (e.g., remap) may be correlated with a slight slowdown.

Overall, I think the timings are fine. But maybe we should also get some data on frontier or pm-gpu.

bartgol avatar Jul 09 '25 21:07 bartgol

Some more timings, including PIO:write_total, which dominates the standard time and is not changed by this PR:

2nd Aurora run for ne256

| Timer | Master | This branch |
| --- | --- | --- |
| a:EAMxx::IO::standard | 268.2 | 262.7 |
| a:PIO:write_total | 264.9 | 260.4 |

@bartgol Same case run on PMGPU, 32 nodes:

| Timer | Master | This branch |
| --- | --- | --- |
| a:EAMxx::IO::standard | 142.1 | 152.6 |
| a:PIO:write_total | 138.3 | 148.9 |

So there is a high amount of variability per run, and the time is dominated by write_total, which is not affected by this PR.

tcclevenger avatar Jul 10 '25 20:07 tcclevenger