Oceananigans.jl icon indicating copy to clipboard operation
Oceananigans.jl copied to clipboard

Small improvements to NaNChecker

Open glwagner opened this issue 10 months ago • 14 comments

Inspired by a discussion with @navidcy and @taimoorsohail, this PR implements two small improvements to the NaNChecker:

  1. Preallocate an object to store the result of the reduction which computes whether a field has NaNs or not, which saves some memory allocation (can be important if trying to diagnose NaNs every time-step
  2. Return the result of NaN checking from the callback function. Useful if we don't want to add NaNChecker directly as a callback but rather call it within another callback.

I may add a few more improvements here, because more ideas are coming to mind...

glwagner avatar May 04 '25 16:05 glwagner

bunch of tests fail, e.g.,

https://buildkite.com/clima/oceananigans/builds/23183#01969c41-533f-4867-bc39-028df30cd953/6-37

Is it because the fallback

NaNChecker(fields) = NaNChecker(fields, false) # default

was removed? Or perhaps because the fallback has an arg and now it became kwarg?

navidcy avatar May 05 '25 06:05 navidcy

I think because somebody is using that constructor. I think its better to have just one constructor so I will try to fix the errors

glwagner avatar May 05 '25 12:05 glwagner

@taimoorsohail, should work now and test will (hopefully) pass

navidcy avatar May 08 '25 12:05 navidcy

Tests are still failing - having a look now

taimoorsohail avatar May 09 '25 01:05 taimoorsohail

I haven't been able to figure out why these tests are failing - other than that they don't seem to be related to the NaN checker changes as far as I can tell.

taimoorsohail avatar May 09 '25 02:05 taimoorsohail

This

https://buildkite.com/clima/oceananigans/builds/23277#0196b2e0-8d99-4d3d-a91d-1a4e8c0275e9/18-1245

might seem unrelated but following the error stack trace it does originate from the nanchecker

https://buildkite.com/clima/oceananigans/builds/23277#0196b2e0-8d99-4d3d-a91d-1a4e8c0275e9/18-1273

I'll have a look now.

navidcy avatar May 09 '25 05:05 navidcy

@glwagner, seems like hasnan breaks for MultiRegion (probably also Distributed?) fields? Here's a mwe:

using Oceananigans

grid = RectilinearGrid(CPU(), size = (64, 3, 1), x = (0, 100e3), y = (0, 100e3), z = (-400, 0))

regions = 2
mrg = MultiRegionGrid(grid, partition = XPartition(regions))

model = HydrostaticFreeSurfaceModel(grid = mrg)

simulation = Simulation(model; Δt=0.1, stop_iteration = 10)
run!(simulation)
julia> run!(simulation)
[2025/05/09 12:12:48.874] INFO  Initializing simulation...
ERROR: MethodError: no method matching interior(::MultiRegionObject{…}, ::Tuple{…}, ::MultiRegionGrid{…}, ::MultiRegionObject{…})

Closest candidates are:
  interior(::Field, ::Any...)
   @ Oceananigans ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Fields/field.jl:407
  interior(::FieldTimeSeries, ::Any...)
   @ Oceananigans ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/OutputReaders/field_time_series.jl:724
  interior(::OffsetArray, ::Any, ::Any, ::Any)
   @ Oceananigans ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Fields/field.jl:406
  ...

Stacktrace:
  [1] interior(f::Field{…})
    @ Oceananigans.Fields ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Fields/field.jl:405
  [2] any!(f::Function, r::Field{…}, a::Field{…}; condition::Nothing, mask::Bool, kwargs::@Kwargs{})
    @ Oceananigans.Fields ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Fields/field.jl:682
  [3] any!(f::Function, r::Field{…}, a::Field{…})
    @ Oceananigans.Fields ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Fields/field.jl:673
  [4] hasnan(field::Field{…}, checker::Oceananigans.Models.NaNChecker{…})
    @ Oceananigans.Models ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Models/nan_checker.jl:40
  [5] (::Oceananigans.Models.NaNChecker{…})(simulation::Simulation{…})
    @ Oceananigans.Models ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Models/nan_checker.jl:49
  [6] (::Callback{…})(sim::Simulation{…})
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/callback.jl:15
  [7] initialize!(sim::Simulation{…})
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl:235
  [8] time_step!(sim::Simulation{…})
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl:136
  [9] run!(sim::Simulation{…}; pickup::Bool)
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl:105
 [10] run!(sim::Simulation{…})
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl:92
 [11] top-level scope
    @ REPL[75]:1
Some type information was truncated. Use `show(err)` to see complete types.

navidcy avatar May 09 '25 09:05 navidcy

Wondering if we need @apply_regionally at

https://github.com/CliMA/Oceananigans.jl/blob/9454ebc7637e96f152c1aa0410875c1449c333b1/src/Models/nan_checker.jl#L40

navidcy avatar May 09 '25 09:05 navidcy

Actually on second thought I'm wondering how would this work on distributed architectures. Do we need to be calling any! eg on each MPI rank and then combine?

navidcy avatar May 11 '25 06:05 navidcy

Wondering if we need @apply_regionally at

https://github.com/CliMA/Oceananigans.jl/blob/9454ebc7637e96f152c1aa0410875c1449c333b1/src/Models/nan_checker.jl#L40

shouldn't that work under the hood? same with distributed. otherwise users cannot reduce either without @apply_regionally

glwagner avatar May 20 '25 04:05 glwagner

I think it should work after https://github.com/CliMA/Oceananigans.jl/pull/4497

navidcy avatar May 20 '25 04:05 navidcy

Is this ready to merge once tests pass? now that #4520 is merged?

taimoorsohail avatar May 22 '25 01:05 taimoorsohail

No. First #4497 needs to get in

navidcy avatar May 22 '25 01:05 navidcy

Thanks!

taimoorsohail avatar May 22 '25 02:05 taimoorsohail