ClimaAtmos.jl icon indicating copy to clipboard operation
ClimaAtmos.jl copied to clipboard

Several weekly GPU longruns are broken

Open szy21 opened this issue 10 months ago • 2 comments

In the most recent build on 4/19, two simulations are broken.

longrun_aquaplanet_rhoe_equil_55km_nz63_clearsky_0M_earth_sleve. cc @akshaysridhar . Changing SST to a function of surface height results in behavior changes in this job.

longrun_aquaplanet_clearsky_1M . cc @trontrytel . I don't know why this job has behavior changes.

szy21 avatar Apr 20 '24 18:04 szy21

SSP baroclinic wave also seems consistenly broken: https://buildkite.com/clima/climaatmos-longruns/builds/1572#018f6607-605b-447a-b511-1111a4869090

Sbozzolo avatar May 15 '24 17:05 Sbozzolo

SSP baroclinic wave also seems consistenly broken: https://buildkite.com/clima/climaatmos-longruns/builds/1572#018f6607-605b-447a-b511-1111a4869090

Yeah, we have been ignoring this one. Hope we don't need it anymore after SSPKnoth:)

szy21 avatar May 15 '24 21:05 szy21

Would it make sense to look at all of those cases after SSPKnoth lands?

trontrytel avatar May 28 '24 16:05 trontrytel

Would it make sense to look at all of those cases after SSPKnoth lands?

Yes, I just tagged you to let you know there is an open issue. Also make sure they don't break diagnostic EDMF:)

szy21 avatar May 28 '24 18:05 szy21

We will revisit this after SSPKnoth; I'm adding some fixes to a bycolumn op (in the use of column_mapreduce) in our cached vars to see if I can fix the 2 breaking cases as of Friday's run (these were both working as of April 12)

akshaysridhar avatar May 28 '24 20:05 akshaysridhar

I found that PR https://github.com/CliMA/ClimaAtmos.jl/pull/2855 breaks one run that works otherwise with SSPKnoth

Sbozzolo avatar Jun 05 '24 02:06 Sbozzolo

I found that PR #2855 breaks one run that works otherwise with SSPKnoth

Interesting, so that PR breaks two of the working runs, although I see no reason that it should cause physical changes. What is the job that failed?

szy21 avatar Jun 05 '24 03:06 szy21

I found that PR #2855 breaks one run that works otherwise with SSPKnoth

Interesting, so that PR breaks two of the working runs, although I see no reason that it should cause physical changes. What is the job that failed?

https://buildkite.com/clima/climaatmos-ci/builds/19073#018fe547-837c-4eab-a70f-2691b203a9fa

Sbozzolo avatar Jun 05 '24 13:06 Sbozzolo

Could someone take another look at the PR https://github.com/CliMA/ClimaAtmos.jl/pull/2855 to see if the change in MSE is only from the change in the order of computation? Maybe @juliasloan25?

szy21 avatar Jun 05 '24 14:06 szy21

I am looking at it right now because it is blocking my work on Rosenbrock.

Sbozzolo avatar Jun 05 '24 14:06 Sbozzolo

I took one integration step (for the job that is failing for me) and compared before and after the patch. The results are very different:

julia> working.integrator.p.precomputed.sfc_conditions.ρ_flux_uₕ
ClimaCore.Geometry.AxisTensor{Float32, 2, Tuple{ClimaCore.Geometry.CovariantAxis{(3,)}, ClimaCore.Geometry.CovariantAxis{(1, 2)}}, StaticArraysCore.SMatrix{1, 2, Float32, 2}}-valued Field:
  components: 
    data: 
      1: Float32[-319.072, -378.13, -442.548, -463.268, -337.013, -383.945, -441.563, -477.315, -352.969, -396.443  …  -384.247, -446.425, -296.626, -348.484, -427.235, -464.666, -319.072, -378.13, -442.548, -463.268]
      2: Float32[-28.7554, -25.6094, -20.8332, -12.6396, -32.5333, -30.2907, -28.3866, -19.9347, -42.1406, -43.148  …  -392.587, -418.496, -344.003, -365.153, -395.624, -399.4, -347.827, -377.507, -388.069, -371.485]

julia> greg.integrator.p.precomputed.sfc_conditions.ρ_flux_uₕ
ClimaCore.Geometry.AxisTensor{Float32, 2, Tuple{ClimaCore.Geometry.CovariantAxis{(3,)}, ClimaCore.Geometry.CovariantAxis{(1, 2)}}, StaticArraysCore.SMatrix{1, 2, Float32, 2}}-valued Field:
  components: 
    data: 
      1: Float32[-695.62, -742.976, -759.328, -734.481, -681.863, -710.018, -727.113, -730.004, -653.286, -684.591  …  -596.337, -668.212, -443.825, -503.773, -595.453, -632.766, -444.601, -514.036, -581.144, -592.805]
      2: Float32[347.81, 345.716, 315.054, 284.463, 306.303, 295.782, 269.1, 252.041, 241.099, 233.439  …  -180.496, -189.359, -199.373, -209.864, -220.374, -218.469, -222.301, -239.188, -241.123, -229.592]

Sbozzolo avatar Jun 05 '24 15:06 Sbozzolo

I suggest that we revert it, unless @akshaysridhar or @glwagner understand why there is a substantial change in surface fluxes from that PR?

szy21 avatar Jun 05 '24 15:06 szy21

@Sbozzolo Could you check to see if ρ_flux_h_tot and ρ_flux_q_tot are very different as well?

szy21 avatar Jun 05 '24 15:06 szy21

Steps to reproduce:

First, checkout commit before/after change, e.g.

git checkout b51150ee4 

Then

import ClimaAtmos as CA
import SciMLBase
working = CA.get_simulation(CA.AtmosConfig("config/mpi_configs/mpi_sphere_aquaplanet_rhoe_equilmoist_clearsky.yml"))
SciMLBase.step!(working.integrator)
println(working.integrator.p.precomputed.sfc_conditions.ρ_flux_uₕ)

Sbozzolo avatar Jun 05 '24 15:06 Sbozzolo

@Sbozzolo Could you check to see if ρ_flux_h_tot and ρ_flux_q_tot are very different as well?

After the first step, they are the same.

Sbozzolo avatar Jun 05 '24 15:06 Sbozzolo

I suggest that we revert it, unless @akshaysridhar or @glwagner understand why there is a substantial change in surface fluxes from that PR?

(Maybe @dennisYatunin too)

Sbozzolo avatar Jun 05 '24 15:06 Sbozzolo

wait I think there is a bug in that PR. https://github.com/CliMA/ClimaAtmos.jl/blob/bb9c010f90ccf6acc9f5e3cd0480a6e273be764d/src/surface_conditions/surface_conditions.jl#L421 here is should be f = C12(f₁₃ * xz + f₂₃ * yz, L)?

szy21 avatar Jun 05 '24 15:06 szy21

@Sbozzolo could you try if that fixes the issue? (And we should somehow prevent this from happening in the future...)

szy21 avatar Jun 05 '24 15:06 szy21

wait I think there is a bug in that PR.

https://github.com/CliMA/ClimaAtmos.jl/blob/bb9c010f90ccf6acc9f5e3cd0480a6e273be764d/src/surface_conditions/surface_conditions.jl#L421 here is should be f = C12(f₁₃ * xz + f₂₃ * yz, L)?

Yes, this is a bug

akshaysridhar avatar Jun 05 '24 15:06 akshaysridhar

Will try this now.

Sbozzolo avatar Jun 05 '24 15:06 Sbozzolo

Yes, that works!

Sbozzolo avatar Jun 05 '24 15:06 Sbozzolo

Opening a PR now

Sbozzolo avatar Jun 05 '24 15:06 Sbozzolo

The longrun experiments have been cleaned up and most of these are irrelevant now. Closing the issue for now.

szy21 avatar Jul 28 '24 22:07 szy21