Mesh file for sparse grid for the NUOPC coupler
We need a mesh file that can be used with the NUOPC coupler for the sparse grid.
Here's a sample case for the MCT coupler:
/glade/work/oleson/PPE.n11_ctsm5.1.dev030/cime/scripts/ctsm51c8BGC_PPEn11ctsm51d030_2deg_GSWP3V1_Sparse400_Control_2000
@ekluzek in case it's relevant: I wrote instructions in this Google Doc for creating a high-res sparse grid mesh file for #1773. Search for "mesh" in the doc to find the relevant section.
Very briefly:
- To start, we need a file containing 1D or 2D latitude and longitude variables for the grid of interest (see the quick check after this list). If such a file exists, I would be happy to try to generate the mesh file.
- If the lat/lon file also includes a mask of the sparse grid, we would then run mesh_mask_modifier to get that mask into the mesh file.
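For a quick check of whether a candidate file is usable, something like the following works in matlab. This is only a sketch: the file name is a placeholder and the variable names lat/lon are assumptions that may differ in the actual file.
ncdisp('file_with_lat_lon.nc')                   % list all variables and dimensions
lat = ncread('file_with_lat_lon.nc','lat');      % variable may instead be LATIXY, yc, ...
lon = ncread('file_with_lat_lon.nc','lon');      % or LONGXY, xc, ...
size(lat), size(lon)                             % n x 1 (1D) or nlon x nlat (2D)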
Using the new mesh_mask_modifier tool, I was able to get a mesh file from the domain file. The mesh file for the atm forcing is different, though: it's not a modified 2D grid, it's a simple list of 400 points. So I need to create a SCRIP grid file that describes that list of points from the domain file and then convert it into ESMF mesh format.
So this is what you need to do:
1. Generate scrip.nc from <file_with_lat_lon_2d>.nc, where the latter I think is the domain file that you mentioned:
ncks --rgr infer --rgr scrip=scrip.nc <file_with_lat_lon_2d>.nc foo.nc
(where foo.nc contains metadata and will not be used)
2. Generate the mesh file from scrip.nc:
module load esmf
ESMF_Scrip2Unstruct scrip.nc lnd_mesh.nc 0
(This mesh file's mask = 1 everywhere; see the sanity check after this list.)
3. At this point I suspect that you need to update this mesh file's mask using the mesh_mask_modifier tool to distinguish between land and ocean.
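A sanity check for the "mask = 1 everywhere" statement, as a sketch in matlab; it assumes the standard ESMF unstructured-mesh variable name elementMask:
emask = ncread('lnd_mesh.nc','elementMask');   % one mask value per mesh element
all(emask(:) == 1)                             % expect 1 (true) before running mesh_mask_modifier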
Awesome, thanks @slevisconsulting, the above helped me to get mesh files created. I got everything set up Friday, but when I run the case it's failing. So I need to debug what's happening and get a working case. The mesh files I created are in...
/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17
Hopefully, the crash I'm seeing is something simple I can figure out.
The crash has to do with the connectivity on the forcing grid, which is just a list of the 400 points. The suggestion from ESMF is to make each forcing grid point's vertices just a tiny box around the cell center (a sketch of this follows below).
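A minimal sketch of that suggestion in matlab, assuming the 400 cell centers come from the xc/yc variables of the dense400 domain file and using an arbitrary half-width eps_deg (not necessarily the exact offset Erik used):
% Build SCRIP-style corner arrays: each point gets its own tiny quadrilateral,
% so that no two cells share vertices.
lon_c = ncread('domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','xc');  % 400 centers
lat_c = ncread('domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','yc');
lon_c = lon_c(:); lat_c = lat_c(:);
eps_deg = 0.01;   % small half-width in degrees (an assumption)
corner_lon = [lon_c-eps_deg, lon_c+eps_deg, lon_c+eps_deg, lon_c-eps_deg];  % 400 x 4, counter-clockwise
corner_lat = [lat_c-eps_deg, lat_c-eps_deg, lat_c+eps_deg, lat_c+eps_deg];
% These would become grid_corner_lon/grid_corner_lat in a SCRIP file with
% grid_size = 400 and grid_corners = 4, alongside grid_center_lon/lat and grid_imask = 1.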
Because of the time it's taking to do this project I also plan to bring this work as a user-mod to master and add a test for it.
We talked about this at the standup this morning. An idea I got there was to try it with the new land mesh, but without the atm forcing mesh. I tried that and it works. So there's something going on with the new forcing that only has the 400 points. This is something I did suspect.
OK, I got a case to work! I couldn't use ncks to make the SCRIP grid file, as it would "correct" my vertices to turn it into a regular grid. I was able to use curvilinear_to_SCRIP inside of NCL to write out a SCRIP grid file that I could then convert to a working mesh file. Using unstructured_to_ESMF inside of NCL didn't generate a mesh that I could use. One clue I could see in the final mesh file is that the nodeCount was 1600 (4x the number of points [400]), which shows that all of the points are isolated from each other. The mesh files that did NOT work all had a smaller number of total nodes than that, which meant that cells shared nodes with each other.
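For reference, a quick matlab check of the property Erik describes, assuming the standard ESMF unstructured-mesh dimension names nodeCount and elementCount:
info = ncinfo('lnd_mesh.nc');
dims = {info.Dimensions.Name};
nodeCount    = info.Dimensions(strcmp(dims,'nodeCount')).Length;
elementCount = info.Dimensions(strcmp(dims,'elementCount')).Length;
fprintf('nodeCount = %d, elementCount = %d\n', nodeCount, elementCount)
% A working sparse-grid mesh has nodeCount == 4*elementCount (1600 for 400 points);
% a smaller nodeCount means cells share vertices.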
@ekluzek I understand that you used curvilinear_to_SCRIP instead of ncks, but I didn't follow what you ran to go from SCRIP to successful MESH.
Seems like this connects with #1919 too. In general we need better documentation on how to do this.
@ekluzek and I met to compare notes (this issue #1731 vs. discussion #1919):
- We think that his ncks attempt failed because he applied it to a "domain" file.
- I showed that ncks still works on a file containing explicit lat/lon variables, such as the datm file 1981-01.nc in #1919.
- Erik, since your case starts from the default 2-deg grid, I think you could use the ncks command on an existing 2-deg fsurdat file. Then generate the mesh with mask=1 everywhere. Then modify the mesh to the PPE sparse grid. And so on as explained in #1919...
@ekluzek and @slevis-lmwg I tried running a sparse grid simulation using the steps we talked about on the call today. I got a bunch of ESMF errors. It seems like I should be following what @ekluzek did above?
I'm not sure how to do what you did above, Erik. Do you remember the steps you took?
My log files for the case can be found here:
/glade/scratch/afoster/ctsm51FATES_SP_OAAT_Control_2000/run
It looks like the issue is in datm, from the PET file as you point out. So one thing to check would be to see if just the change for MASK_MESH works. I think it should, so that would be good to try.
Another thing to try would be the datm mesh file I created /glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17/360x720_gswp3.0v1.c170606v4.dense400_ESMFmesh_c20221012.nc
which does differ from your file.
From reading above, I ran into trouble with just using ncks, because it would change the vertices on me, so I couldn't use files created with it. I think that might relate to the warnings we saw when we worked on this, which showed issues with the south pole.
(By the way the unclear messages from ESMF are another example of error checking that doesn't help you figure out the problem. I think this might be something where some better error checking could be done to help us figure out what's wrong).
You can also try the land mesh file I created...
/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17/fv1.9x2.5_sparse400_181205v4_ESMFmesh_c20220929.nc
NOTE: For this you would use it to replace ATM_DOMAIN_MESH and LND_DOMAIN_MESH, and leave MASK_MESH as it was before.
OR -- you would reverse the mask and use it for MASK_MESH, and leave LND_DOMAIN_MESH/ATM_DOMAIN_MESH as they were.
I set it up changing ATM_DOMAIN_MESH/LND_DOMAIN_MESH because that is what made sense to me. But, as we saw when talking with @slevis-lmwg, it's more general and simpler to swap out the MASK_MESH file (a sketch of reversing the mask follows below).
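If you go the MASK_MESH route, "reverse the mask" could be done on a copy of the mesh file with something like this matlab sketch; it assumes the standard elementMask variable, and I have not verified this is the exact step Erik had in mind:
% Work on a copy so the original mesh file is untouched.
copyfile('fv1.9x2.5_sparse400_181205v4_ESMFmesh_c20220929.nc','mesh_mask_reversed.nc');
emask = ncread('mesh_mask_reversed.nc','elementMask');
ncwrite('mesh_mask_reversed.nc','elementMask', 1 - emask);   % flip 0 <-> 1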
That mesh file is also different from yours, and it's not just the mask. So again maybe there was something going on with the ncks conversion?
@adrifoster I do see, four posts up, that I wrote, "We think that his ncks attempt failed because he applied it to a "domain" file," followed by a suggestion for how to resolve it. So the "domain" shortcut probably did not work.
If things continue to fail, let's meet again and go through the full process, as I recommend it above. Let's plan on an hour.
It looks like the issue is in datm, from the PET file as you point out. So one thing to check would be to see if just the change for MASK_MESH works. I think it should, so that would be good to try.
Okay, I removed the info in user_nl_datm_streams and submitted that (so the only thing that was changed was MASK_MESH). That failed for a different reason...
56: Which are NaNs = F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F T F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F F F F F F F F F
56: NaN found in field Sl_lfrin at gridcell index 88
56: ERROR: ERROR: One or more of the CTSM cap export_1D fields are NaN
87: # of NaNs = 3
87: Which are NaNs = F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
87: F F F F F F F F F F F F F F F F F F F F F F F F F F F F T F F F F F F F F F F F
87: F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F T
87: T F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
87: F F F F F F F F F F F F F F F F F F F F F F F F
87: NaN found in field Sl_lfrin at gridcell index 60
87: NaN found in field Sl_lfrin at gridcell index 111
87: NaN found in field Sl_lfrin at gridcell index 112
Will try your next suggestion next.
That still failed... Thanks @slevis-lmwg for joining the meeting with @mvertens this afternoon. @ekluzek do you want to join as well?
@adrifoster I will post my process here. Afterwards we can clean it up and post in #1919.
- Starting with this file:
/glade/u/home/forrest/ppe_representativeness/output_v4/clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc
- Generate landmask.nc:
a) In matlab (% are comments):
rcent = ncread('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','rcent');
landmask = round(~isnan(rcent)); % "~" means "not"
nccreate('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','landmask','Dimensions',{'lon',144,'lat',96})
ncwrite('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','landmask',landmask);
nccreate('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','mod_lnd_props','Dimensions',{'lon',144,'lat',96})
ncwrite('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','mod_lnd_props',landmask);
OOPS... Somehow I had write permissions and was able to add the new variables directly to the clusters file. I hope I didn't mess anything up for anyone!
- After matlab
mv clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc landmask.nc
ncks --rgr infer --rgr scrip=scrip.nc landmask.nc foo.nc
/glade/u/apps/ch/opt/esmf-netcdf/8.0.0/intel/19.0.5/bin/bing/Linux.intel.64.mpiuni.default/ESMF_Scrip2Unstruct scrip.nc lnd_mesh.nc 0
@adrifoster with the above, I am still attempting a shortcut: I generated the mesh file directly from the landmask file, in order to skip the mesh_mask_modifier step.
In the run where you point to the default atmosphere drivers (not the sparse version), set the three mesh paths to /glade/scratch/slevis/temp_work/sparse_grid/lnd_mesh.nc
@adrifoster says that the above seems to have worked, so we will next try to apply the sparse grid on the datm data, as well.
From discussing with @olyson:
The 1D datm domain file came from combining the 2D gswp3 data and the dense400 mask. So... in matlab I will take the 1D mask from the domain file
/glade/p/cgd/tss/people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc
and put it in a 2D mask in a 2D gswp3 file
/glade/p/cgd/tss/CTSM_datm_forcing_data/atm_forcing.datm7.GSWP3.0.5d.v1.1.c181207/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc
That will be our starting file for the rest of the steps.
- In matlab:
longxy = ncread('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','LONGXY');
latixy = ncread('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','LATIXY');
lon = ncread('../../../people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','xc');
lat = ncread('../../../people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','yc');
mask = zeros(720,360);
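% For each of the 400 sparse points, find the 0.5-deg cell whose LATIXY/LONGXY match the point's lat/lon and set the 2D mask to 1 there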
for cell = 1:400
[i,j] = find(latixy==lat(cell) & longxy==lon(cell));
mask(i,j) = 1;
end
In a copy of the atmosphere file, so as to avoid overwriting the original, still in matlab:
nccreate('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','landmask','Dimensions',{'lon',720,'lat',360})
ncwrite('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','landmask',mask);
nccreate('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','mod_lnd_props','Dimensions',{'lon',720,'lat',360})
ncwrite('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','mod_lnd_props',mask);
- After matlab
mv clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc landmask.nc
ncks --rgr infer --rgr scrip=scrip.nc landmask.nc foo.nc
/glade/u/apps/ch/opt/esmf-netcdf/8.0.0/intel/19.0.5/bin/bing/Linux.intel.64.mpiuni.default/ESMF_Scrip2Unstruct scrip.nc lnd_mesh.nc 0
@adrifoster this mesh, intended for datm, is located here:
/glade/scratch/slevis/temp_work/sparse_grid/datm_mesh
You already have the other one, but I should mention that I moved it here:
/glade/scratch/slevis/temp_work/sparse_grid/land_mesh
@adrifoster I hope the run still works when you now point to the dense400 datm files and the datm mesh that I generated (previous post).
Unfortunately that did not work.
see logs /glade/scratch/afoster/ctsm51FATES_SP_OAAT_Control_2000_nuopctest/run
case is /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest
Thoughts:
1. I think I see the problem:
- The datm data and the mct domain file are vectors with 400 elements.
- The mesh file that I created has data in 1D with mask = 1 according to the sparse grid (and 0 elsewhere), BUT for all the grid cells of the 360x720 domain (so it has 259200 elements).
2. So the next thing I would suggest, in case it works: Point to the default global data (as in the case that worked) but point to the datm_mesh/lnd_mesh.nc file that I generated, rather than the default datm mesh. If NUOPC does with datm and the mesh what it does with fsurdat files and the land mesh, then this should work.
3. Another idea (going back to the 400-element datm data) might be to take the datm_mesh/lnd_mesh.nc that I generated and come up with a 400-element version of it that contains only the data for mask = 1 (see the sketch after this list). This may or may not be easy. I could play with the file in matlab and see what I get. I don't know if I need to end up with exactly the same order of elements that appears in the datm files. I think NUOPC is smart enough to figure out the mapping, but maybe not...
- @adrifoster if (2) doesn't work, and you need to start simulations soon or now, I suggest starting them with the configuration that worked, because it may take longer to resolve this latest problem...
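For (3), a first step might look like the matlab sketch below. It assumes the standard ESMF unstructured-mesh variables elementMask and centerCoords (check the actual names and array shapes with ncdisp); the harder part of rebuilding elementConn/nodeCoords is not shown.
emask   = ncread('datm_mesh/lnd_mesh.nc','elementMask');
centers = ncread('datm_mesh/lnd_mesh.nc','centerCoords');  % coordinate pair per element
keep    = find(emask == 1);          % indices of the sparse-grid elements
numel(keep)                          % expect 400
centers_keep = centers(:,keep);      % centers of just those elements
% A true 400-element mesh would also need elementConn, numElementConn and
% nodeCoords subset and renumbered consistently -- that is the hard part.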
@adrifoster I will wait to hear whether (2) works and, if so, whether it's an acceptable solution before I start on (3).
Okay thanks! I am testing now whether adding just:
CLMGSWP3v1.TPQW:mapalgo = nn
CLMGSWP3v1.TPQW:meshfile =/glade/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh.nc
CLMGSWP3v1.Solar:mapalgo = nn
CLMGSWP3v1.Solar:meshfile =/glade/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh.nc
CLMGSWP3v1.Precip:mapalgo = nn
CLMGSWP3v1.Precip:meshfile =/glade/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh.nc
to user_nl_datm_streams will work
This seems to be working! Though it is quite slow... I'm not sure why
Hmm, does it seem slower than not using the datm_mesh file?
Yes, the intention is for the file to make things faster, because it's doing less reading and no interpolation. If it's slower, that defeats the purpose of doing this part.
Actually, hold on... I think I forgot to turn debug off.
Okay so it didn't really seem to change the timing much. In fact the one that used the mesh file for the DATM data was a bit slower:
Timing File for Case WITH DATM Mesh:
cat cesm_timing.ctsm51FATES_SP_OAAT_Control_2000_nuopctest.3834956.chadmin1.ib0.cheyenne.ucar.edu.231012-162352
---------------- TIMING PROFILE ---------------------
Case : ctsm51FATES_SP_OAAT_Control_2000_nuopctest
LID : 3834956.chadmin1.ib0.cheyenne.ucar.edu.231012-162352
Machine : cheyenne
Caseroot : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest
Timeroot : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest/Tools
User : afoster
Curr Date : Thu Oct 12 17:02:58 2023
Driver : CMEPS
grid : a%1.9x2.5_l%1.9x2.5_oi%null_r%null_g%null_w%null_z%null_m%gx1v7
compset : 2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV_SESP
run type : startup, continue_run = TRUE (inittype = FALSE)
stop option : nyears, stop_n = 10
run length : 3650 days (3650.0 for ocean)
component comp_pes root_pe tasks x threads instances (stride)
--------- ------ ------- ------ ------ --------- ------
cpl = cpl 72 36 72 x 1 1 (1 )
atm = datm 36 0 36 x 1 1 (1 )
lnd = clm 72 36 72 x 1 1 (1 )
ice = sice 72 36 72 x 1 1 (1 )
ocn = socn 72 36 72 x 1 1 (1 )
rof = srof 72 36 72 x 1 1 (1 )
glc = sglc 72 36 72 x 1 1 (1 )
wav = swav 72 36 72 x 1 1 (1 )
esp = sesp 1 0 1 x 1 1 (1 )
total pes active : 108
mpi tasks per node : 36
pe count for cost estimate : 108
Overall Metrics:
Model Cost: 7.01 pe-hrs/simulated_year
Model Throughput: 369.95 simulated_years/day
Init Time : 6.237 seconds
Run Time : 2335.434 seconds 0.640 seconds/day
Final Time : 1.449 seconds
Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components
TOT Run Time: 2335.434 seconds 0.640 seconds/mday 369.95 myears/wday
CPL Run Time: 282.852 seconds 0.077 seconds/mday 3054.60 myears/wday
ATM Run Time: 1108.467 seconds 0.304 seconds/mday 779.45 myears/wday
LND Run Time: 1308.653 seconds 0.359 seconds/mday 660.22 myears/wday
ICE Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
OCN Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
ROF Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
GLC Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
WAV Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
CPL COMM Time: 1110.305 seconds 0.304 seconds/mday 778.16 myears/wday
NOTE: min:max driver timers (seconds/day):
CPL (pes 36 to 107)
ATM (pes 0 to 35)
LND (pes 36 to 107)
ICE (pes 36 to 107)
OCN (pes 36 to 107)
ROF (pes 36 to 107)
GLC (pes 36 to 107)
WAV (pes 36 to 107)
ESP (pes 0 to 0)
Timing file for case WITHOUT DATM mesh:
cat cesm_timing.ctsm51FATES_SP_OAAT_Control_2000_nuopctest_noclim.3833019.chadmin1.ib0.cheyenne.ucar.edu.231012-153640
---------------- TIMING PROFILE ---------------------
Case : ctsm51FATES_SP_OAAT_Control_2000_nuopctest_noclim
LID : 3833019.chadmin1.ib0.cheyenne.ucar.edu.231012-153640
Machine : cheyenne
Caseroot : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest_noclim
Timeroot : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest_noclim/Tools
User : afoster
Curr Date : Thu Oct 12 16:14:10 2023
Driver : CMEPS
grid : a%1.9x2.5_l%1.9x2.5_oi%null_r%null_g%null_w%null_z%null_m%gx1v7
compset : 2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV_SESP
run type : startup, continue_run = TRUE (inittype = FALSE)
stop option : nyears, stop_n = 10
run length : 3650 days (3650.0 for ocean)
component comp_pes root_pe tasks x threads instances (stride)
--------- ------ ------- ------ ------ --------- ------
cpl = cpl 72 36 72 x 1 1 (1 )
atm = datm 36 0 36 x 1 1 (1 )
lnd = clm 72 36 72 x 1 1 (1 )
ice = sice 72 36 72 x 1 1 (1 )
ocn = socn 72 36 72 x 1 1 (1 )
rof = srof 72 36 72 x 1 1 (1 )
glc = sglc 72 36 72 x 1 1 (1 )
wav = swav 72 36 72 x 1 1 (1 )
esp = sesp 1 0 1 x 1 1 (1 )
total pes active : 108
mpi tasks per node : 36
pe count for cost estimate : 108
Overall Metrics:
Model Cost: 6.72 pe-hrs/simulated_year
Model Throughput: 385.70 simulated_years/day
Init Time : 5.998 seconds
Run Time : 2240.084 seconds 0.614 seconds/day
Final Time : 1.129 seconds
Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components
TOT Run Time: 2240.084 seconds 0.614 seconds/mday 385.70 myears/wday
CPL Run Time: 281.403 seconds 0.077 seconds/mday 3070.33 myears/wday
ATM Run Time: 1009.835 seconds 0.277 seconds/mday 855.59 myears/wday
LND Run Time: 1311.185 seconds 0.359 seconds/mday 658.95 myears/wday
ICE Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
OCN Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
ROF Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
GLC Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
WAV Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
CPL COMM Time: 1044.302 seconds 0.286 seconds/mday 827.35 myears/wday
NOTE: min:max driver timers (seconds/day):
CPL (pes 36 to 107)
ATM (pes 0 to 35)
LND (pes 36 to 107)
ICE (pes 36 to 107)
OCN (pes 36 to 107)
ROF (pes 36 to 107)
GLC (pes 36 to 107)
WAV (pes 36 to 107)
ESP (pes 0 to 0)
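For reference, quantifying "a bit slower" with the numbers copied from the two timing profiles above (a sketch in matlab, nothing more than the arithmetic):
atm_with    = 1108.467;  tot_with    = 2335.434;   % ATM and TOT Run Time [s], WITH datm mesh
atm_without = 1009.835;  tot_without = 2240.084;   % ATM and TOT Run Time [s], WITHOUT datm mesh
fprintf('ATM %.1f%% slower, total %.1f%% slower with the datm mesh\n', ...
        100*(atm_with/atm_without - 1), 100*(tot_with/tot_without - 1))
% prints roughly: ATM 9.8% slower, total 4.3% slower with the datm mesh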
@adrifoster - what kind of timings were you getting with MCT? @jedwards4b - why are we seeing about 385 ypd when the atm and land are running concurrently? I'm not sure where the cpl comm time is coming in. Can you have a look at this?