E3SM icon indicating copy to clipboard operation
E3SM copied to clipboard

EAMxx: site output from 1-to-1 extraction has more zeros and inconsistent values with native output at some locations

Open jsbamboo opened this issue 7 months ago • 11 comments

There was a problem in the site_nearest output in the wprrmxx tests after sparse output fixes:

  • site_nearest online remapping output with PR7350 fix or my local "map_file_min_col = 1;" fix is inconsistent with the manual extraction of ncol from native-grid output WHEN site_output_file's landfrac + ocnfrac is not equal to one.
  • site_output_file's landfrac and ocnfrac are wrong; native_output_file's landfrac and ocnfrac are correct.

I'm not sure if it's a problem in the horiz remapper (as it ONLY happens in the site_nearest output! No problem in the global -> 1x1 WP basin-average output, neither in site_ocean_weighted output, though all are sparse output related to the PR7350 fix) or it reflects other issues related to landfrac/ocnfrac. I guess it only happens during 1-to-1 direct-extraction from the native grid to the site output, i.e., there is no problem with site output when an average operator is applied to the source columns associated with a destination grid.

================= problem:

  1. site_nearest online remapping output with PR7350 fix or my local "map_file_min_col = 1;" fix is inconsistent with the manual extraction of ncol from native-grid output WHEN site_output_file's landfrac + ocnfrac is not equal to one;
    • different vars have different error patterns, e.g., diag_equiv_reflectivity is more wrong than ps and VapWaterPath
    • different runs (first build .vs. afterwords) give different results at some sites; see below for O9 vs. O10/O8 tests, 18/26/27/29 are different
    • different fixes (PR7350 vs. local map_file_min_col = 1 fix) give the same results; see below for O10 vs. O8 tests, except site's landfrac is -nan in O8 and 0 in O10

mapping file:

  • made it by the simplest extract-ncol-index method, so the col-1 in the mapping file corresponds to the ncol index in eamxx and native-grid's index
  • file location: /p/lustre1/zhang73/E3SM_simulations/100m/post/save.RRMicol-ISD_box0.2_1site_all_avg_ocn_land.WPRRM_19730101-20241231.WP10ne32x32v1.nearest.5.nc S = 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ; col = 78440, 93092, 30758, 45244, 48218, 85479, 30884, 30865, 26245, 26397, 26350, 83814, 25809, 64636, 62551, 78453, 25740, 28476, 28514, 26341, 26154, 26350, 26154, 30758, 30758, 45584, 28506, 28504, 48218, 28504, 30758, 85479 ;
  • ONLY happens in the site_nearest output!!! No problem in the global -> 1x1 WP basin-average output No problem in the global -> 1x1 WP basin-average output, neither in site_ocean_weighted output, though all are sparse output related to the PR7350 fix??

codebase:

  • WPRRMxx + p3-extra-diags + local fix of cosp_Nend + PR7350 (site_nearest output): /p/lustre2/zhang73/GitTmp/WPRRMxx_cospNend_coloffset_ctp

patterns (see more details in issue-site1to1_results.log):

  • P1 site_id = 3,4,5,7,11,13,14,15,17,18,25,26,27,28,29,31:
    • site's landfrac=0 or -nan, ocnfrac=0, pg2's landfrac=0 and ocnfrac=1; online-remapping = 0, site's ocnfrac actually should = 1? diag problem of ocnfrac and landfrac??
  • P2 site_id = 18,26:
    • site's landfrac=ocnfrac=0, pg2's landfrac=0 and ocnfrac=1; inconsistent (diag_equiv_reflectivity) or O (ps); similar to 3,4,5 but has a minor value
  • P3 site_id = 27,29:
    • site's landfrac=ocnfrac=0 or -nan, pg2's landfrac=0 and ocnfrac=1; inconsistent (diag_equiv_reflectivity) or 0 (ps, diag_equiv_reflectivity); similar to 18, 26 but has a big value
  • other notable issue:
    • all sites' landfrac only show missing-value at step 0, and will change to 0/nan after step 1
    • ocnfrac is only set correctly to 1 when landfrac = missing at step 0

summary/guess:

  • ocnfrac should be 1 for all error patterns P1-P3
  • landfrac shouldn't have nan (should be 0/missing-value in this case?)
  • ocnfrac / landfrac outputs for site online-mapping very likely have a problem? however, I didn't find any places using ocnfrac / landfrac in the horiz_interp_remapper_data
  • it might be an I/O problem invoked by isolated-ncol outputs, but no problem in other sparse output if the dstination grid is not 1v1 corresponding to native-grid's cols (i.e., not for 1-to-1 direct-extraction)?

================= check result details: issue-site1to1_results.log

check script: grid_WL.debug.horiz_remapper.site_nearest.ncl
; command: ncl 'run="O10"' grid_WL.debug.horiz_remapper.site_nearest.ncl > O10_check

; check_group = (/"O10-1","O10","O9","O8"/)
check_group=run
;------------------------------------------------------------------------------------------------------------
do ig=0,dimsizes(check_group)-1
    ; same filenames across group
    ; global pg2
    ; site_nearest
    file_b = "eamxx_wprrm.5minI.site_nearest.h.INSTANT.nmins_x5.2014-10-01-00000.nc"
    file_d = "/p/lustre1/zhang73/E3SM_simulations/100m/post/save.RRMicol-ISD_box0.2_1site_all_avg_ocn_land.WPRRM_19730101-20241231.WP10ne32x32v1.nearest.5.nc"
    ; site_ocean_weighted
    file_b1 = "eamxx_wprrm.5minI.site_ocean_weighted.h.INSTANT.nmins_x5.2014-10-01-00000.nc"
    file_d1 = "/p/lustre1/zhang73/E3SM_simulations/100m/post/save.RRMicol-ISD_box0.2_1site_all_avg_ocn_land.WPRRM_19730101-20241231.WP10ne32x32v1.ocean_weighted.5.nc"
    file_a = "eamxx_wprrm.5minI.pg2.h.INSTANT.nmins_x5.2014-10-01-00000.nc" 
;==================
if (check_group(ig).eq."O10-1")then 
    dir = "/p/lustre1/zhang73/E3SM_simulations/wprrmxx_p3/WPRRMxx_cospNend_ctp_PR7350_p3.WP10ne32x32v1pg2_WP10ne32x32v1pg2.F2010-SCREAMv1.dane/tests/1120x1_nhoursx1_UVTQ3h-s20141001-finicold-cosp3rad3tendx2-O10-1all/run/"
    file_c = "eamxx_wprrm.5minI.pg2.h.INSTANT.nmins_x5.2014-10-01-00000.nc" 
end if 
if (check_group(ig).eq."O10")then 
    dir = "/p/lustre1/zhang73/E3SM_simulations/wprrmxx_p3/WPRRMxx_cospNend_ctp_PR7350_p3.WP10ne32x32v1pg2_WP10ne32x32v1pg2.F2010-SCREAMv1.dane/tests/1120x1_nhoursx1_UVTQ3h-s20141001-finicold-cosp3rad3tendx2-O10all/run/"
    file_c = "eamxx_wprrm.5minI.pg2.h.INSTANT.nmins_x5.2014-10-01-00000.nc" 
end if 
if (check_group(ig).eq."O9")then 
    dir = "/p/lustre1/zhang73/E3SM_simulations/wprrmxx_p3/WPRRMxx_cospNend_ctp_PR7350_p3.WP10ne32x32v1pg2_WP10ne32x32v1pg2.F2010-SCREAMv1.dane/tests/1120x1_nhoursx1_UVTQ3h-s20141001-finicold-cosp3rad3tendx2-O9all/run/"
    file_c = "eamxx_wprrm.1hI.site_pg2_debug.h.INSTANT.nhours_x1.2014-10-01-00000.nc" 
end if 
if (check_group(ig).eq."O8")then 
    dir = "/p/lustre1/zhang73/E3SM_simulations/wprrmxx_p3/WPRRMxx_cospNend_coloffset_ctp_p3.WP10ne32x32v1pg2_WP10ne32x32v1pg2.F2010-SCREAMv1.dane/tests/1120x1_nhoursx1_UVTQ3h-s20141001-finicold-cosp3rad3tendx2-O8all/run/"
    file_c = "eamxx_wprrm.1hI.site_pg2_debug.h.INSTANT.nhours_x1.2014-10-01-00000.nc" 
end if 
print("dir    : "+dir    )
print("file_a : "+file_a )
print("file_c : "+file_c )
print("file_b : "+file_b )
print("file_d : "+file_d )
print("file_b1: "+file_b1)
print("file_d1: "+file_d1)
;------------------------------------------------------------------------------------------------------------
    ;---command
    a=addfile(dir+file_a,"r")
    c=addfile(dir+file_c,"r")
    b=addfile(dir+file_b,"r")
    d=addfile(file_d,"r")
    b1=addfile(dir+file_b1,"r")
    d1=addfile(file_d1,"r")
    colind=d->col -1
    if (check_group(ig).eq."O10")then 
    print(" // "+check_group(ig)+" [site_nearest] ps")
    print(a->ps(2,colind(:))+" vs. "+b->ps(2,:)+" ||| site's frac: "+b->landfrac(0,:)+", "+b->ocnfrac(1,:)+"; pg2's frac: "+c->landfrac(1,colind(:))+", "+c->ocnfrac(1,colind(:)))  
    print("")
    end if

    print(" // "+check_group(ig)+" [site_nearest] diag_equiv_reflectivity at lev 127 ")
    print(a->diag_equiv_reflectivity(2,colind(:),127)+" vs. "+b->diag_equiv_reflectivity(2,:,127)+" ||| site's frac: "+b->landfrac(0,:)+", "+b->ocnfrac(1,:)+"; pg2's frac: "+c->landfrac(1,colind(:))+", "+c->ocnfrac(1,colind(:)))   
    print("")

    print(" // "+check_group(ig)+" [site_ocean_weighted] diag_equiv_reflectivity at lev 127 ")
    do row_in_mapfile=1,31
        colind_dst3 := d1->col(ind(d1->row.eq.row_in_mapfile)) -1
        S_dst3 := d1->S(ind(d1->row.eq.row_in_mapfile))
        var_src_aavewgt = sum(a->diag_equiv_reflectivity(2,colind_dst3,127)*S_dst3)
        landfrac_src_aavewgt = sum(c->landfrac(1,colind_dst3)*S_dst3)
        ocnfrac_src_aavewgt = sum(c->ocnfrac(1,colind_dst3)*S_dst3)
        print("row_in_mapfile-1 = "+(row_in_mapfile-1)+": "+var_src_aavewgt +" vs. "+b1->diag_equiv_reflectivity(2,row_in_mapfile-1,127)+" ||| site's frac: "+b1->landfrac(0,row_in_mapfile-1)+", "+b1->ocnfrac(1,row_in_mapfile-1)+"; pg2's frac: "+landfrac_src_aavewgt+", "+ocnfrac_src_aavewgt)
    end do 
    print("")

    exit
end do ;check_group
;------------------------------------------------------------------------------------------------------------

jsbamboo avatar Jun 02 '25 13:06 jsbamboo

I'm not sure I follow. Some questions:

  • what do you mean with "it happens only at site_nearest output"? I guess I'm not sure what "site_nearest" mean.
  • what is "local map_file_min_col = 1 fix"?
  • can we see the actual map files?

bartgol avatar Jun 02 '25 21:06 bartgol

sure!

  • "site_nearest" means that the site output mapping file was created by finding the nearest ncol indexes to the target sites. for this mapping file, col and row have a 1-to-1 correspondence, or say, row contains no duplicate values. In contrast, for "site_ocean_weighted", each row corresponds to more than one cols (specifically, it was created by finding the columns within the 0.2 deg × 0.2 deg box where landfrac == 0.0, and computing weightes using the area_col/sum(area_col) for each row)

  • local map_file_min_col = 1 fix:

--- a/components/eamxx/src/share/grid/remap/horiz_interp_remapper_data.cpp
+++ b/components/eamxx/src/share/grid/remap/horiz_interp_remapper_data.cpp
@@ -95,7 +95,8 @@ get_my_triplets (const std::string& map_file) const
   int map_file_min_col = std::numeric_limits<int>::max();
   for (int id=0; id<nlweights; id++) {
     map_file_min_row = std::min(rows[id],map_file_min_row);
-    map_file_min_col = std::min(cols[id],map_file_min_col);
+    //map_file_min_col = std::min(cols[id],map_file_min_col);
+    map_file_min_col = 1;
   }
   int global_map_file_min_row, global_map_file_min_col;
   comm.all_reduce(&map_file_min_row,&global_map_file_min_row,1,MPI_MIN);
  • please find the mapping files dir:
    https://portal.nersc.gov/cfs/e3sm/zhang73/forsite/mapfiles/

    • the site_nearest map (1-to-1, gave problematic results):
      https://portal.nersc.gov/cfs/e3sm/zhang73/forsite/mapfiles/save.RRMicol-ISD_box0.2_1site_all_avg_ocn_land.WPRRM_19730101-20241231.WP10ne32x32v1.nearest.5.nc
    • the site_ocean_weighted map (N-to-1, no problem!):
      https://portal.nersc.gov/cfs/e3sm/zhang73/forsite/mapfiles/save.RRMicol-ISD_box0.2_1site_all_avg_ocn_land.WPRRM_19730101-20241231.WP10ne32x32v1.ocean_weighted.5.nc
  • the case dir here: https://portal.nersc.gov/cfs/e3sm/zhang73/forsite/WPRRMxx_cospNend_ctp_PR7350_p3.WP10ne32x32v1pg2_WP10ne32x32v1pg2.F2010-SCREAMv1.dane/tests/1120x1_nhoursx1_UVTQ3h-s20141001-finicold-cosp3rad3tendx2-O10-1all/

    • the yaml files: https://portal.nersc.gov/cfs/e3sm/zhang73/forsite/WPRRMxx_cospNend_ctp_PR7350_p3.WP10ne32x32v1pg2_WP10ne32x32v1pg2.F2010-SCREAMv1.dane/tests/1120x1_nhoursx1_UVTQ3h-s20141001-finicold-cosp3rad3tendx2-O10-1all/run/data/

jsbamboo avatar Jun 04 '25 14:06 jsbamboo

Looking at your site_nearest map file, I see

 n_a = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
    20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 
    38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 
    56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 
...

This makes me think that maybe your col var

 col = 78440, 93092, 30758, 45244, 48218, 85479, 30884, 30865, 26245, 26397, 
    26350, 83814, 25809, 64636, 62551, 78453, 25740, 28476, 28514, 26341, 
    26154, 26350, 26154, 30758, 30758, 45584, 28506, 28504, 48218, 28504, 
    30758, 85479 ;

contains 0-based indices. That is NOT what the horiz remapper expects. We are ASSUMING that the row/col indices in the map file are 1-based. I thought about giving the opportunity to specify the index base (0 or 1), but since most map files are generated with ncremap (or similar tools) which use 1-based indexing, I decided to just do the simple thing.

If you suspect this is the case (or have no idea and want to try), you could modify the map file, so that the col indices are all increased by 1. If that fixes it, then we know what the deal is.

bartgol avatar Jun 05 '25 18:06 bartgol

That was one of the initial guesses, however, the results show that many site values from the site_nearest online remapping are actually correct vs. the native-pg2 direct extraction (i.e., not all are wrong). In addition, the check script ensures that the "col - 1" operation is applied when reading the mapping file, as shown in "colind = d->col - 1"

please see the test results below using the modified map file where all column indices were increased by 1, which produced a similar pattern (though of course, the specific values changed accordingly vs those in the previous tests as the column indices changed):

  • the modified map:
    file_d : /p/lustre1/zhang73/E3SM_simulations/100m/post/save.RRMicol-ISD_box0.2_1site_all_avg_ocn_land.WPRRM_19730101-20241231.WP10ne32x32v1.nearest_add1.5.nc

ncdump -v col comparison for the previous site_nearest map vs. the modified site_nearest_add1 map: (site_nearest) col = 78440, 93092, 30758, 45244, 48218, 85479, 30884, 30865, 26245, 26397, ... (site_nearest_add1) col = 78441, 93093, 30759, 45245, 48219, 85480, 30885, 30866, 26246, 26398, ...

O11_check
  1  Copyright (C) 1995-2019 - All Rights Reserved
  2  University Corporation for Atmospheric Research
  3  NCAR Command Language Version 6.6.2
  4  The use of this software is governed by a License Agreement.
  5  See http://www.ncl.ucar.edu/ for more details.
  6 (0)     dir    : /p/lustre1/zhang73/E3SM_simulations/wprrmxx_p3/WPRRMxx_cospNend_ctp_PR7350_p3.WP10ne32x32v1pg2_WP10ne32x32v1pg2.F2010-SCREAMv1.dane/tests/1120x1_nhoursx1_UVTQ3h-s20141001-finicold-cosp3rad3    tendx2-O11all/run/
  7 (0)     file_a : eamxx_wprrm.5minI.pg2.h.INSTANT.nmins_x5.2014-10-01-00000.nc
  8 (0)     file_c : eamxx_wprrm.5minI.pg2.h.INSTANT.nmins_x5.2014-10-01-00000.nc
  9 (0)     file_b : eamxx_wprrm.5minI.site_nearest_add1.h.INSTANT.nmins_x5.2014-10-01-00000.nc
 10 (0)     file_d : /p/lustre1/zhang73/E3SM_simulations/100m/post/save.RRMicol-ISD_box0.2_1site_all_avg_ocn_land.WPRRM_19730101-20241231.WP10ne32x32v1.nearest_add1.5.nc
 11 (0)     file_b1: eamxx_wprrm.5minI.site_ocean_weighted.h.INSTANT.nmins_x5.2014-10-01-00000.nc
 12 (0)     file_d1: /p/lustre1/zhang73/E3SM_simulations/100m/post/save.RRMicol-ISD_box0.2_1site_all_avg_ocn_land.WPRRM_19730101-20241231.WP10ne32x32v1.ocean_weighted.5.nc
 13 (0)      // O11 [site_nearest] diag_equiv_reflectivity at lev 127
 14 (0)     7.43995 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 15 (1)     -11.4951 vs. -11.4951 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 16 (2)     -36.9897 vs. -36.9897 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 17 (3)     2.12756 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 18 (4)     37.1854 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0.905096, 0.0949038
 19 (5)     5.43674 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 20 (6)     5.14271 vs. 5.14271 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 21 (7)     7.96691 vs. 1.44261 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0.317978, 0.682022
 22 (8)     -0.534533 vs. -0.534533 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 23 (9)     21.9402 vs. 21.9402 ||| site's frac: 3.40282e+33, 0.623479; pg2's frac: 0.376521, 0.623479
 24 (10)    -10.0606 vs. -10.0606 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 25 (11)    -10.15 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 26 (12)    -36.9897 vs. -36.9897 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 27 (13)    9.97578 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 28 (14)    -21.7808 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 29 (15)    7.17955 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 30 (16)    4.39431 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 31 (17)    1.44261 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 32 (18)    7.29138 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 33 (19)    -2.57307 vs. -2.57307 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 34 (20)    24.2374 vs. 24.2374 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 35 (21)    -10.0606 vs. -10.0606 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 36 (22)    24.2374 vs. 24.2374 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 37 (23)    -36.9897 vs. -36.9897 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 38 (24)    -36.9897 vs. -36.9897 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 39 (25)    -36.9897 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 40 (26)    7.04975 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 41 (27)    6.81085 vs. 0.212253 ||| site's frac: 0, 0; pg2's frac: 0, 1
 42 (28)    37.1854 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0.905096, 0.0949038
 43 (29)    6.81085 vs. 0.206512 ||| site's frac: 0, 0; pg2's frac: 0, 1
 44 (30)    -36.9897 vs. -36.9897 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 45 (31)    5.43674 vs. 0 ||| site's frac: 0, 0; pg2's frac: 0, 1
 46 (0)
 47 (0)      // O11 [site_ocean_weighted] diag_equiv_reflectivity at lev 127
 48 (0)     row_in_mapfile-1 = 0: 7.11385 vs. 7.11385 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 49 (0)     row_in_mapfile-1 = 1: -11.6074 vs. -11.6074 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 50 (0)     row_in_mapfile-1 = 2: -36.9897 vs. -36.9897 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 51 (0)     row_in_mapfile-1 = 3: 0.426398 vs. 0.426398 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 52 (0)     row_in_mapfile-1 = 4: 32.7235 vs. 32.7235 ||| site's frac: 3.40282e+33, 0.959827; pg2's frac: 0.0401732, 0.959827
 53 (0)     row_in_mapfile-1 = 5: 1.16704 vs. 1.16704 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 54 (0)     row_in_mapfile-1 = 6: 3.52491 vs. 3.52491 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 55 (0)     row_in_mapfile-1 = 7: -3.49332 vs. -3.49332 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 56 (0)     row_in_mapfile-1 = 8: 30.5303 vs. 30.5303 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 57 (0)     row_in_mapfile-1 = 9: -8.11873 vs. -8.11873 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 58 (0)     row_in_mapfile-1 = 10: -10.3518 vs. -10.3518 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 59 (0)     row_in_mapfile-1 = 11: -36.9897 vs. -36.9897 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 60 (0)     row_in_mapfile-1 = 12: 4.32098 vs. 4.32098 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 61 (0)     row_in_mapfile-1 = 13: -15.2885 vs. -15.2885 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 62 (0)     row_in_mapfile-1 = 14: 7.15237 vs. 7.15237 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 63 (0)     row_in_mapfile-1 = 15: -1.44157 vs. -1.44157 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 64 (0)     row_in_mapfile-1 = 16: 5.01134 vs. 5.01134 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 65 (0)     row_in_mapfile-1 = 17: 3.91312 vs. 3.91312 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 66 (0)     row_in_mapfile-1 = 18: -8.67066 vs. -8.67066 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 67 (0)     row_in_mapfile-1 = 19: 17.5825 vs. 17.5825 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 68 (0)     row_in_mapfile-1 = 20: -8.11873 vs. -8.11873 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 69 (0)     row_in_mapfile-1 = 21: 17.5825 vs. 17.5825 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 70 (0)     row_in_mapfile-1 = 22: -36.9897 vs. -36.9897 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 71 (0)     row_in_mapfile-1 = 23: -36.9897 vs. -36.9897 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 72 (0)     row_in_mapfile-1 = 24: -36.9897 vs. -36.9897 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 73 (0)     row_in_mapfile-1 = 25: 7.10131 vs. 7.10131 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 74 (0)     row_in_mapfile-1 = 26: 4.21952 vs. 4.21952 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 75 (0)     row_in_mapfile-1 = 27: 32.7235 vs. 32.7235 ||| site's frac: 3.40282e+33, 0.959827; pg2's frac: 0.0401732, 0.959827
 76 (0)     row_in_mapfile-1 = 28: 4.21952 vs. 4.21952 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 77 (0)     row_in_mapfile-1 = 29: -36.9897 vs. -36.9897 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1
 78 (0)     row_in_mapfile-1 = 30: 1.16704 vs. 1.16704 ||| site's frac: 3.40282e+33, 1; pg2's frac: 0, 1

jsbamboo avatar Jun 06 '25 15:06 jsbamboo

That's bizarre. I would have to spend some time to debug this issue. Unfortunately I'm leaving for a 3w vacation tomorrow, but I can investigate upon my return.

bartgol avatar Jun 06 '25 15:06 bartgol

thanks a lot for your support - no rush and really appreciate your time!

jsbamboo avatar Jun 06 '25 16:06 jsbamboo

@jsbamboo what's your conclusion on this? Any updates?

Does this introduce scientifically invalid output in your cases?

mahf708 avatar Jun 17 '25 23:06 mahf708

@mahf708 thanks for checking this! There is no update on my side. The conclusion is that in a corner case—specifically, when using the 1-to-1 extraction method with online remapping—the site output files contain SOME sites with zeros or other incorrect values, while other sites are correct. This was verified by comparing the site output files with the global pg2 output. In my cases, yes, this leads to scientifically invalid results.

The most telling clue is that all the incorrect sites (ncols) also have incorrect landfrac and ocnfrac values. Based on this, I summarized the condition/pattern at the beginning:

WHEN site_output_file's landfrac + ocnfrac is not equal to one

I’d also like to note that both the PR #7350 fix and the local map_file_min_col = 1 fix have resolved most of the sparse output issues described in #7346. The problem I'm reporting here is a new issue that appeared after those fixes were applied. It may not be due to online remapping itself, but rather something related to landfrac and ocnfrac.

Please let me know if anything is unclear. I realized this is a tough issue to explain, and it's a subtle corner case

jsbamboo avatar Jun 18 '25 03:06 jsbamboo

@jsbamboo thanks!!!

Can you describe this part in a bit more detail:

corner case—specifically, when using the 1-to-1 extraction method with online remapping—

I am not sure I follow exactly what you mean by this. What would be the default (non-corner case)?

mahf708 avatar Jun 18 '25 12:06 mahf708

@mahf708 yw! let me try to describe how these mapping files are created, hopefully this helps clarify things a bit...

Group A. Non-corner cases (i.e., all values are correct) include:

  1. Regional average mapping file: maps a regional pg2 area to a single destination gridcell, effectively acting as one “big-site” output. Generated by ncremap:

ncremap -a nco -s WP10ne32x32v1pg2_scrip.nc -g 1x1_160E170E2.5N12.5N_SCRIP_ncremap.nc -m map_WP10ne32x32v1pg2_to_1x1_160E170E2.5N12.5N.nco.20250406.nc col (source gridcells): the ncol indexes of the pg2 grid within the target region row (destination gridcell): always 1 (repeated), i.e., multiple source gridcells to the same destination gridcell Requires horizontal averaging? Yes

File path on NERSC: /global/cfs/cdirs/e3sm/www/zhang73/forsite/mapfiles/map_WP10ne32x32v1pg2_to_1x1_160E170E2.5N12.5N.nco.20250406.nc

  1. Box-averaged mapping per site: for each site, a small region of pg2 gridcells is averaged into one destination gridcell. Similar to group A1 but multiple sites are handled in one file. Generated by a script:

Input: lat-lon coordinates of the target sites box size: 0.2 deg × 0.2 deg col: the ncol indexes of the pg2 grid within the box around each site row: values range from 1 to N(sites), with multiple entries per site Requires horizontal averaging? Yes

File paths on NERSC: /global/cfs/cdirs/e3sm/www/zhang73/forsite/mapfiles/save.RRMicol-ISD_box0.2_1site_all_avg_ocn_land.WPRRM_19730101-20241231.WP10ne32x32v1.ocean_weighted.5.nc /global/cfs/cdirs/e3sm/www/zhang73/forsite/mapfiles/save.RRMicol-ISD_box0.2_1site_all_avg_ocn_land.WPRRM_19730101-20241231.WP10ne32x32v1.land_weighted.5.nc /global/cfs/cdirs/e3sm/www/zhang73/forsite/mapfiles/save.RRMicol-ISD_box0.2_1site_all_avg_ocn_land.WPRRM_19730101-20241231.WP10ne32x32v1.box_weighted.5.nc

Group B. The corner case:

  1. Per-site nearest-neighbor (1-to-1) extraction: each site is mapped from one pg2 source gridcell to one destination gridcell, applied across multiple sites. Generated by a script:

Input: same as group A2, lat-lon coordinates of the target sites col: the unique ncol index of the pg2 grid corresponding to each site row: values range from 1 to N (sites), with one unique entry per site Requires horizontal averaging? No (direct extraction). In this case, the weight matrix S contains only 1s—each destination gridcell receives the value from its nearest source gridcell.

File path on NERSC: /global/cfs/cdirs/e3sm/www/zhang73/forsite/mapfiles/save.RRMicol-ISD_box0.2_1site_all_avg_ocn_land.WPRRM_19730101-20241231.WP10ne32x32v1.nearest.5.nc

Please find below the core function of the script used to generate all these site output (group A2 and group B1) mapping files:

def save_mapping_netcdf
def save_mapping_netcdf(output_file, station_names, station_ids, station_lat, station_lon,
                        nearest_icol, box_ncol_arr, icol_arr, marea_arr, mlats_arr, mlons_arr, mlandfrac_arr, ncol):
    mappings = defaultdict(lambda: defaultdict(list))  # mapping[mode] = dict of arrays

    n_stations = len(station_ids)
    for mode in ["nearest", "box_all", "box_weighted", "ocean_weighted", "land_weighted"]:
        row_list, col_list, S_list = [], [], []
        lat_list, lon_list, site_list = [], [], []
        station_name_list, station_id_list, station_lat_list, station_lon_list = [], [], [], []

        current_row = 1
        for i in range(n_stations):
            site_id = i + 1  # make it 1-based
            station_name = station_names[i]
            station_id = station_ids[i]
            slat = station_lat[i]
            slon = station_lon[i]
            weights = [1.0]

            if mode == "nearest":
                icols = [nearest_icol[i]]
                lats = [slat]
                lons = [slon]
            else:
                if mode == "box_all":
                    flag = 0
                elif mode == "ocean_weighted":
                    flag = 1
                elif mode == "land_weighted":
                    flag = 2
                else:
                    flag = 0  # box_weighted uses all grids

                n_valid = box_ncol_arr[flag, i]
                if n_valid == 0:
                    continue  # skip sites with no valid grid points

                icols_all = icol_arr[flag, i]
                areas_all = marea_arr[flag, i]
                lats_all = mlats_arr[flag, i]
                lons_all = mlons_arr[flag, i]

                #--- valid grid points:icol >= 0 ---
                valid = icols_all >= 0
                icols = icols_all[valid]
                areas = areas_all[valid]
                lats = lats_all[valid]
                lons = lons_all[valid]

                if mode == "box_all":
                    weights = np.ones_like(icols, dtype=np.float32)
                else:
                    total_area = np.sum(areas)
                    weights = areas / total_area if total_area > 0 else np.zeros_like(areas)

            if mode == "box_all":
                #--- box_all: src gridcells within the box -> one unique row ---
                for c, s, lat, lon in zip(icols, weights, lats, lons):
                    row_list.append(len(row_list) + 1)  # each row is unique
                    col_list.append(c)
                    S_list.append(s)
                    lat_list.append(lat)
                    lon_list.append(lon)
                    site_list.append(site_id)
                    station_name_list.append(station_name)
                    station_id_list.append(station_id)
                    station_lat_list.append(slat)
                    station_lon_list.append(slon)
            else:
                #--- weighted: multiple src grid -> each row, S being normalized ---
                #--- "nearest" degenerates to unique col -> unique row ---
                for c, s, lat, lon in zip(icols, weights, lats, lons):
                    row_list.append(current_row)
                    col_list.append(c)
                    S_list.append(s)
                    lat_list.append(lat)
                    lon_list.append(lon)
                    site_list.append(site_id)
                    station_name_list.append(station_name)
                    station_id_list.append(station_id)
                    station_lat_list.append(slat)
                    station_lon_list.append(slon)

                current_row += 1

        mappings[mode]["row"] = np.array(row_list, dtype=np.int32)
        mappings[mode]["col"] = np.array(col_list, dtype=np.int32)
        mappings[mode]["S"] = np.array(S_list, dtype=np.float32)
        mappings[mode]["lat"] = np.array(lat_list, dtype=np.float32)
        mappings[mode]["lon"] = np.array(lon_list, dtype=np.float32)
        mappings[mode]["site_ID"] = np.array(site_list, dtype=np.int32)
        mappings[mode]["station_name"] = np.array(station_name_list, dtype="U64")  # Unicode
        mappings[mode]["station_id"] = np.array(station_id_list, dtype=np.int64)
        mappings[mode]["station_lat"] = np.array(station_lat_list, dtype=np.float32)
        mappings[mode]["station_lon"] = np.array(station_lon_list, dtype=np.float32)

    #--- writ to netcdf ---
    for mode, data in mappings.items():
        mode_outfile = output_file.replace(".nc", f".{mode}.nc")
        n_s = len(data["row"])
        n_a = ncol
        n_b = n_s
        ds_map = xr.Dataset(
            {
                "col": ("n_s", data["col"]),
                "row": ("n_s", data["row"]),
                "S": ("n_s", data["S"]),
                "lat": ("n_s", data["lat"]),
                "lon": ("n_s", data["lon"]),
                "site_ID": ("n_s", data["site_ID"]),
                "station_name": ("n_s", data["station_name"]),
                "station_id": ("n_s", data["station_id"]),
                "station_lat": ("n_s", data["station_lat"]),
                "station_lon": ("n_s", data["station_lon"]),
            },
            coords={
                "n_s": np.arange(n_s),
                "n_a": np.arange(n_a),
                "n_b": np.arange(n_b),
            }
        )
        ds_map.to_netcdf(mode_outfile)

        mode_outfile_cdf5 = output_file.replace(".nc", f".{mode}.5.nc")
        convert_to_cdf5_ncks(mode_outfile, mode_outfile_cdf5)
        print(f"[{mode}] Mapping NetCDF saved to: {mode_outfile_cdf5}")

jsbamboo avatar Jun 18 '25 16:06 jsbamboo

I just found that the "site_box_all" remapped stream also produces correct results. This output includes all pg2 gridcells within the box surrounding each site, as implemented in the save_mapping_netcdf function.

That’s strange, because "site_nearest" and "site_box_all" use essentially the same abstract mapping method and data structure—namely, {col, row, S}. Both require no horizontal averaging (i.e., 1-to-1 extraction), and in both cases the weight matrix S contains only 1s. The only difference is the sample size along the dimension of S: it's 1063 in "site_box_all", versus only 32 in "site_nearest".

To test whether the sample size matters, I generated another mapping file, "site_nearest_1088", by manually repeating the sites 34 times—so that the sample size of the nearest mapping file approximates that of "site_box_all". The simulation and verification results have been uploaded to NERSC: /global/cfs/cdirs/e3sm/www/zhang73/forsite/WPRRMxx_cospNend_ctp_PR7350_massflux_p3.WP10ne32x32v1pg2_WP10ne32x32v1pg2.F2010-SCREAMv1.dane/tests/1120x1_nhoursx1_UVTQ3h-s20141001-finicold-cosp3rad3tendx2-O15all/run/

Surprisingly, "site_nearest_1088" works !!?

This suggests that the behavior of "site_nearest" depends on the sample size in the mapping file (yeah... I don't really get it :(

@mahf708 anyway, for practical purposes, "site_box_all" is sufficient and valid for use (and "site_nearest" is just a subset of it, we can handle this in postprocessing). Although the underlying issue is still unclear, do you think we should tag this as low priority?

jsbamboo avatar Jun 20 '25 14:06 jsbamboo

@jsbamboo Have you tried a sample sizes between 32 and 1063? I know it's a lot of work, so no worries if you want to punt, but it may be interesting to try a few other sizes, like 62, 128, 256. Can you remind me what was the grid size in this problem, and how many ranks you ran with? Perhaps as the size increases, we get to a point where all ranks have 1+ site and everything works? That would point to an issue with how we handle a map file when certain ranks own zero points in the tgt grid... Long shot though...

bartgol avatar Jul 01 '25 22:07 bartgol

@bartgol OMG, it's exactly around N = 32 😱 (how lucky am I…) everything works correctly when N >= 34, but weird things start to happen when N <= 32..

for N = 33 (only one site was repeated), one test passed completely when I repeated siteA, but in another test where I repeated siteB, the last site turned out to be incorrect (zero)... Then, when I repeated siteB twice to bring N = 34, everything worked correctly 😹 I really can't wrap my head around this weird behavior

the test result is put on nersc

/global/cfs/cdirs/e3sm/www/zhang73/forsite/WPRRMxx_cospNend_ctp_PR7350_massflux_p3.WP10ne32x32v1pg2_WP10ne32x32v1pg2.F2010-SCREAMv1.dane/tests/1120x1_nhoursx1_UVTQ3h-s20141001-finicold-cosp3rad3tendx2-O20all/run/O20*

e.g., vimdiff O20_33-v2 O20_33 ->to see the difference at the last site (repeat siteB vs. siteA) vimdiff O20_33-v2 O20_34-33v2x2 -> to compare against the N = 34 version with siteB repeated twice

jsbamboo avatar Jul 02 '25 14:07 jsbamboo

Looking at your env files, I see NTASKS=1120, which means plenty of ranks are without a col in the tgt grid, even for N>32. So my speculation that the issue may be related to not handling correctly the case of a (locally) empty tgt grid is incorrect.

I don't really understand where the issue may be coming from. We can keep the issue open, and if you get more insight/data, feel free to add a comment. Meanwhile, considering you said the "site_box_all" is working and good enough for your needs, I am afraid I have to move to other more pressing tasks.

bartgol avatar Jul 02 '25 17:07 bartgol

thank you for sharing your thoughts! yes, let's put this issue on hold — no problem at all!

jsbamboo avatar Jul 02 '25 19:07 jsbamboo