
Double Precision WRF Build Causes Metgrid Failure

Open kkeene44 opened this issue 7 years ago • 13 comments

When WRF is built with double precision, metgrid fails with the error "Require grid spacing (dx) in meters to be positive! ERROR: MAP_INIT". The error is raised in module_map_utils.F. namelist.wps declares dx = 30000 and map_proj = 'lambert', but print statements in module_map_utils.F confirm that dx is being read as 0.0000000E+00.
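For reference, the relevant entries in namelist.wps look like the following (dx and map_proj are from the report; dy and the remaining formatting are illustrative):

      &geogrid
       map_proj = 'lambert',
       dx = 30000,
       dy = 30000,
      /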

kkeene44 avatar Mar 10 '17 21:03 kkeene44

Are the values of the DX and DY global attributes correct in the geogrid output files?

mgduda avatar Mar 10 '17 21:03 mgduda

Yes, they are.

kkeene44 avatar Mar 10 '17 21:03 kkeene44

I found that the placement of the attribute read calls for DX and DY in input_module.F affected a problem that might be related to this one. Placing them at the bottom, below the read for sr_y, seems to fix the issue. I can't explain why, but it might be worth trying here.

Apologies if this has already been resolved.

pblossey avatar Sep 09 '18 23:09 pblossey

> I found that the placement of the attribute read calls for DX and DY in input_module.F affected a problem that might be related to this one. Placing them at the bottom, below the read for sr_y, seems to fix the issue. I can't explain why, but it might be worth trying here.
>
> Apologies if this has already been resolved.

This works for me, thanks! Can someone explain why?

zhihua-zheng avatar Feb 27 '19 16:02 zhihua-zheng

I have the same problem, but your suggested fix doesn't work for me. After a bit of debugging, I found that dx seems to be reset to 0.0 during the call to ext_ncd_get_dom_ti_real (via ext_get_dom_ti_real_scalar in input_module.F) for dy.

This, to me, reeks of memory shenanigans, but my Fortran is not good enough to figure out where the issue lies. I tried to work around it with a temporary variable that stores the value of dx and reassigns it after the NetCDF read for dy, but that breaks with a segfault, adding further to my memory-issue hunch:

[XXXX:1031216:0:1031216] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x17cf030)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x174f0) [0x7fc7555a74f0]
    1  /lib64/libucs.so.0(+0x176a2) [0x7fc7555a76a2]
    2  ./metgrid.exe() [0x43bd75]
    3  ./metgrid.exe() [0x410eb0]
    4  ./metgrid.exe() [0x45784e]
    5  ./metgrid.exe() [0x45c20d]
    6  ./metgrid.exe() [0x40ce12]
    7  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fc767e7c3d5]
    8  ./metgrid.exe() [0x404315]
===================

lpilz avatar Jun 04 '20 11:06 lpilz

After looking at the code, I saw that ext_ncd_get_dom_ti_real expects a real array as input, but ext_get_dom_ti_real_scalar passes it a scalar. See ext_ncd_get_dom_ti_real here (in a different WRF GitHub repository): https://github.com/cloudruninc/wrf/blob/b80a9ddc39590e316d9f16cec05564e0428063bb/external/io_netcdf/wrf_io.F90#L1758

The WPS version of ext_get_dom_ti_real_scalar is here: https://github.com/wrf-model/WPS/blob/b98f858ec4e356b8eff9e2ddcc797890f16734ce/metgrid/src/input_module.F#L715
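Condensed, the mismatch looks like this (argument names on the WRF side follow wrf_io.F90; treat the excerpt as a sketch rather than verbatim code):

      ! WRF side (wrf_io.F90): the Data dummy argument is an array
      subroutine ext_ncd_get_dom_ti_real(DataHandle, Element, Data, Count, OutCount, Status)
         real, intent(out) :: Data(*)
         ...

      ! WPS side (input_module.F): the actual argument is a scalar
      real :: var_value
      call ext_ncd_get_dom_ti_real(handle, trim(var_name), var_value, 1, outcount, istatus)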

While I haven't tested this, I would suggest creating a real array of length one in ext_get_dom_ti_real_scalar, passing that into ext_ncd_get_dom_ti_real, and then using the returned value to set var_value after completion. Some debuggers will complain if a scalar is passed where a real array is expected. I agree that it's a memory issue; perhaps this is what's causing it, but memory issues can be hard to chase down, so I'm not sure.

pblossey avatar Jun 04 '20 17:06 pblossey

This seems to work. @pblossey thanks so much for the heads up. I just created a PR :)

lpilz avatar Jun 07 '20 11:06 lpilz

@kkeene44 I'm just getting around to investigating. WRF was compiled with double-precision reals, but was the WPS also compiled with default double-precision reals?

As @pblossey recommended, and as @lpilz implemented in PR #155, I think the correct solution here is to create temporary arrays for scalar attributes in calls to, e.g., ext_ncd_get_dom_ti_real. I'm not sure why this would only affect reals, since the integer and logical versions of the attribute setting/getting routines expect arrays as well.

mgduda avatar Jul 24 '20 16:07 mgduda

@lpilz @mgduda Thanks so much for testing, implementing and following through with this fix.

I did have one small suggestion regarding the changes to input_module.F. Since var_value is intent(out), it may be undefined at the start of ext_get_dom_ti_real_scalar, so it might be best not to initialize var_arr(1) to var_value at the top of the routine. With full debugging and checking turned on, this might cause a debugger to trip when var_arr(1) is set from an undefined variable. I haven't tested this, so I'm not positive, but I wanted to mention it.

I have no idea why this is only an issue for reals and not logicals or integers.

pblossey avatar Jul 24 '20 17:07 pblossey

I spent a bit of time trying to better understand this issue, and in summary, the issue seems to be that when WRF is compiled with default double-precision reals, the ext_ncd_get_dom_ti_real routine expects the Data argument to refer to an array of 64-bit values, while the WPS (being compiled with default single-precision reals) is passing a reference to a 32-bit value. One specific consequence of this is that, when we attempt to read the DY attribute, memory holding the dx variable gets overwritten, leading to problems later in the execution of metgrid.
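The mechanism can be reproduced outside the WPS with a contrived two-unit sketch (none of this is WPS code; the common block just forces dy and dx to be adjacent, and the implicit interface mimics the unchecked call into the WRF I/O library):

      ! caller: compiled with default 32-bit reals, like the WPS
      program clobber_demo
         implicit none
         real(4) :: dy, dx
         common /attrs/ dy, dx    ! dy first, so dx occupies the 4 bytes after it
         dx = 30000.0
         dy = 0.0
         call set_attr(dy)        ! implicit interface: no type/kind checking
         print *, 'dx =', dx, ' dy =', dy   ! both now hold pieces of the 8-byte write
      end program clobber_demo

      ! callee: stands in for WRF built with -r8 / -fdefault-real-8
      subroutine set_attr(val)
         implicit none
         real(8) :: val           ! expects 64 bits of storage...
         val = 42.0_8             ! ...so this writes 8 bytes through a 4-byte reference
      end subroutine set_attr

(Recent gfortran releases actually reject this mismatch when both units are in one file; compiling with -fallow-argument-mismatch, or splitting the units into separate files, reproduces the unchecked situation, which is exactly the checking the WPS build doesn't get.)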

This overwriting of memory for other attributes can be avoided by allocating local memory in the metgrid program's ext_get_dom_ti_real_scalar routine that is presumed to be "far enough" away from the storage for other variables that the overwriting of four bytes of data doesn't present a problem; if this temporary variable is allocated at the end of a stack frame, overwriting four bytes beyond its storage seems to be harmless. This is backed up by the fact that, in my testing, I don't need to allocate a temporary array, only a scalar variable; i.e., rather than

      real, intent(out) :: var_value
...   
      real, dimension(1) :: var_arr
...
         call ext_ncd_get_dom_ti_real(handle, trim(var_name), &
                                         var_arr, &
                                         1, outcount, istatus)
...
      var_value = var_arr(1)

I'm able to get correct results with

      real, intent(out) :: var_value
...   
      real :: temp
...
         call ext_ncd_get_dom_ti_real(handle, trim(var_name), &
                                         temp, &
                                         1, outcount, istatus)
...
      var_value = temp

I'm not disagreeing that we should pass an array actual argument to an array dummy argument; the purpose of the above test was simply to demonstrate that the underlying issue seems to be one of WRF's code writing 64 bits of data to a location that is associated with a 32-bit variable.

mgduda avatar Jul 24 '20 22:07 mgduda

The previous comment highlights what I see as a fundamental problem with the WPS's use of the WRF I/O API library: because the compiler has no explicit interfaces available to it when compiling the WPS, there is no way to check that the types of actual arguments in the WPS code match the types of the dummy arguments in the WRF I/O API implementation. We could declare interfaces in the WPS code, but those interfaces would have no way of knowing whether the default real type in WRF was promoted to 64-bits (e.g., with compiler flags like -r8 or -fdefault-real-8).
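For concreteness, an explicit interface in the WPS might look like the sketch below (argument names follow wrf_io.F90; the intents are illustrative). The catch is in the Data declaration: "real" resolves to the WPS's default kind at compile time, so the interface has no way to express that WRF's default real may have been promoted to 64 bits:

      interface
         subroutine ext_ncd_get_dom_ti_real(DataHandle, Element, Data, &
                                            Count, OutCount, Status)
            integer,          intent(in)  :: DataHandle
            character(len=*), intent(in)  :: Element
            real,             intent(out) :: Data(*)   ! WPS default kind, not necessarily WRF's
            integer,          intent(in)  :: Count
            integer,          intent(out) :: OutCount
            integer,          intent(out) :: Status
         end subroutine ext_ncd_get_dom_ti_real
      end interface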

@davegill Do you have any opinions on how we might address this? Do you recall whether it was an explicit design choice to not place the I/O API implementations in Fortran modules in WRF?

mgduda avatar Jul 24 '20 22:07 mgduda

As to whether the use of local variables as proposed in PR #155 to hide the effect of memory overwriting is sufficient for now, my opinion is that if the issue is not urgent, it would be nice to develop a more thorough approach. For example, it's not clear that the proposed fix would work on big-endian architectures (the concern being that the four bytes of memory that were clobbered would actually contain the 32 bits of the attribute value, while the 32 bits of memory that store the single-precision attribute in the WPS would receive four bytes of garbage). Also, if someone were to modify the metgrid program's ext_get_dom_ti_real_scalar routine in the future with more local variables, it may be those local variables that get overwritten, rather than unused stack memory, depending on the order of variable declarations.

A simple extension of PR #155 might be to declare the temporary array with an extra element for padding, but that seems a bit hacky.
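Concretely, the padding extension would amount to something like this (a sketch of the PR #155 routine, not tested; it absorbs the stack overrun but does not address the big-endian concern above):

      real, dimension(2) :: var_arr   ! var_arr(2) is never read; it only absorbs
                                      ! the extra four bytes if WRF writes a 64-bit value
...
         call ext_ncd_get_dom_ti_real(handle, trim(var_name), &
                                         var_arr, &
                                         1, outcount, istatus)
...
      var_value = var_arr(1)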

mgduda avatar Jul 24 '20 22:07 mgduda

Hi - I just wanted to mention an issue I recently had with a double-precision gfortran WRF 4.2 build and WPS 4.2. After incorporating the above fix, metgrid appeared to work. However, cen_lat and cen_lon were constantly being reset to zero, and the corner_lats were also zeros in the met_em files, which led to wrong projections/maps in the wrfout files. I could also only run on up to 8 processors, no matter how many CPUs were available on the node. I finally had to recompile WRF as a single-precision build. I wanted to 1) see if anyone else has had this particular issue with the gfortran double-precision build, and 2) make folks aware that while metgrid will run with a center lat and lon of 0, it might not actually be working properly.

slekaw avatar Sep 23 '20 22:09 slekaw