Potential bug in image driver handling of parameter file, and/or tonic
- Version: VIC.5.0.1 develop branch
- Compiler: gcc 5.4.0
- Operating system: ubuntu 16.04.9
- Model settings: image driver
- Problem summary:
When using parameter files that I have generated from scratch, various states and fluxes get weird values that fairly quickly lead to
NaNs. This doesn't happen with parameter files that I've generated with tonic from the Livneh et al. (2015) ASCII soil/veg/snowband parameter files.
I believe that the problem is caused by a combination of how I created my parameter files and some potentially missing/bad logic in VIC (and tonic). But when I try to fix my parameter files to make them more amenable to VIC, VIC still has trouble. I am not sure whether it's because I haven't succeeded in fixing my parameter files, or whether there's some other problem occurring. I could use some advice and/or help brainstorming for solutions here...
More detail:
My new parameter files generated from scratch differ from the tonic-generated files in a few ways:
- for "vegetation library" parameters and root zone parameters,
tonicsets them to "valid" values (the same constant value) in all classes in all grid cells, regardless of whether those classes are actually present in the grid cell (i,e,Cv[v] >= 0). In my parameters, I only defined them whereCv[v] > 0, and set them tomissingwhere the vegetation class is not present (Cv[v] == 0). toniccreates an extra bare soil tile in cells where the vegetated tile areas sum up to less than 1 (Cv_sum < 1); but does NOT incrementNvegto reflect the extra tile (i.e.,Nveg < Nnonzero, whereNnonzerois the number of tiles for whichCv[v] > 0). In my parameters, the tile areas always reflect exactly what is present in the land cover map;Cv_sumalways== 1(except for numerical error); andNveg == Nnonzero.tonicparameter files define all floating point variables as typedouble. I defined mine asfloatto save space, since I didn't need the precision.
In vic_run(), I saw that bad values (they looked a lot like the missing value I'd given them in my parameter file) of veg parameters such as trunk_ratio were being passed to CalcAerodynamic(). This could certainly explain why the tonic parameter files didn't cause problems (since even invalid tiles have valid parameter values in tonic-generated parameter files). To verify that missing values from invalid tiles were getting passed, I replaced those values in the parameter file with an arbitrary value (again, only where Cv[v] == 0), and these values showed up being passed to CalcAerodynamic().
As for why these garbage values were being used, I hypothesized that VIC was somehow mistaking zero-area tiles for valid tiles. Looking at the code, in get_global_domain(), VIC assigns nv_active to Nveg+1, where Nveg is the value from the parameter file; then in vic_alloc() it allocates nv_active tiles to veg_con and calls make_all_vars() with nv_active-1, which allocates nv_active-1+1 tiles to veg_var, cell, etc. But then, vic_init() does its mapping between the all-classes-stored structure veg_con_map and the only-nonzero-classes structure veg_con by checking whether Cv > 0. Nveg is NOT involved in this mapping. My theory at the moment is that my single-precision Cv values might not look like 0 to VIC when VIC is interpreting them as doubles, thus causing the mapping to map invalid tiles to the veg_con structure.
However, I've tried changing Cv in my parameter file to double, and tried rounding all values to 6 digits of precision to ensure that 0s look like 0s. But it doesn't fix the problem. So, if I've not properly zeroed the 0s this way, I could definitely use help getting that to work correctly (my python skills are still not at the level of the UW lab).
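For reference, here's roughly what I'm doing to force the zeros and the double cast (a minimal sketch, assuming xarray; the file names are placeholders, and "Cv" is the per-class area-fraction variable name my parameter file uses):

```python
import xarray as xr

# Placeholder file names; "Cv" is the per-class area-fraction variable in my parameter file.
params = xr.open_dataset("vic_params_orig.nc")

# Promote Cv to double precision and force near-zero area fractions to exact zero,
# so that VIC's (Cv > 0) test can't be tripped up by single-precision round-off.
cv = params["Cv"].astype("float64")
params["Cv"] = xr.where(cv <= 1e-6, 0.0, cv)  # threshold is arbitrary; NaN fill values are left untouched

params.to_netcdf("vic_params_fixed.nc")
```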
But I don't know for sure if that's what happened - maybe I did succeed in zeroing out the 0s (it looked that way in a python session), and maybe something else is causing the problem in VIC.
Regardless, I can also see that VIC and tonic might benefit from improved code to counter these problems. Tonic's inconsistent incrementing of Nveg seems dangerous. In VIC, we could easily implement a check on whether the total number of tiles with Cv[v] > 0 actually equals Nveg - if not, it could give the user a helpful error message so that the user knows exactly what the problem is.
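In the meantime, here's the kind of pre-flight check I have in mind, run on the parameter file itself rather than inside VIC (a rough sketch assuming xarray; "Nveg", "Cv", and the "veg_class" dimension are the names used in image-driver parameter files, but double-check against your own file):

```python
import xarray as xr

# Placeholder file name.
params = xr.open_dataset("vic_params_fixed.nc")

nveg = params["Nveg"]                                   # number of veg tiles per cell
n_nonzero = (params["Cv"] > 0).sum(dim="veg_class")     # tiles with nonzero area per cell

# Only compare cells that are active in the domain (Nveg not missing/fill).
active = nveg.notnull()
mismatch = active & (nveg != n_nonzero)

print(f"{int(mismatch.sum())} active cells where Nveg != count(Cv > 0)")
```

Note that, per the tonic behavior described above, tonic-generated files would also get flagged in cells where the extra bare-soil tile was added without incrementing Nveg, which is kind of the point.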
Also, I think tonic's assignment of valid parameter values to cell/class combinations where the classes are not present is a bad policy, since if VIC for some other reason were to accidentally assign invalid tiles, the error would not be caught (the parameters would be physically reasonable, just wrong).
Any thoughts/advice on this would be greatly appreciated. I need to get my simulations going ASAP!
Another potential culprit has appeared: vic_run_veg_lib seems to be wrong.
Another potential culprit: parallelization. I've noticed that, if I insert print statements into anything within the loop over cells in vic_image_run() or any function below that, like vic_run(), the order of grid cells gets all jumbled up. I've attempted to run with a single processor via mpirun -n 1 but maybe I'm doing it wrong? Regardless, running with mpirun -n 1 or omitting it and just running vic_image.exe -g .... by itself both result in jumbled time order of the execution of various print statements, to the point that it becomes very difficult to interpret.
I'm not saying that it's wrong to have parallelization per se; I'm saying that if parallelization is occurring even when I tell it to use a single processor, perhaps that could be causing a mismatch between the i of veg_lib[i] (which is carried indirectly through vic_run_veg_lib) and the i of veg_con[i], such that the veg_class called for by veg_con has veg_lib properties of invalid tiles from other cells. I.e., perhaps we're asking for properties of a veg_class in the current cell, but we're presented with the veg_lib of a different cell for which that class is not present; thus we get the garbage values. This would be consistent with things working OK for the tonic-generated param files, since again, they define the complete veg_lib for all classes in all cells. But it would be getting the right answer for the wrong reason... Perhaps there's an issue with my mpich library on my machine...
Now, the primary suspect is OpenMPI (at least the version on my machine, which might not be the right version). I deleted the line in vic_image_run() immediately before the loop over grid cells, which contains a pragma instructing OpenMPI to run the for loop using multiple threads. Suddenly my simulations ran fine.
The questions now are:
- Is there something wrong with my OpenMPI? (And if so, how do I fix it to either handle the pragma correctly or tell it to ignore the pragma?)
- How easily could this happen to other users? Is it happening at UW but not being detected due to all grid cells containing redundant copies of the veg_lib?
- Regardless, is it a good idea for vic param files to define veg_lib parameters for zero-area veg tiles? (in terms of space usage, possibility for corrupt values where validation doesn't take place, etc.)
@tbohn - thanks for the report.
- For clarity, the pragma statement is for OpenMP (not OpenMPI).
- If you put the pragma statement back and set the OMP_NUM_THREADS environment variable to 1, do you still see the buggy behavior? export OMP_NUM_THREADS=1
- If you compile without OpenMP but with the pragma statement in place, do you see the buggy behavior? For that, you'd need to remove the following line from the makefile: https://github.com/UW-Hydro/VIC/blob/9fae303d62d3b76104fe99689c2b1f8c6e71cba3/vic/drivers/image/Makefile#L91
- Finally, are you still building against anaconda libraries? I'm suspect of that configuration, so it would be good to know if this is repeatable on a different set of libraries.
To answer your questions:
- Yes, export OMP_NUM_THREADS=1 fixes the problem.
- Yes, commenting out the -fopenmp line fixes the problem.
- Yes, I'm still building against the anaconda libraries.
@tbohn - how are you executing VIC? Do you get the odd behavior if you don't use MPI?
Can you run a test with:
export OMP_NUM_THREADS=8 # or some number of threads > 1
./vic_image -g your_param.txt # omit mpirun
The above test reproduces the bug.
That test is identical to one of the several ways that I ran VIC and encountered the bug (i.e., I ran VIC both with and without mpirun), except that I wasn't explicitly setting OMP_NUM_THREADS to anything. So this implies that my environment uses multiple threads by default (which makes sense: when OMP_NUM_THREADS is unset, OpenMP typically defaults to one thread per available core).
I think in order to close this issue we should add a test that checks for bit-for-bit equivalence between the single-core image driver and the OpenMP-enabled image driver. That should probably fall on me, but if someone else is interested and gets to it first, go for it.
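Something along these lines is what I have in mind for the comparison (a rough sketch, assuming xarray and placeholder output file names from one single-threaded and one multi-threaded run of the same setup):

```python
import numpy as np
import xarray as xr

# Placeholder file names: same configuration run with OMP_NUM_THREADS=1 and with >1 threads.
single = xr.open_dataset("vic_image_output_1thread.nc")
multi = xr.open_dataset("vic_image_output_8threads.nc")

for name, var in single.data_vars.items():
    if not np.issubdtype(var.dtype, np.floating):
        continue  # only compare the floating-point flux/state variables
    same = np.array_equal(var.values, multi[name].values, equal_nan=True)
    print(f"{name}: {'bit-for-bit identical' if same else 'DIFFERS'}")
```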
@jhamman or @tbohn is there any update on this issue? I am running into a similar issue with NA values in my VIC output when running with OpenMPI, VIC5.1.0, and the calibrated VIC parameter file and domain from Currier et al. 2023. I get many NAs throughout the CRB domain.
I have tried a variety of the suggestions in this and other threads and in the VIC image driver documentation page (MX_RCACHE = 2, removing the -fopenmp flag, etc.), as well as trying different forcings, etc. We did not build against any conda libraries.
I am able to run all of the VIC sample datasets without this issue.
The issue presents as either large chunks of the domain as NA immediately (screenshot: https://github.com/UW-Hydro/VIC/assets/106703387/6a89df65-931a-4d05-943a-85cff140da9b),
or NAs becoming more prevalent over time (screenshot: https://github.com/UW-Hydro/VIC/assets/106703387/7a1b4b2a-6dd6-4b19-b600-30de405b4513).
Dear dwoodson, I left academia 4 years ago and have not worked with VIC since. I can't help you with this issue. Regards, Ted Bohn
@tbohn thank you for your response!
-David
The NA values in my VIC output were caused by two issues:
- Not setting any value for OMP_NUM_THREADS before running the image driver with OpenMPI. This caused a situation where only a few cells were NA at the start of the simulation, but the number of NA cells grew over time; it seems to be a conflict between OpenMP and OpenMPI. export OMP_NUM_THREADS=1 solved this issue.
- When generating my VIC forcings, I used MetSim. However, I made my MetSim domain file using the R terra package, and doing so imparted something into the VIC forcings that caused large zones of NA cells in the VIC output, even though the forcings looked fine upon inspection and analysis. Creating the MetSim domain with Python code did not cause this issue, even though the R- and Python-generated MetSim domains were essentially identical in terms of mask and elevation. So perhaps it's something with the drivers used by the R terra package (a quick way to diff the two domain files is sketched below).
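If anyone else hits this, one way to look for what terra imparted is to diff the encodings of the two domain files (a rough sketch assuming xarray; file names are placeholders):

```python
import xarray as xr

# Placeholder file names for the R-terra-generated and Python-generated domain files.
ds_r = xr.open_dataset("domain_terra.nc", decode_cf=False)
ds_py = xr.open_dataset("domain_python.nc", decode_cf=False)

for name in sorted(set(ds_r.variables) | set(ds_py.variables)):
    if name not in ds_r.variables or name not in ds_py.variables:
        print(f"{name}: present in only one file")
        continue
    vr, vp = ds_r[name], ds_py[name]
    if vr.dtype != vp.dtype:
        print(f"{name}: dtype differs ({vr.dtype} vs {vp.dtype})")
    if vr.attrs != vp.attrs:
        print(f"{name}: attributes differ (e.g., _FillValue, missing_value, CRS)")
```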