ocgis

Add more metadata to the output file...

Open ekluzek opened this issue 4 years ago • 34 comments

The mapping files from OCGIS are pretty bare bones and need more meta-data added to them.

I'd like to see the same sort of metadata that are on the ESMF RegridWeights mapping files. Such as...

// global attributes:
		:title = "ESMF Offline Regridding Weight Generator" ;
		:normalization = "destarea" ;
		:map_method = "Conservative remapping" ;
		:ESMF_regrid_method = "First-order Conservative" ;
		:conventions = "NCAR-CSM" ;
		:domain_a = "/glade/p/cesm/cseg/inputdata/lnd/clm2/mappingdata/grids/SCRIPgrid_0.25x0.25_MODIS_c170321.nc" ;
		:domain_b = "/glade/p/cesm/cseg/inputdata/lnd/clm2/mappingdata/grids/0.9x1.25_c110307.nc" ;
		:grid_file_src = "/glade/p/cesm/cseg/inputdata/lnd/clm2/mappingdata/grids/SCRIPgrid_0.25x0.25_MODIS_c170321.nc" ;
		:grid_file_dst = "/glade/p/cesm/cseg/inputdata/lnd/clm2/mappingdata/grids/0.9x1.25_c110307.nc" ;
		:CVS_revision = "6.3.0r" ;

We also add the hostname run on, the user-name of the user doing it, and the "history", so the date done and the exact command that was launched.

ekluzek avatar Sep 30 '19 16:09 ekluzek

Some of this functionality may need to go into ESMF's concurrent weight file write routine. Regardless, this should be a straightforward improvement.

bekozi avatar Sep 30 '19 19:09 bekozi

@rokuingh Added filemode option to ESMPy (https://github.com/esmf-org/esmf/tree/ESMPy-filemode). I've started integrating this into the chunked regridding.

Ping @slevisconsulting

bekozi avatar Mar 23 '20 13:03 bekozi

Feature branch: https://github.com/NCPP/ocgis/tree/i506-esmpy-filemode

bekozi avatar Mar 23 '20 15:03 bekozi

This is implemented but will require a beta snapshot of ESMF to work. The weight file output is equivalent to the standard ESMF weight file with auxiliary variables and attributes.

@ekluzek I wanted to follow-up on:

> We also add the hostname run on, the user-name of the user doing it, and the "history", so the date done and the exact command that was launched.

We could add arbitrary attributes to the output weight file using a JSON string as an argument to ocli. Is this something that sounds appealing?

bekozi avatar Mar 24 '20 15:03 bekozi

@bekozi hmmm. I'm not sure there's much reason to add an arbitrary string as a global attribute to the file, since I can easily add it afterwards with NCO. But what about adding those specific things: hostname, user-name, and date? The CF convention has "history" as a standard global attribute; it typically holds the date/time and the command line of the program or script that created the file. Username and hostname could be additional attributes tacked on as well. I think all of these are pretty standard things that are useful for documenting the file and how it was created. I've found this kind of documentation extremely helpful when you go back later and try to figure out how a file was created. This is something I have to do over and over again. Some documentation in the global attributes makes it easy, but without it the task can be difficult to impossible.

ekluzek avatar Mar 25 '20 20:03 ekluzek

@ekluzek Got it. Let me cook something up and get back to you with an example.

bekozi avatar Mar 25 '20 21:03 bekozi

@ekluzek I added the three attributes. They look like:

created_by_user     :: 'benkoziol'
created_on_hostname :: 'system76-laptop'
created_at_datetime :: '2020-03-30 09:42:24.216163'

The names/values can be adjusted fairly easily. I think the user and hostname retrieval are pretty portable, but it may take some fine tuning on some platforms.
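
For illustration, here is a minimal sketch of how these three values could be gathered with the Python standard library. It is not the ocgis implementation: the helper name `creation_attributes` is hypothetical, and the fallback to "unknown" is an assumption about how a platform without a resolvable login name might be handled.

```python
import datetime
import getpass
import socket


def creation_attributes(now=None):
    """Return provenance attributes shaped like the ones shown above.

    Hypothetical helper for illustration only; not part of ocgis.
    """
    now = now or datetime.datetime.now()
    try:
        user = getpass.getuser()
    except Exception:
        # getpass.getuser() can fail when no login name can be determined
        # (for example in some containers), so fall back to a placeholder.
        user = "unknown"
    return {
        "created_by_user": user,
        "created_on_hostname": socket.gethostname(),
        "created_at_datetime": str(now),
    }


if __name__ == "__main__":
    for name, value in creation_attributes().items():
        print(f"{name} :: {value!r}")
```

The timestamp is passed in optionally so the result is reproducible in tests; in normal use it defaults to the current local time.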

bekozi avatar Mar 30 '20 14:03 bekozi

@bekozi that's great, that gives me the kind of metadata that I've found to be really useful. One other thing I've found useful is the version of the program or script that created the file. For something checked out under git, I store the output of "git describe".
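
As a rough sketch of the "git describe" idea, the version string could be captured from Python as below. The helper name `git_describe`, the extra flags (`--tags --always --dirty`), and the "unknown" fallback are illustrative choices, not something taken from ocgis or CTSM.

```python
import subprocess


def git_describe(repo_dir=".", default="unknown"):
    """Return the output of `git describe` for repo_dir, or default on failure.

    Hypothetical helper for illustration only.
    """
    try:
        result = subprocess.run(
            ["git", "describe", "--tags", "--always", "--dirty"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, repo_dir does not exist,
        # or repo_dir is not inside a git checkout.
        return default
    return result.stdout.strip() or default


if __name__ == "__main__":
    print(git_describe())
```

The returned string can then be stored as a global attribute alongside the user and hostname.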

And just to point you to the CF conventions for attributes. I don't know if you are trying to follow any specific conventions, but that's a good one to follow. The history attribute is useful because it records both the creation date and the program that produced the file. If someone manipulates the file again, that manipulation gets added to the history. So history is a good attribute to follow the convention for.

Here's the CF conventions...

http://cfconventions.org/cf-conventions/cf-conventions.html#attribute-appendix

ekluzek avatar Mar 30 '20 17:03 ekluzek

@ekluzek In general, these weight files do not follow a convention (I guess it's a SCRIP weight file but no real convention around that). I can add the CF history attribute to the output weight files no problem. Is this where you'd prefer to have the creation information as well? I guess I'm asking if you'd prefer to have the "created" attributes in addition to the "history" attribute.

bekozi avatar Mar 31 '20 13:03 bekozi

The creation date is best off in the history attribute, because you can then figure out any follow-on history. If you have creation_date as a separate attribute, it's not clear which operation it applies to when there is a string of manipulations on the file. But the user and hostname don't necessarily lend themselves to going into "history". So I've put them as separate attributes and then just need to know that they go with the original operation on the file, rather than any subsequent ones.
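
To make the "follow-on history" point concrete, here is a small sketch of how each operation could add its own timestamped line to an existing history string. The helper name `append_history` and the one-entry-per-line layout are assumptions for illustration; tools differ on whether new entries go first or last.

```python
import datetime


def append_history(existing, entry, when=None):
    """Return the history string with a new timestamped line added at the end.

    Hypothetical helper for illustration only.
    """
    when = when or datetime.datetime.now()
    line = f"{when}: {entry}"
    if not existing:
        # First operation on the file: the history is just this line.
        return line
    return f"{existing}\n{line}"


if __name__ == "__main__":
    history = append_history("", "Created by ocgis")
    history = append_history(history, "Subset with a follow-on tool")
    print(history)
```

Because every line carries its own timestamp, the creation date is simply the timestamp of the first entry, and later manipulations stay distinguishable from it.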

ekluzek avatar Mar 31 '20 16:03 ekluzek

Makes sense to me. I'll take this opportunity to format the ocli command line arguments into the history string. Will be back with an example for review.

bekozi avatar Apr 01 '20 13:04 bekozi

@ekluzek How does this look?

// global attributes:
		:created_by_user = "benkoziol" ;
		:created_on_hostname = "system76-laptop" ;
		:history = "2020-04-01 10:02:49.028146: Created by ocgis (v2.1.1) and ESMF (v8.1.0 beta snapshot) with CLI command: ocli chunked-rwg --weightfilemode BASIC --loglvl INFO --no_verbose False --spatial_subset_path /tmp/ocgis_test_p5i8p9n3/spatial_subset.nc --no_ignore_degenerate False --wd /tmp/ocgis_test_p5i8p9n3/chunks --esmf_regrid_method BILINEAR --esmf_dst_type GRIDSPEC --esmf_src_type GRIDSPEC --weight /tmp/ocgis_test_p5i8p9n3/weights.nc --destination /tmp/ocgis_test_p5i8p9n3/destination.nc --source /tmp/ocgis_test_p5i8p9n3/source.nc" 
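
A history string of this shape could be assembled roughly as follows. This is only a sketch: `creation_history` is a hypothetical helper, the example path is made up, and the real ocli code may format its arguments differently. `shlex.join` (Python 3.8+) is used so that arguments containing spaces stay copy-pasteable.

```python
import datetime
import shlex


def creation_history(argv, versions, when=None):
    """Build a creation entry for the history attribute.

    argv: the command line as a list of strings.
    versions: (name, version) pairs for the tools involved.
    Hypothetical helper for illustration only.
    """
    when = when or datetime.datetime.now()
    tools = " and ".join(f"{name} ({version})" for name, version in versions)
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return f"{when}: Created by {tools} with CLI command: {shlex.join(argv)}"


if __name__ == "__main__":
    print(
        creation_history(
            ["ocli", "chunked-rwg", "--weight", "/tmp/my weights.nc"],
            [("ocgis", "v2.1.1"), ("ESMF", "v8.1.0 beta snapshot")],
        )
    )
```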

bekozi avatar Apr 01 '20 15:04 bekozi

Perfect. Works for me.

ekluzek avatar Apr 01 '20 23:04 ekluzek

Great! I'll work on getting this and the esmf branch merged.

bekozi avatar Apr 03 '20 12:04 bekozi

For reference, the associated esmpy PR is: https://github.com/esmf-org/esmf/pull/4

bekozi avatar Apr 08 '20 15:04 bekozi

@slevisconsulting - I'm reopening this to address the issue related to writing auxiliary coordinate variables for high resolution grids. I'm planning to enable the appropriate flags in an ESMF branch to confirm this will fix the problem. I'll then add the appropriate parameters to ESMPy and ocgis.

bekozi avatar Aug 17 '20 17:08 bekozi

Thank you @bekozi

For my benefit, I'm linking this issue to my PR here.

slevis-lmwg avatar Aug 17 '20 18:08 slevis-lmwg

@rokuingh is adding the 64-bit offset flag to ESMPy. He also identified an issue where the file types were not passed to ESMF routines correctly. I'll bring the offset flag into ocli once it's ready in ESMPy. I tested statically setting the flags for the higher resolution UGRID->SCRIP case using a reproducer from @slevisconsulting, and the operation works with auxiliary coordinates.

bekozi avatar Aug 20 '20 19:08 bekozi

New concern relating to auxiliary data in the context of CTSM's surface data generation (with a piece of very good news):

Running ./mksurfdata_map to generate a surface dataset appears to work now! However, the corresponding log file shows zeros for all variable areas at both the input (raw data) resolutions and the output (surface data) resolution. This is because the auxiliary variables area_a and area_b contain all zeros. This makes CTSM's error-checking unusable.
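
A quick way to flag this condition is sketched below. It assumes the area variables use the standard ESMF weight file names `area_a` and `area_b`, and that their values have already been read into plain sequences with whatever netCDF reader is at hand; the helper name `all_zero_variables` is hypothetical.

```python
def all_zero_variables(variables, names=("area_a", "area_b")):
    """Return the names of variables that are present, non-empty, and all zero.

    variables: mapping from variable name to a flat sequence of numbers.
    Hypothetical helper for illustration only.
    """
    flagged = []
    for name in names:
        values = variables.get(name)
        if values is None or len(values) == 0:
            # Missing or empty variables are a different problem; skip them.
            continue
        if all(value == 0 for value in values):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    sample = {"area_a": [0.0, 0.0, 0.0], "area_b": [0.0, 0.0]}
    print(all_zero_variables(sample))
```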

slevis-lmwg avatar Oct 12 '20 20:10 slevis-lmwg

@rokuingh ESMPy's auxiliary variable support will need to be modified to include areas when writing weight files. Is this possible within the current implementation of WITHAUX?

bekozi avatar Oct 12 '20 20:10 bekozi

I am no expert on ESMF IO, but it looks like the routine that is responsible for writing the weight files does indeed handle the areas (and fractions). The routine consists of a couple thousand lines of Fortran. A quick pass through the code seems to imply that areas are only written when using the conservative method.

rokuingh avatar Oct 13 '20 23:10 rokuingh

> it looks like the routine that is responsible for writing the weight files does indeed handle the areas (and fractions). The routine consists of a couple thousand lines of Fortran. A quick pass through the code seems to imply that areas are only written when using the conservative method.

Thank you, @rokuingh. @bekozi, if by "conservative method" we mean the option --esmf_regrid_method CONSERVE, then this is what we're doing. So the problem remains that the area variables area_a and area_b are all zeros in all the weight files that I've looked at.

slevis-lmwg avatar Oct 14 '20 18:10 slevis-lmwg

I will debug this further later this week. Could one of you please send me the aforementioned reproducer?

rokuingh avatar Oct 14 '20 19:10 rokuingh

I think the trouble is that the areas are difficult to connect to ESMF_OutputScripWeightFile the way esmpy is calling it. Another solution here is to put a Python wrapper on ESMF_RegridWeightGenFile. It does not necessarily look difficult to wrap, but it does look time consuming. Another option would be to call the CLI RWG to create the weights for each chunk combination and merge them afterwards. What do you think @rokuingh?
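
The "merge them afterwards" step could look roughly like the sketch below. It assumes each chunk's weights come as `row` (destination index), `col` (source index), and `S` (weight) with 1-based indices local to the chunk, plus lookup lists from local to global indices, and that no destination cell is covered by more than one chunk. The helper name and the `dst_global`/`src_global` keys are hypothetical, not ocgis or ESMF names.

```python
def merge_chunk_weights(chunks):
    """Concatenate per-chunk sparse weights into global row/col/S lists.

    Each chunk is a dict with:
      row, col: 1-based indices local to the chunk
      S: the weights
      dst_global, src_global: lists mapping local position to global 1-based index
    Hypothetical helper for illustration only.
    """
    rows, cols, weights = [], [], []
    for chunk in chunks:
        for r, c, s in zip(chunk["row"], chunk["col"], chunk["S"]):
            # Translate local 1-based indices to global 1-based indices.
            rows.append(chunk["dst_global"][r - 1])
            cols.append(chunk["src_global"][c - 1])
            weights.append(s)
    return rows, cols, weights


if __name__ == "__main__":
    chunk_a = {"row": [1, 2], "col": [1, 1], "S": [0.5, 1.0],
               "dst_global": [1, 2], "src_global": [10]}
    chunk_b = {"row": [1], "col": [2], "S": [0.25],
               "dst_global": [3], "src_global": [10, 11]}
    print(merge_chunk_weights([chunk_a, chunk_b]))
```

The auxiliary variables (areas, fractions, coordinates) would need a similar local-to-global scatter, which this sketch leaves out.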

bekozi avatar Oct 14 '20 20:10 bekozi

> I will debug this further later this week. Could one of you please send me the aforementioned reproducer?

qsub /glade/work/slevis/ocgis_work/no_subset_20200825_reproducer.sh

slevis-lmwg avatar Oct 14 '20 21:10 slevis-lmwg

@rokuingh cc: @bekozi is there an update regarding the aforementioned debugging? This issue blocks the use of ocgis in CTSM's mkmapdata tool.

slevis-lmwg avatar Jan 22 '21 00:01 slevis-lmwg

@slevisconsulting Sorry for the long wait, but I do have a good idea of how to proceed with this. I am working on the upcoming ESMF 8.1.0 release right now, but I have just been approved to work on this next. I will plan to have a snapshot for you before the end of the month.

rokuingh avatar Feb 04 '21 19:02 rokuingh

@rokuingh thank you for prioritizing this issue, I appreciate your help.

slevis-lmwg avatar Feb 04 '21 20:02 slevis-lmwg

@slevisconsulting I have been experimenting with this reproducer on Cheyenne, but I have not yet had a successful run even with a walltime of 1 hour. Would you mind running this again on your end to make sure nothing has changed with the machine or environment that could explain the issues I am having? In the meantime I will move forward with adding the area variables to the weight files.

rokuingh avatar Apr 22 '21 01:04 rokuingh

@rokuingh I have not run this script in a while (likely since Oct 2020). Thank you for the heads-up about it failing. I will look into it soon.

Meanwhile, thank you for moving fwd with adding the area variables to the weight files.

slevis-lmwg avatar Apr 23 '21 19:04 slevis-lmwg