netcdf-c icon indicating copy to clipboard operation
netcdf-c copied to clipboard

Current issues renaming dimensions and coordinates in netCDF4 files

Open czender opened this issue 8 years ago • 86 comments

Old issues with renaming dimensions and coordinates in netCDF4 files are beginning to affect CMIP6 processing so here's a reminder. All recent library versions are affected. See here for a chronology. Create a netCDF4 file with a coordinate:

netcdf bug_rnm {
       dimensions:
            lon = 4 ;
       variables:
           float lon(lon) ;
       data:
          lon = 0, 90, 180, 270 ; 
}

Remember to re-create the file after each of the three tests below. Renaming both dimension and coordinate together works. Yay! This surprises me, given the history recounted above:

	 ncrename -d lon,longitude -v lon,longitude ~/bug_rnm.nc # works

Renaming dimension then coordinate separately fails:

       ncrename -d lon,longitude ~/bug_rnm.nc
       ncrename -v lon,longitude ~/bug_rnm.nc # borken "HDF error"

Renaming coordinate then dimension separately fails:

       ncrename -v lon,longitude ~/bug_rnm.nc
       ncrename -d lon,longitude ~/bug_rnm.nc # borken "nc4_reform_coord_var: Assertion `dim_datasetid > 0' failed."

czender avatar Nov 07 '17 03:11 czender

Thanks @czender we were flooded with other PR's and I'm trying to manage those combined with the other open issues, we want to get the next bugfix release out shortly, so trying to address the issues you have open right now (in addition to working on some of the PR's Dennis has for compression, etc).

On Mon, Nov 6, 2017 at 8:52 PM, Charlie Zender [email protected] wrote:

Old issues with renaming dimensions and coordinates in netCDF4 files are beginning to affect CMIP6 processing so here's a reminder. All recent library versions are affected. See here http://nco.sf.net/nco.html#bug_nc4_rename for a chronology. Create a netCDF4 file with a coordinate:

netcdf bug_rnm { dimensions: lon = 4 ; variables: float lon(lon) ; data: lon = 0, 90, 180, 270 ; }

Remember to re-create the file after each of the three tests below. Renaming both dimension and coordinate together works. Yay! This surprises me, given the history recounted above:

ncrename -d lon,longitude -v lon,longitude ~/bug_rnm.nc # works

Renaming dimension then coordinate separately fails:

   ncrename -d lon,longitude ~/bug_rnm.nc
   ncrename -v lon,longitude ~/bug_rnm.nc # borken "HDF error"

Renaming dimension then coordinate separately fails:

   ncrename -v lon,longitude ~/bug_rnm.nc
   ncrename -d lon,longitude ~/bug_rnm.nc # borken "nc4_reform_coord_var: Assertion `dim_datasetid > 0' failed."

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/597, or mute the thread https://github.com/notifications/unsubscribe-auth/AEH-UmNZUGFrWoYswzAw4jyP-JQVCt1Cks5sz9PzgaJpZM4QUOl- .

WardF avatar Nov 07 '17 18:11 WardF

Howdy @czender !

I found a test written by Quincey that deals with renaming, nc_test4/tst_rename.c.

I have add to the existing tests, and reproduced your error in C code. Here's the test:

fprintf(stderr,"*** Test Charlie's test for renaming...");
 {
#define CHARLIE_TEST_FILE "nc_rename_coord_dim.nc"
#define DIM1_NAME "lon"
#define VAR1_NAME "lon"
#define DIM2_NAME "longitude"
#define VAR2_NAME "longitude"
#define DIM1_LEN 4
#define NDIM1 1
    int ncid, dimid, varid;
    float data[DIM1_LEN] = {0, 90.0, 180.0, 270.0};
    
    if (nc_create(CHARLIE_TEST_FILE, NC_NETCDF4, &ncid)) ERR;
    if (nc_def_dim(ncid, DIM1_NAME, DIM1_LEN, &dimid)) ERR;
    if (nc_def_var(ncid, VAR1_NAME, NC_FLOAT, NDIM1, &dimid, &varid)) ERR;
    if (nc_enddef(ncid)) ERR;
    if (nc_put_var_float(ncid, varid, data)) ERR;
    if (nc_close(ncid)) ERR;

    /* Reopen the file to check. */
    if (check_charlies_file(CHARLIE_TEST_FILE, DIM1_NAME, VAR1_NAME)) ERR;

    /* Open the file and rename the dimension. */
    if (nc_open(CHARLIE_TEST_FILE, NC_WRITE, &ncid)) ERR;
    if (nc_redef(ncid)) ERR;
    if (nc_rename_dim(ncid, 0, DIM2_NAME)) ERR;
    if (nc_enddef(ncid)) ERR;
    if (nc_close(ncid)) ERR;

    /* Reopen the file to check. */
    if (check_charlies_file(CHARLIE_TEST_FILE, DIM2_NAME, VAR1_NAME)) ERR;

    /* Open the file and rename the variable. */
    if (nc_open(CHARLIE_TEST_FILE, NC_WRITE, &ncid)) ERR;
    if (nc_redef(ncid)) ERR;
    if (nc_rename_var(ncid, 0, VAR2_NAME)) ERR;
    if (nc_enddef(ncid)) ERR;
    if (nc_close(ncid)) ERR;

    /* Reopen the file to check. */
    if (check_charlies_file(CHARLIE_TEST_FILE, VAR2_NAME, VAR1_NAME)) ERR;

This fails like this:

*** Test Charlie's test for renaming...HDF5-DIAG: Error detected in HDF5 (1.10.1) thread 0:
  #000: H5Gdeprec.c line 459 in H5Gmove(): couldn't move link
    major: Links
    minor: Unable to initialize object
  #001: H5Gdeprec.c line 541 in H5G_move(): unable to move link
    major: Links
    minor: Can't move object
  #002: H5L.c line 2763 in H5L_move(): unable to find link
    major: Symbol table
    minor: Object not found
  #003: H5Gtraverse.c line 867 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #004: H5Gtraverse.c line 639 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #005: H5L.c line 2623 in H5L_move_cb(): unable to follow symbolic link
    major: Symbol table
    minor: Object not found
  #006: H5Gtraverse.c line 867 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #007: H5Gtraverse.c line 639 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #008: H5L.c line 2484 in H5L_move_dest_cb(): an object with that name already exists
    major: Symbol table
    minor: Object not found
Sorry! Unexpected result, tst_rename.c, line: 301

I will take a look and see what is going on here...

edhartnett avatar Nov 07 '17 21:11 edhartnett

OK, the problem can be seen from the h5dump:

HDF5 "nc_rename_coord_dim.nc" {
GROUP "/" {
   ATTRIBUTE "_NCProperties" {
      DATATYPE  H5T_STRING {
         STRSIZE 67;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "version=1|netcdflibversion=4.5.1-development|hdf5libversion=1.10.1"
      }
   }
   DATASET "lon" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): 0, 90, 180, 270
      }
      ATTRIBUTE "DIMENSION_LIST" {
         DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): (DATASET 659 /longitude )
         }
      }
      ATTRIBUTE "_Netcdf4Dimid" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SCALAR
         DATA {
         (0): 0
         }
      }
   }
   DATASET "longitude" {
      DATATYPE  H5T_IEEE_F32BE
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): 0, 0, 0, 0
      }
      ATTRIBUTE "CLASS" {
         DATATYPE  H5T_STRING {
            STRSIZE 16;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "DIMENSION_SCALE"
         }
      }
      ATTRIBUTE "NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 64;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "This is a netCDF dimension but not a netCDF variable.         4"
         }
      }
      ATTRIBUTE "REFERENCE_LIST" {
         DATATYPE  H5T_COMPOUND {
            H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
            H5T_STD_I32LE "dimension";
         }
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): {
               DATASET 331 /lon ,
               0
            }
         }
      }
   }
}
}

When the code hits this line in NC4_rename_var() it fails:

  /* Change the HDF5 file, if this var has already been created
      there. */
   if (var->created)
   {
      if (H5Gmove(grp->hdf_grpid, var->name, name) < 0)
         BAIL(NC_EHDFERR);
   }

So the code is trying to rename the dataset with H5Gmove, but can't because there is already a dataset with the new name. It was created to support the dimension name change - its one of those dreaded secret datasets to support HDF5 dimension scales.

I guess the way to proceed is to check before the H5Gmove to see if there is another var of the desired name which secretly exists, and then delete it, and then make the newly renamed dataset into a dimension scale.

I will try to take a stab at that...

edhartnett avatar Nov 07 '17 22:11 edhartnett

Thanks, @edhartnett. It sounds like an intricate fix, so I hope you #persist.

czender avatar Nov 08 '17 16:11 czender

Thanks, All!

cameronsmith1 avatar Nov 10 '17 18:11 cameronsmith1

This is happening on this branch: https://github.com/NetCDF-World-Domination-Council/netcdf-c/tree/ejh_zender_coordinate_var_rename

So far I have just added the new test that demonstrates the problem.

edhartnett avatar Nov 12 '17 20:11 edhartnett

@czender I am working on this now as part of my current branch. I have been sneaking up on this by writing tests for lots of simpler code in nc4var.c. Now I have done the easy stuff and am moving on to the rename bugs.

To help us understand the urgency of this fix, if any, can you elaborate on how it is being used in CMIP6? And what CMIP6 is?

Are we talking world-shatteringly important climate research here? Or what?

And why the heck are you renaming vars anyway? Can't you scientists just make up your minds? What is the use-case that is driving this? Can you help me understand the importance of this feature?

Thanks!

PS I am going to fix it anyway, obviously. You would have to physically restrain me to stop me from fixing it. I am just curious why you need it.

edhartnett avatar Dec 23 '17 02:12 edhartnett

CMIP6 is the large international climate model intercomparison for the next IPCC report. The CMIP6 organizers provided data in netcdf files for all of the different modeling groups to use. However, each climate model typically looks for a particular name for each input variable it needs, which varies from model to model (eg, sea_surface_temperature, SST, surfTemp), and there are hundreds of these, especially when we have to reprocess our output to match the CMIP6 required variable names. Hence the need to rename the variables.

The good news is that I have already worked around this bug (it was too important to wait), but it will be good to get it fixed for the future.

cameronsmith1 avatar Dec 23 '17 02:12 cameronsmith1

I actually use the NCO utilities, so this is a question for @czender .

cameronsmith1 avatar Dec 23 '17 03:12 cameronsmith1

Ed,In retrospect, I sure wih you guys had not been seduced by dimension scales :-)

DennisHeimbigner avatar Dec 23 '17 03:12 DennisHeimbigner

Mea culpa, mea culpa, mea maxima culpa.

edhartnett avatar Dec 23 '17 11:12 edhartnett

@cameronsmith1 thanks for the explanation.

What workaround are you using?

edhartnett avatar Dec 23 '17 12:12 edhartnett

I ended up using the nco utilities of @czender in a different way, so I am not sure how to express it in native HDF commands.

cameronsmith1 avatar Dec 24 '17 09:12 cameronsmith1

I receive a few bug reports a year from people trying to rename coordinates in netCDF4 files with ncrename. CMIP does this on an industrial scale. Most people end up using the multistep workaround(s) documented in the NCO manual here.

czender avatar Dec 24 '17 19:12 czender

OK I have a fix for this problem, as well as several other rename problems. I am still working on another one I found.

There is a bunch of state code in this which is not helpful. For example there are these values in NC_VAR_INFO_T:

   nc_bool_t is_new_var;        /* True if variable is newly created */
   nc_bool_t was_coord_var;     /* True if variable was a coordinate var, but either the dim or var has been renamed */
   nc_bool_t became_coord_var;  /* True if variable _became_ a coordinate var, because either the dim or var has been renamed */

These are set at various times in nc_rename_var() and nc_rename_dim().

Then in the sync that is called during nc_enddef() these values are consulted and various actions taken.

Problem is, it's not hard to come up with combinations of rename calls which confuse the crap out of the code. There are so many possible ways that users can name and rename vars and dims, with enddefs/redefs thrown in or not, It gets very, very confusing.

What would work better would be to assume a statelessness. Whenever a rename command is complete, the file on disk and in metadata must be completely adjusted and in a complete state. There must be nothing left over of the rename process for enddef to do. All work must be done in nc_rename_var() and nc_rename_dim().

If that were true, then the user could call them in any wild order, and it would not matter to us.

I am not attempting to take these out right now. I am patching the existing code so that it works for the test cases that we've identified, including Charlie's. But I suspect there are many more bugs out there with the existing code that will only be eliminated by removing these state variables and dealing with everything in the rename.

edhartnett avatar Dec 31 '17 20:12 edhartnett

OK, these fixes are up as PR #755. It's branch ejh_fill_values on my netcdf clone: [email protected]:NetCDF-World-Domination-Council/netcdf-c.git

@czender if you could grab the branch and give it a try, that would be great. I don't think I've fixed every possible rename problem, but see if your common cases now work OK.

Any problems would be best expressed by adding a test to nc_test4/tst_rename.c. Simply copy an existing test case, then change it to make it fail, and send me the code or post it here.

edhartnett avatar Jan 01 '18 12:01 edhartnett

Thanks for working on this. At the web-page that @czender mentioned, there is a long list of problems over the years with renaming variables, which supports your conclusion that the current implementation is fragile against the many possible renaming situations. Interestingly, @czender notes that a robust solution is to convert from netcdf4 to netcdf3, then rename and convert back to netcdf4. Does the implementation of rename for netcdf3 offer a useful template?

cameronsmith1 avatar Jan 01 '18 20:01 cameronsmith1

Thanks, @edhartnett. I will try to test this soon and report the results here.

czender avatar Jan 03 '18 22:01 czender

@czender you should actually use branch currently up as PR #763.

This PR contains just the rename fixes from the other PR, now closed without merging.

In tst_rename.c there are some commented-out lines which show an unfixed bug:

      /* Open the file and rename the variable. This will remove
         * the dimscale-only dataset "longitude" and rename the
         * extisting dataset "lon" to "longitude". Variable
         * "longitude" will become a coordinate var. */
        /* if (nc_open(CHARLIE_TEST_FILE, NC_WRITE, &ncid)) ERR; */
        /* if (nc_redef(ncid)) ERR; */
        /* if (nc_rename_var(ncid, 0, LONGITUDE)) ERR; */
        /* if (nc_enddef(ncid)) ERR; */
        /* if (nc_close(ncid)) ERR; */

Still working on this one. ;-)

edhartnett avatar Jan 05 '18 13:01 edhartnett

I got as far as re-building and testing the latest netCDF master four days ago. It produced a total of 7 new and unexpected failures with the NCO regression test, half with ncks and half with ncpdq. All new failures involved netCDF4 files. I have not yet figured out the exact cause. Tracking that down has higher priority than testing the rename fix for now.

czender avatar Jan 09 '18 19:01 czender

Well it might be a good idea for us to use the NCO tests as another set of testing for netCDF...

On Tue, Jan 9, 2018 at 12:27 PM, Charlie Zender [email protected] wrote:

I got as far as re-building and testing the latest netCDF master four days ago. It produced a total of 7 new and unexpected failures with the NCO regression test, half with ncks and half with ncpdq. All new failures involved netCDF4 files. I have not yet figured out the exact cause. Tracking that down has higher priority than testing the rename fix for now.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/597#issuecomment-356388116, or mute the thread https://github.com/notifications/unsubscribe-auth/AEUr3EJohcZQc3k1PnO5TtXCfr2bvTYuks5tI72igaJpZM4QUOl- .

edhartnett avatar Jan 09 '18 19:01 edhartnett

NCO is extremely widely used in the climate community for working with netcdf files, so including them in your test suite would be wonderful.

cameronsmith1 avatar Jan 09 '18 19:01 cameronsmith1

We already do as part of our testing. Unfortunately there are two sets of tests: ‘make check’ which will return non-zero when there is a failure, and ‘make test’ which requires manual intervention to catch failures. We run the firmer but not the latter.

WardF avatar Jan 09 '18 22:01 WardF

@czender are you seeing failures in make check or make test? I am not seeing failures with make check. I will check with make test after my poster session.

WardF avatar Jan 09 '18 22:01 WardF

As you suspect it is with "make test", which is always what I mean by "the NCO regression test". For historical reasons, NCO's "make check" is basically just a check that the NCO C++ code library links and runs a single, simple program.

czender avatar Jan 09 '18 22:01 czender

After a little investigation it turns out that this was a problem with "make test" relying on a file, hdn.nc, that autoconf apparently does not re-create, it is not a netCDF or NCO code issue. Phew. Now I can move on to testing the rename patch...

czender avatar Jan 10 '18 00:01 czender

Excellent! I will hold off, let me know if any regressions pop up. Thank you @czender!

WardF avatar Jan 10 '18 19:01 WardF

@czender I've just put up another PR (#830) which I believe fixes the test case that you gave me for this issue.

Not sure if all possible rename issues are solved. But if you could try branch ejh_rename_is_here from https://github.com/NetCDF-World-Domination-Council/netcdf-c, and let me know if it now passes your rename tests, that would be great.

Of course, if it does not, I will need the next test case where it fails. ;-)

edhartnett avatar Feb 01 '18 12:02 edhartnett

Thank you for your edforts. It continues to fail on the test previously reported.

zender@skyglow:~$ ncks --lbr
Linked to netCDF library version 4.6.0.1 compiled Feb  1 2018 06:23:57
zender@skyglow:~$ ncrename -O -d lev,z -d lat,y -d lon,x ~/nco/data/in_grp.nc ~/foo.nc
zender@skyglow:~$ ncks -H --trd -s %d -v one ~/foo.nc
ncks: INFO nco_bld_dmn_ids_trv() reports variable </g5/one_dmn_rec_var> with duplicate dimensions
ncks: ERROR netCDF file with duplicate dimension IDs detected. Please use netCDF version at least 4.3.0. NB: Simultaneously renaming multiple dimensions with ncrename can trigger this bug with netCDF versions up to 4.6.0.1 (current as of 20180201).
ncks: INFO reports group information
/: 11 subgroups, 7 dimensions, 1 record dimensions, 5 attributes, 18 variables
 #3 dimension: 'coord'(2)
 #4 record dimension: 'time'(10)
 #5 dimension: 'vrt_nbr'(2)
 #6 dimension: 'gds_crd'(8)
 #7 dimension: 'y'(2)
 #8 dimension: 'x'(4)
 #9 dimension: 'z'(3)
/g1: 2 subgroups, 0 dimensions, 0 record dimensions, 1 attributes, 8 variables
/g1/g1g1: 0 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 5 variables
/g1/g1:g2: 0 subgroups, 0 dimensions, 0 record dimensions, 2 attributes, 0 variables
/g2: 0 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 3 variables
/g4: 0 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 4 variables
/g5: 0 subgroups, 2 dimensions, 2 record dimensions, 0 attributes, 3 variables
 #7 record dimension: 'time51'(2)
 #8 record dimension: 'time52'(10)
/g6: 1 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 2 variables
/g6/g6g1: 0 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 1 variables
/g7: 1 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 6 variables
/g7/g7g1: 0 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 1 variables
/g8: 1 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 0 variables
/g8/g8g1: 1 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 0 variables
/g8/g8g1/g8g1g1: 1 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 0 variables
/g8/g8g1/g8g1g1/g8g1g1g1: 0 subgroups, 0 dimensions, 0 record dimensions, 2 attributes, 0 variables
/g9: 1 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 0 variables
/g9/g9g1: 1 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 1 variables
/g9/g9g1/g9g1g1: 1 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 0 variables
/g9/g9g1/g9g1g1/g9g1g1g1: 1 subgroups, 0 dimensions, 0 record dimensions, 1 attributes, 0 variables
/g9/g9g1/g9g1g1/g9g1g1g1/g9g1g1g1g1: 1 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 0 variables
/g9/g9g1/g9g1g1/g9g1g1g1/g9g1g1g1g1/g9g1g1g1g1g1: 1 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 0 variables
/g9/g9g1/g9g1g1/g9g1g1g1/g9g1g1g1g1/g9g1g1g1g1g1/g9g1g1g1g1g1g1: 0 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 4 variables
/g10: 0 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 2 variables
/g11: 0 subgroups, 0 dimensions, 0 record dimensions, 0 attributes, 17 variables
/g12: 0 subgroups, 0 dimensions, 0 record dimensions, 20 attributes, 0 variables

ncks: INFO reports variable information
Segmentation fault (core dumped)

czender avatar Feb 01 '18 15:02 czender

Can you build with --enable-logging, and then call nc_set_log_level(5) before running those commands?

I have tried to build NCO here but could not get it to work.

edhartnett avatar Feb 01 '18 16:02 edhartnett