netcdf-c
support for HDF5 dimension scales with null dataspace
Please see Unidata/netCDF4-python#1226. I cannot give you a reproducible example in C; I only know Python (and data formats ;).
FYI, I've suggested the same change at h5netcdf (which, being pure Python, I can actually help with!) in h5netcdf/h5netcdf#205. The two compatibility requirements I suggested implementing on a dimension scale with null dataspace are:
- It must be attached, so that a size can be inferred from the attached dataset(s).
- All dataset axes to which it is attached must be of equal length.
I'm fairly sure the bug (causing Python to crash) I reported over in netcdf4-python is not due to a library mismatch. I rewrote danger.h5 with the same h5py code from my original report, but this time with h5py built against the conda-installed libhdf5 that comes with the libnetcdf package (4.8.1, h42ceab0_1).
The same assertion also causes ncdump to abort:
$ ncdump danger.h5
ncdump: /opt/conda/conda-bld/libnetcdf_1642425793781/work/libhdf5/hdf5open.c:1299: get_scale_info: Assertion `ndims' failed.
Aborted (core dumped)
It would be nice to know if you can reproduce.
I've created a C version of the problem. It might gain more traction here in netcdf-c :grin:
#include "hdf5.h"
#define H5FILE_NAME "test_dim_scale_null.h5"
#define X "x"
#define Y "y"
#define N1 3
int
main(void)
{
    hid_t   file, dataset[2];       /* file, dataset handles */
    hid_t   nullType, floatType;    /* datatypes */
    hid_t   nullSpace, floatSpace;  /* dataspaces */
    hsize_t dims[1];                /* dataset dimensions */
    herr_t  status;
    int     i;
    float   array[N1];

    /* create h5 file */
    file = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* initialize handles */
    floatType = H5Tcopy(H5T_NATIVE_FLOAT);
    H5Tset_precision(floatType, 32);
    dims[0] = N1;
    floatSpace = H5Screate_simple(1, dims, NULL);

    /* initialize array */
    for (i = 0; i < N1; i++)
        array[i] = i;

    dataset[0] = H5Dcreate(file, Y, floatType, floatSpace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    status = H5Dwrite(dataset[0], H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL,
                      H5P_DEFAULT, array);

    /* create dimension scale */
    nullType = H5Tcopy(H5T_NATIVE_INT);
    // this breaks netcdf-c (ncdump)
    nullSpace = H5Screate(H5S_NULL);
    // this works as expected
    //nullSpace = H5Screate_simple(1, dims, NULL);
    dataset[1] = H5Dcreate(file, X, nullType, nullSpace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* add dimension scale extensions */
    H5DSset_scale(dataset[1], X);

    /* add REFERENCE_LIST attribute */
    H5DSattach_scale(dataset[0], dataset[1], 0);

    /* Close/release resources. */
    for (i = 0; i < 2; i++)
        H5Dclose(dataset[i]);
    H5Sclose(floatSpace);
    H5Tclose(floatType);
    H5Sclose(nullSpace);
    H5Tclose(nullType);
    H5Fclose(file);

    return 0;
}
So, if I understand @itcarroll correctly, the proposal is to allow HDF5 Datasets with H5S_NULL dataspace to be handled as netCDF4-dimensions.
I do not see a problem with point 1. But I'm a bit confused by point 2. Why should all dataset axes have to be of the same size? Couldn't that just be handled as an unlimited dimension, since the size is inferred from the attached datasets' sizes anyway?
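For what it's worth, once an attached dataset is in hand, picking up the size would be straightforward. A rough, untested sketch (a real reader like ncdump would locate the attached datasets through the scale's REFERENCE_LIST attribute rather than being handed one, but the extent lookup is the same):
#include "hdf5.h"
#include "hdf5_hl.h"

/* Sketch: if `scale` is attached to dimension `idx` of `dset`, return
 * that dimension's extent, otherwise 0. Untested, for illustration only. */
static hsize_t
size_from_attached(hid_t dset, hid_t scale, unsigned int idx)
{
    hsize_t dims[H5S_MAX_RANK];
    hsize_t size = 0;

    if (H5DSis_attached(dset, scale, idx) > 0) {
        hid_t space = H5Dget_space(dset);
        if (H5Sget_simple_extent_dims(space, dims, NULL) > (int)idx)
            size = dims[idx];
        H5Sclose(space);
    }
    return size;
}
Point 2 would then just mean verifying that every attached axis reports the same value, unless the dimension is treated as unlimited as suggested above.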
Thank you, @kmuehlbauer! In the repro C code, I think you are maybe missing an #include "hdf5_hl.h"? But that's not for me to guess!
So, if I understand @itcarroll correctly, the proposal is to allow HDF5 Datasets with H5S_NULL dataspace to be handled as netCDF4-dimensions.
If you meant to say "Dimension Scales" instead of "Datasets", then you understand it correctly. I am not proposing to infer a dimension scale from a dataset that has not been set as one.
I hadn't thought about reading a null dataspace as a simple dataspace with unlimited size. If it makes the logic easier, it makes sense to me. My bigger concern is whether the dimension scale is considered a variable or not. That is, the above h5 file would read in CDL as:
netcdf test_dim_scale_null {
dimensions:
    x = 3 ;
variables:
    float y(x) ;
data:
    y = 0, 1, 2 ;
}
Or as
netcdf test_dim_scale_null {
dimensions:
    x = UNLIMITED ; // (3 currently)
variables:
    float y(x) ;
data:
    y = 0, 1, 2 ;
}
So, if I understand @itcarroll correctly, the proposal is to allow HDF5 Datasets with H5S_NULL dataspace to be handled as netCDF4-dimensions.
If you meant to say "Dimension Scales" instead of "Datasets", then you understand it correctly. I am not proposing to infer a dimension scale from a dataset that has not been set as one.
Yes, that's what I meant, an HDF5 Dataset with H5S_NULL dataspace which is made a Dimension Scale. Sorry for being unclear.
ping @WardF @DennisHeimbigner
Thanks for the ping @dopplershift, I'm looking at this now. To clarify my thinking as I read through this: as mentioned in the netcdf4-python issue, it was never the intention that libnetcdf be able to read arbitrary hdf5 files; however, the failure state should not be an unhandled exception, if at all possible. The question here isn't necessarily 'get netCDF to interpret the file correctly' but rather 'prevent a crash as previously described'.
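Concretely, I would expect the fix to amount to checking the dataspace class before relying on ndims; a rough sketch of the shape of that check (illustrative only, not the actual code in libhdf5/hdf5open.c):
#include "hdf5.h"

/* Sketch: report whether a dimension scale has a null dataspace so the
 * caller can branch on it instead of asserting that ndims is nonzero.
 * (Illustrative only; the name and placement are made up.) */
static int
scale_has_null_space(hid_t scale_id)
{
    hid_t space = H5Dget_space(scale_id);
    H5S_class_t cls = H5Sget_simple_extent_type(space); /* H5S_SCALAR, H5S_SIMPLE, or H5S_NULL */
    H5Sclose(space);
    return cls == H5S_NULL;
}
Whether such a scale should then get a size inferred from the attached datasets, or simply be rejected with a clear error, is a separate question.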
I am on board with the assertion that the observed crash isn't how this should be handled, and I'm on board with what's being described here; I'll need to think about it a little bit more, as I've only just switched over from trying to track down race conditions in some of our testing. @DennisHeimbigner, any thoughts on this, or an alternative approach we might take to handling this more gracefully?
Pinging @DennisHeimbigner; any thoughts on this?
While we wait, have you had any more thoughts, @WardF? Not crashing is, like you say, all libnetcdf is really required to do about this bug. But I am under the impression that you don't think interpreting null dataspaces on dimension scales in some meaningful way is too much to ask. I can't think of any interpretations beyond what @kmuehlbauer and I have outlined, but I'm interested to hear if there are others.
I am still trying to figure this out. But as an aside, is there an actual use case for a Null Dataspace?
nullSpace = H5Screate(H5S_NULL);
The use case I have run into (and that brought me to raise this issue) is model output written in HDF5, where H5DSattach_scale was used to attach a dimension scale to multiple datasets. The modeler thereby indicated that the datasets share this dimension, did not bloat the output with unneeded data for the dimension scale, and, I guess, avoided depending on (or did not know about?) netCDF's capability for representing shared dimensions. But I'm just repeating what I already tried to describe in the parent/python issue.
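Schematically, adapting @kmuehlbauer's repro above (made-up names, and untested by me), the writer's side of that pattern is just one empty scale attached to several variables:
#include "hdf5.h"
#include "hdf5_hl.h"

/* Sketch of the pattern described above: a "time" dimension scale with a
 * null dataspace shared by two variables, so no coordinate values are
 * stored for the dimension itself. File and variable names are made up. */
int
main(void)
{
    hsize_t dims[1] = {3};
    float   data[3] = {0.0f, 1.0f, 2.0f};

    hid_t file  = H5Fcreate("shared_null_scale.h5", H5F_ACC_TRUNC,
                            H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);

    /* two data variables of the same length */
    hid_t temperature = H5Dcreate(file, "temperature", H5T_NATIVE_FLOAT, space,
                                  H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t pressure = H5Dcreate(file, "pressure", H5T_NATIVE_FLOAT, space,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(temperature, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
    H5Dwrite(pressure, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* one shared dimension scale with a null dataspace: no coordinate values,
       only a size implied by whatever it is attached to */
    hid_t nullspace = H5Screate(H5S_NULL);
    hid_t time_scale = H5Dcreate(file, "time", H5T_NATIVE_INT, nullspace,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5DSset_scale(time_scale, "time");
    H5DSattach_scale(temperature, time_scale, 0);
    H5DSattach_scale(pressure, time_scale, 0);

    H5Dclose(time_scale);
    H5Dclose(pressure);
    H5Dclose(temperature);
    H5Sclose(nullspace);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
Presumably ncdump trips over a file like this in exactly the same way as the repro, since what matters is only that the scale's dataspace is null.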
Googling around a little reveals that another use case "in the wild" is storing a missing scalar value (e.g. for string or integer attributes) in a way that is less ambiguous than a fill value.