
Performance issues with small chunks

Open cofinoa opened this issue 4 years ago • 18 comments

We are facing performance issues when accessing metadata, e.g. the values of the time variable, because of the number of I/O read operations required to access all the chunks.

In particular, the time coordinate variable is created with a chunk size of 1, requiring one chunk per time value. Therefore, if the netCDF-4 file contains many time steps (for 6-hourly or 3-hourly data, more than 10k), the netCDF-4 library has to look up and read each chunk individually (i.e. 8 bytes per chunk).
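
For a rough illustration (assuming 8-byte doubles and a hypothetical 3-hourly, 100-year series of about 292,000 steps): with a chunk size of 1 the library must locate and read roughly 292,000 separate 8-byte chunks just to reconstruct the time vector, whereas a chunk size of 512 would cut that to about 571 reads of roughly 4 KB each.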

A better explanation of this pitfall can be found in [1]:

Chunks are too small: There is a certain amount of overhead associated with finding chunks. When chunks are made smaller, there are more of them in the dataset. When performing I/O on a dataset, if there are many chunks in the selection, it will take extra time to look up each chunk. In addition, since the chunks are stored independently, more chunks results in more I/O operations, further compounding the issue. The extra metadata needed to locate the chunks also causes the file size to increase as chunks are made smaller. Making chunks larger results in fewer chunk lookups, smaller file size, and fewer I/O operations in most cases.

This relates to: #99, #100, #164

[1] https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/

cofinoa avatar May 06 '20 11:05 cofinoa

@cofinoa You indicated in one of the related postings that in netCDF3 making larger chunks for the time coordinate means that it can't be declared "unlimited". In netCDF4 is that also true or can it be declared "unlimited" and be made into bigger chunks? Thanks.

taylor13 avatar May 06 '20 14:05 taylor13

@mauzey1 is there a preset chunking value set in the code somewhere? I recall going over this in some detail many years ago, but a quick search of the repo for "chunk" doesn't turn up any defaults, at least from my viewing

durack1 avatar May 06 '20 18:05 durack1

@taylor13 to mitigate the problem in netCDF-3, the only solution is to not make the time dimension unlimited.

In netCDF-4/HDF5 you can select different chunk sizes, making the chunk size larger for the time coordinate variable while keeping a chunk size of 1 for the principal variable.
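
A minimal sketch of what that looks like with the netCDF-C API (file name and variable names are illustrative, error checking omitted): chunking is set per variable with nc_def_var_chunking, so two variables sharing the same unlimited dimension can use different chunk sizes.

#include <netcdf.h>

int main(void) {
    int ncid, time_dimid, dimids[1], time_varid, par_varid;
    size_t time_chunk[1] = {512};  /* large chunks for the coordinate */
    size_t par_chunk[1]  = {1};    /* chunk size 1 for the principal variable */

    /* "chunks_demo.nc" is just an example file name */
    nc_create("chunks_demo.nc", NC_NETCDF4, &ncid);
    nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dimid);
    dimids[0] = time_dimid;
    nc_def_var(ncid, "time", NC_DOUBLE, 1, dimids, &time_varid);
    nc_def_var(ncid, "par",  NC_DOUBLE, 1, dimids, &par_varid);

    /* Chunking is a per-variable property, so the coordinate and the
       principal variable can be chunked differently even though they
       share the same unlimited dimension. */
    nc_def_var_chunking(ncid, time_varid, NC_CHUNKED, time_chunk);
    nc_def_var_chunking(ncid, par_varid,  NC_CHUNKED, par_chunk);

    nc_enddef(ncid);
    nc_close(ncid);
    return 0;
}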

cofinoa avatar May 06 '20 20:05 cofinoa

@durack1 and @mauzey1 the PR #100 merged a change that imposes a chunk size of 1 on the time coordinate.

https://github.com/PCMDI/cmor/pull/100/commits/fc738dfa35e978721ae8ebb70a3627b82861961f

cofinoa avatar May 06 '20 20:05 cofinoa

@cofinoa - In netCDF4/HDF5, if you want a chunk size larger than 1 for an unlimited time dimension, do you have to pass multiple time-slices (equal to or larger than the chunk size) to be written in a single call to the netCDF library? If so, then I would say we shouldn't change the default from 1, because many people write their files one time slice at a time (i.e., they write a single time coordinate value and a corresponding data field that applies to that single time slice).

taylor13 avatar May 06 '20 20:05 taylor13

@cofinoa we tried to optimize the deflation, shuffling and chunking settings for the best performance vs file sizes. It is a difficult balancing act, as the only way to squeeze the best performance out of the output formats is to know both 1) the data that you're writing and 2) the use of this data once written, before the file is created. We focused more on deflation (to minimize file sizes) rather than chunking (reading written data), as no default for chunking was defined in Balaji et al., 2018

Some of the history about this can be found in https://github.com/PCMDI/cmor/issues/135#issuecomment-282114427, #164, #403. Long story short, we opted to prioritize file size first, while selecting a chunking default that provided reasonable read performance for most use cases we anticipated.

If you have a better suggestion as to how these should be set, e.g. by deploying an algorithm to assess the data being written, that would be a useful update.

I note there are some comments about the version of the netcdf library playing a role in slow read speeds, see https://github.com/Unidata/netcdf-c/issues/489

This ref was also an interesting find https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf_4_chunking_performance_results

durack1 avatar May 06 '20 22:05 durack1

@taylor13 with respect to:

In netCDF4/HDF5, if you want a chunk size larger than 1 for an unlimited time dimension, do you have pass multiple time-slices (equal or larger than the chunk size) to be written in a single call to the netCDF library?

No. The logical size of the unlimited dimension will increase independently of the chunk size.

@durack1, about:

shuffling and chunking settings for the best performance vs file sizes. It is a difficult balancing act, as the only way to squeeze the best performance for output formats is to know both the 1) data that you're writing and 2) the use of this data once written before the file is created. We focused more on deflation (to minimize file sizes) rather than chunking (reading written data) as no default for chunking was defined in Balaji et al., 2018

I agree, and I'm not proposing to modify the chunking properties (size, deflate, shuffling, ...) of the principal netCDF variable (e.g. tas). Those performance and size optimization analyses focus on accessing (reading/writing) the actual data (the principal variable). The performance problem I'm raising is about exploring the netCDF metadata and coordinates, which is affected by the chunking/storage strategy used for them, and that strategy is independent of the strategy used for the principal variable. Issue #164 just mentions setting the chunk size equal to 1, but no performance impact was considered; that is what I'm proposing to fix.

To support my point, I have defined a netCDF-4/HDF5 file with just one unlimited dimension and 2 variables with 2 different chunk sizes:

netcdf example {
    dimensions:
        time = UNLIMITED ; // (2 currently)
    variables:
        double time(time) ;
            time:_ChunkSizes = 10 ;
        double par(time) ;
            par:_ChunkSizes = 1 ;
    data:
        time = 1, 2 ;
        par = 1, 4 ;
}

The par variable is the principal variable and time is the coordinate variable; both use time as the unlimited dimension, currently of size 2, but they use different chunk sizes: 1 and 10, respectively.

You can generate the actual netCDF file from the above CDL:

$ ncgen -7 example.cdl

and compile this simplistic (no error control, ...) program, which adds a value to each variable along the unlimited dimension every time it is executed:

#include <netcdf.h>

int main() {
    int  ncid, time_dimid, time_varid, par_varid;
    size_t time_len, pos[1];
    double value;

    nc_open("example.nc", NC_WRITE, &ncid);

    /* Current length of the unlimited dimension */
    nc_inq_dimid(ncid, "time", &time_dimid);
    nc_inq_dimlen(ncid, time_dimid, &time_len);

    /* Append one new record just past the current end */
    pos[0] = time_len;

    value = (double) time_len * 2;
    nc_inq_varid(ncid, "time", &time_varid);
    nc_put_var1_double(ncid, time_varid, pos, &value);

    value = value * 2;
    nc_inq_varid(ncid, "par", &par_varid);
    nc_put_var1_double(ncid, par_varid, pos, &value);

    nc_close(ncid);
    return 0;
}

If you execute it:

$ ./addOneValue

The content of the netCDF file will then be:

netcdf example {
    dimensions:
        time = UNLIMITED ; // (3 currently)
    variables:
        double time(time) ;
            time:_ChunkSizes = 10 ;
        double par(time) ;
            par:_ChunkSizes = 1 ;
    data:
        time = 1, 2, 4 ;
        par = 1, 4, 8 ;
}

With respect to the Unidata/netcdf-c#489 issue: it mentions performance issues with metadata, but those relate to the number of netCDF entities themselves (variables, attributes, dimensions) and to the library's strategy for caching them when a netCDF file is opened.

Hope this helps. Let me know if you need more info.

cofinoa avatar May 07 '20 09:05 cofinoa

@cofinoa in https://github.com/PCMDI/cmor/issues/601#issuecomment-625134068 above there was no obvious next step regarding chunking coordinate variables. Have I missed something? As noted in #164, this is currently set to 1; what is your proposal (and what is the performance improvement with it)?

durack1 avatar May 07 '20 13:05 durack1

thank you @cofinoa for providing all this good background and information and bringing to our attention the performance issue in reading time-coordinates only.

If we can write individual time-slices and their associated time-coordinate value one at a time to a file (i.e., in separate calls to the nc "write" function), then I agree that a vector of coordinate values should probably never be "chunked", i.e., the entire vector of coordinate values should be written as a single chunk. I wouldn't think changing the default for chunking of coordinates would be that difficult, and it would apply to the "unlimited" time coordinate as well as other "limited" coordinates.

It appears no changes would be needed for the chunking of the data array itself.

Please let us know if this would be satisfactory.

taylor13 avatar May 07 '20 14:05 taylor13

@taylor13, yes, the data array (principal variable) is not affected. Its chunking strategy is a different discussion.

@durack1 my proposal is to define a chunk size that balances the size issues (#164) and performance. The performance issue is explained in the issue description with an excerpt from the HDF5 documentation about the cost of small chunk sizes.

Currently, the netcdf-c library defines a DEFAULT_CHUNK_SIZE of 4 MB for the general case, but for unlimited 1D variables it uses a DEFAULT_1D_UNLIM_SIZE of 4 KB. See [1].
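
For an 8-byte double coordinate, a 4 KB chunk holds 4096 / 8 = 512 values, which is where the 512 below comes from.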

Then, for the time coordinate variable, the chunk size can be 512, with _DeflateLevel=1 to mitigate the space wasted (4 KB at most) in partially filled chunks (issues #164 and #99):

netcdf example {
    dimensions:
        time = UNLIMITED ; // (2 currently)
    variables:
        double time(time) ;
            time:_ChunkSizes = 512 ;
            time:_DeflateLevel = 1 ;
        double par(time) ;
            par:_ChunkSizes = 1 ;
    data:
        time = 1, 2 ;
        par = 1, 4 ;
}

This will reduce chunk lookups and I/O operations by a factor of up to 512 (see [2]).
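
A minimal, self-contained sketch of how these proposed settings could be applied through the netCDF-C API when the time variable is defined (the file name is illustrative, error checking is omitted, and this is not the actual CMOR code path):

#include <netcdf.h>

int main(void) {
    int ncid, time_dimid, time_varid;
    size_t time_chunk[1] = {512};   /* 512 doubles = 4 KB per chunk */

    /* "proposal_demo.nc" is just an example file name */
    nc_create("proposal_demo.nc", NC_NETCDF4, &ncid);
    nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dimid);
    nc_def_var(ncid, "time", NC_DOUBLE, 1, &time_dimid, &time_varid);

    /* Proposed settings for the time coordinate: chunks of 512 values,
       deflate level 1 (no shuffle) to limit the space wasted by a
       partially filled last chunk. */
    nc_def_var_chunking(ncid, time_varid, NC_CHUNKED, time_chunk);
    nc_def_var_deflate(ncid, time_varid, 0, 1, 1);

    nc_enddef(ncid);
    nc_close(ncid);
    return 0;
}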

[1] https://github.com/Unidata/netcdf-c/blob/15e1bbbd43e5deede72c34ad0674083c7805b6bd/libhdf5/hdf5var.c#L191-L227
[2] https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/

cofinoa avatar May 07 '20 18:05 cofinoa

@cofinoa this issue has been stale for ~4 years, so I will close it. If there are additional tweaks that make sense, please comment and reopen

durack1 avatar Apr 07 '24 16:04 durack1

Perhaps, the suggested changes should be implemented prior to closing?

taylor13 avatar Apr 08 '24 17:04 taylor13

@durack1 as you pointed out, it has been stale for a long period, but I don't know whether it has been considered for the next "release" of the archiving specifications for data producers, or what the status of its implementation is (as @taylor13 suggests).

cofinoa avatar Apr 08 '24 18:04 cofinoa

@cofinoa to be honest, your suggestions are probably better directed at updating defaults for the netcdf-c library, as CMOR is a downstream user of this.

If there are some obvious defaults that could be updated in CMOR to optimize file sizes and file/variable access, then this would be useful to incorporate.

Reading the above, it is not obvious to me what is required to fully address the issue - if you wanted to submit a PR for consideration this would be the fastest path to a solution.

As I noted, feel free to reopen if you wanted to submit a PR

durack1 avatar Apr 08 '24 18:04 durack1

@durack1 I have opened PR #733 with what I believe is the fix that should be applied to CMOR.

The issue is not with the netCDF-C library; the issue is with CMOR itself, where the assumption of having unlimited dimensions enforces chunking A) with size 1 on the unlimited dimension and B) with the same chunk size for all netCDF variables that share the unlimited dimension in the same file. This assumption is right for the netCDF-3 data and storage model, but no longer for the netCDF-4 data and storage model.

@taylor13 and @durack1 I would also like to suggest introducing a recommendation on this issue for data producers when they start to encode data for the next CMIP7, but I don't know where the appropriate forum is: https://pcmdi.llnl.gov/CMIP6/Guide/modelers.html#7-archivingpublishing-output

cofinoa avatar Apr 09 '24 14:04 cofinoa

@durack1 I can't re-open this issue, can you re-open it for me?

cofinoa avatar Apr 09 '24 14:04 cofinoa

@cofinoa thanks for PR #733, we'll pull that in and see if there are any impacts across the test suite and some usage file sizes, and merge it into the planned 3.9.0 release next month if everything checks out

durack1 avatar Apr 09 '24 14:04 durack1

#733 merges the changes, but we need to add a test to ensure that we're a) not breaking anything, and b) not causing performance issues for "standard" datasets - for 3.9

durack1 avatar Apr 29 '24 16:04 durack1