Slicing very slow in 4.7.4 (reported)
Mirroring an issue from the python interface, which hooks into the underlying C library.
- https://github.com/Unidata/netcdf4-python/issues/1018
In a nutshell, array slicing in 4.7.4 has been reported as being orders of magnitude slower than in 4.7.3. Creating an issue here in the C library to make it easier to keep track of.
I have confirmed that the simple Python test program in Unidata/netcdf4-python#1018 runs two orders of magnitude slower (using netcdf4-python master and HDF5 1.10.6) when netcdf-c 4.7.4 is linked, as opposed to netcdf-c 4.7.3. It should be fairly easy to create a simple C program to demonstrate this.
simple C program to test (using this file):
```c
#include <stdio.h>
#include <string.h>
#include <netcdf.h>

#define FILE_NAME "woa18_A5B7_s16_04.nc"
#define NDIMS 4
#define NLAT 100
#define NLON 100
#define NLVL 102
#define VAR_NAME "s_an"
#define ERR(e) {printf("Error: %s\n", nc_strerror(e)); return 2;}

int
main()
{
    int ncid, varid, lat, lon, retval;
    size_t start[NDIMS], count[NDIMS];
    float data_out[NLVL];

    if ((retval = nc_open(FILE_NAME, NC_NOWRITE, &ncid)))
        ERR(retval);
    if ((retval = nc_inq_varid(ncid, VAR_NAME, &varid)))
        ERR(retval);

    /* Read one full depth column (NLVL values) per call. */
    count[0] = 1;
    count[1] = NLVL;
    count[2] = 1;
    count[3] = 1;
    start[0] = 0;
    start[1] = 0;
    start[2] = 0;
    start[3] = 0;

    /* 100 x 100 = 10000 small hyperslab reads. */
    for (lat = 0; lat < NLAT; lat++)
        for (lon = 0; lon < NLON; lon++)
        {
            start[2] = lat;
            start[3] = lon;
            if ((retval = nc_get_vara_float(ncid, varid, start,
                                            count, &data_out[0])))
                ERR(retval);
        }

    if ((retval = nc_close(ncid)))
        ERR(retval);
    return 0;
}
```
with netcdf-c 4.7.3:
```
% time ./a.out
1.119u 0.096s 0:01.81 66.2% 0+0k 0+0io 875pf+0w
```
with netcdf-c 4.7.4, I killed it after waiting 10 minutes.
gprof output of non-zero percent time:
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 45.45      0.05     0.05    10000     5.00     8.96  NC4_get_vars
 27.27      0.08     0.03    20252     1.48     1.48  nclistget
  9.09      0.09     0.01    10000     1.00     1.00  check_for_vara
  9.09      0.10     0.01                             main
  9.09      0.11     0.01                             nc4_find_nc4_grp
```
Thanks @jswhit, taking a look at this now to see if I can sort out what happened, and when.
I've got a test script set up that should give me an avenue for using git bisect to figure out where the change occurred.
Good. I did a diff comparison of 4.7.3 and 4.7.4, but could not see any obvious problem, so bisect is our only hope.
I believe that you (Dennis) completely replaced the vara/vars code. Wasn't that between the 4.7.3 and 4.7.4 releases? Or was that already in 4.7.3?
OK, I've found the problem. In commit a4465da9 we see a change in the cache sizes as defined in the CMake CMakeLists.txt file. I presume that if I change the tests to use the autotools-based build system with its defaults, we'd see the same behavior (I'll check this).
The variables being set specifically are CHUNK_CACHE_SIZE and CHUNK_CACHE_NELEMS. The values before this commit were 4194304 and 1009 respectively.
The test/result matrix is as follows.
| Status | CHUNK_CACHE_SIZE (bytes) | CHUNK_CACHE_NELEMS |
|---|---|---|
| Fast | 4194304 | 1009 |
| Slow | 16777216 | 4133 |
| Slow | 16777216 | 1009 |
| Fast | 4194304 | 4133 |
Clearly, the increased CHUNK_CACHE_SIZE is causing the slowdown. I vaguely remember the discussion around this PR, and I can go remind myself; @edwardhartnett, any thoughts on this?
@jswhit, in the meantime: if netCDF-C 4.7.4 is built with chunk_cache_size and chunk_cache_nelems set to the v4.7.3 values at configure time, we should see this issue go away.
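As a quicker check that does not require rebuilding, the same two values can also be overridden at runtime through the public nc_get_chunk_cache()/nc_set_chunk_cache() calls before the file is opened. A minimal sketch (it reuses the file name from the test program above, and the queried defaults should initially reflect whatever the build was compiled with):

```c
#include <stdio.h>
#include <netcdf.h>

#define ERR(e) {printf("Error: %s\n", nc_strerror(e)); return 2;}

int
main()
{
    int ncid, retval;
    size_t size, nelems;
    float preemption;

    /* Report the library-wide default chunk cache settings
     * (initially the compiled-in defaults). */
    if ((retval = nc_get_chunk_cache(&size, &nelems, &preemption)))
        ERR(retval);
    printf("default chunk cache: %zu bytes, %zu slots, %.2f preemption\n",
           size, nelems, preemption);

    /* Restore the 4.7.3 values; this affects files opened afterwards. */
    if ((retval = nc_set_chunk_cache(4194304, 1009, preemption)))
        ERR(retval);

    if ((retval = nc_open("woa18_A5B7_s16_04.nc", NC_NOWRITE, &ncid)))
        ERR(retval);

    /* ... run the same nc_get_vara_float loop as in the test program above ... */

    if ((retval = nc_close(ncid)))
        ERR(retval);
    return 0;
}
```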
Well crap! Who would have thought?
And I did do some performance testing, so perhaps this change speeds things up in some cases but slows them down in others.
OK, firstly we can check this by having @jswhit build the library with these two values set at configure time. They can be set with --with-chunk-cache-size= and --with-chunk-cache-nelems=. So try the original two numbers and see if that solves the problem.
I will ask the HDF5 support group about this, and perhaps they can provide some guidance.
Also I kind of wonder whether using even smaller numbers might improve performance even more?
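For experimenting with other values on a single variable without touching the build at all, the per-variable cache can be adjusted with nc_set_var_chunk_cache(). A sketch of a fragment that could be dropped into the test program above, right after the nc_inq_varid() call (the specific numbers here are only illustrative):

```c
/* Illustrative values only: a 1 MiB cache with a prime number of slots
 * (the HDF5 docs suggest a prime slot count) and 0.75 preemption. */
size_t cache_size = 1048576;
size_t cache_nelems = 521;
float cache_preemption = 0.75f;

if ((retval = nc_set_var_chunk_cache(ncid, varid, cache_size,
                                     cache_nelems, cache_preemption)))
    ERR(retval);
```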
Wow, this is really counter-intuitive. Normally one expects larger caches to be faster. Does anyone have any insight into why the larger cache is slower?
We're now seeing multiple reports; I expect we will be reverting the related changes to restore the old behavior, while we figure out why this is occurring and how to best take a second run at it.