Slicing very slow in 4.7.4 (reported)
Mirroring an issue from the python interface, which hooks into the underlying C library.
- https://github.com/Unidata/netcdf4-python/issues/1018
In a nutshell, array slicing in 4.7.4 has been reported as being orders of magnitude slower than in 4.7.3. Creating an issue here in the C library to make it easier to keep track of.
I have confirmed that the simple Python test program in Unidata/netcdf4-python#1018 runs two orders of magnitude slower (using netcdf4-python master and HDF5 1.10.6) when netcdf-c 4.7.4 is linked, as opposed to netcdf-c 4.7.3. It should be fairly easy to create a simple C program to demonstrate this.
simple C program to test (using this file):
```c
#include <stdio.h>
#include <string.h>
#include <netcdf.h>

#define FILE_NAME "woa18_A5B7_s16_04.nc"
#define NDIMS 4
#define NLAT 100
#define NLON 100
#define NLVL 102
#define VAR_NAME "s_an"
#define ERR(e) {printf("Error: %s\n", nc_strerror(e)); return 2;}

int
main()
{
    int ncid, varid, lat, lon, retval;
    size_t start[NDIMS], count[NDIMS];
    float data_out[NLVL];

    if ((retval = nc_open(FILE_NAME, NC_NOWRITE, &ncid)))
        ERR(retval);
    if ((retval = nc_inq_varid(ncid, VAR_NAME, &varid)))
        ERR(retval);

    /* Read one full depth column (NLVL values) per call. */
    count[0] = 1;
    count[1] = NLVL;
    count[2] = 1;
    count[3] = 1;
    start[0] = 0;
    start[1] = 0;
    start[2] = 0;
    start[3] = 0;

    /* 100 x 100 = 10000 small hyperslab reads. */
    for (lat = 0; lat < NLAT; lat++)
        for (lon = 0; lon < NLON; lon++)
        {
            start[2] = lat;
            start[3] = lon;
            if ((retval = nc_get_vara_float(ncid, varid, start,
                                            count, &data_out[0])))
                ERR(retval);
        }

    if ((retval = nc_close(ncid)))
        ERR(retval);
    return 0;
}
```
with netcdf-c 4.7.3:
```
% time ./a.out
1.119u 0.096s 0:01.81 66.2% 0+0k 0+0io 875pf+0w
```
with netcdf-c 4.7.4, I killed it after waiting 10 minutes.
gprof output of non-zero percent time:
```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 45.45      0.05     0.05    10000     5.00     8.96  NC4_get_vars
 27.27      0.08     0.03    20252     1.48     1.48  nclistget
  9.09      0.09     0.01    10000     1.00     1.00  check_for_vara
  9.09      0.10     0.01                             main
  9.09      0.11     0.01                             nc4_find_nc4_grp
```
Thanks @jswhit, taking a look at this now to see if I can sort out what happened, and when.
I've got a test script set up that should give me an avenue for using git bisect to figure out where the change occurred.
Good. I did a diff comparison of 4.7.3 and 4.7.4, but could not see any obvious problem, so bisect is our only hope.
I believe that you (Dennis) completely replaced the vara/vars code. Wasn't that between the 4.7.3 and 4.7.4 releases? Or was that already in 4.7.3?
OK, I've found the problem. In commit a4465da9 we see a change in the cache sizes as defined in the CMake CMakeLists.txt file. I presume that if I change the tests to use the autotools-based build system with its defaults, we'd see the same behavior (I'll check this).
The variables being set specifically are CHUNK_CACHE_SIZE and CHUNK_CACHE_NELEMS. The values before this commit were 4194304 and 1009 respectively.
The test/result matrix is as follows.
| Status | CHUNK_CACHE_SIZE (bytes) | CHUNK_CACHE_NELEMS |
|---|---|---|
| Fast | 4194304 | 1009 |
| Slow | 16777216 | 4133 |
| Slow | 16777216 | 1009 |
| Fast | 4194304 | 4133 |
Clearly, the increased CHUNK_CACHE_SIZE is causing the slowdown. I vaguely remember the discussion around this PR, and I can go remind myself; @edwardhartnett, any thoughts on this?
@jswhit, in the meantime: if netCDF-C 4.7.4 is built with chunk_cache_size and chunk_cache_nelems set to the v4.7.3 values at configure time, we should see this issue go away.
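As a quicker check that does not require rebuilding, the same two values can also be overridden at runtime through the public nc_get_chunk_cache()/nc_set_chunk_cache() calls before the file is opened. A minimal sketch (it reuses the file name from the test program above, and the queried defaults should initially reflect whatever the build was compiled with):

```c
#include <stdio.h>
#include <netcdf.h>

#define ERR(e) {printf("Error: %s\n", nc_strerror(e)); return 2;}

int
main()
{
    int ncid, retval;
    size_t size, nelems;
    float preemption;

    /* Report the library-wide default chunk cache settings
     * (initially the compiled-in defaults). */
    if ((retval = nc_get_chunk_cache(&size, &nelems, &preemption)))
        ERR(retval);
    printf("default chunk cache: %zu bytes, %zu slots, %.2f preemption\n",
           size, nelems, preemption);

    /* Restore the 4.7.3 values; this affects files opened afterwards. */
    if ((retval = nc_set_chunk_cache(4194304, 1009, preemption)))
        ERR(retval);

    if ((retval = nc_open("woa18_A5B7_s16_04.nc", NC_NOWRITE, &ncid)))
        ERR(retval);

    /* ... run the same nc_get_vara_float loop as in the test program above ... */

    if ((retval = nc_close(ncid)))
        ERR(retval);
    return 0;
}
```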
Well crap! Who would have thought?
And I did do some performance testing, so perhaps this change speeds things up in some cases but slows them down in others.
OK, firstly we can check this by having @jswhit build the library with these two values set at configure time. They can be set with --with-chunk-cache-size= and --with-chunk-cache-nelems=. So try the original two numbers and see if that solves the problem.
I will ask the HDF5 support group about this, and perhaps they can provide some guidance.
Also I kind of wonder whether using even smaller numbers might improve performance even more?
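For experimenting with other values on a single variable without touching the build at all, the per-variable cache can be adjusted with nc_set_var_chunk_cache(). A sketch of a fragment that could be dropped into the test program above, right after the nc_inq_varid() call (the specific numbers here are only illustrative):

```c
/* Illustrative values only: a 1 MiB cache with a prime number of slots
 * (the HDF5 docs suggest a prime slot count) and 0.75 preemption. */
size_t cache_size = 1048576;
size_t cache_nelems = 521;
float cache_preemption = 0.75f;

if ((retval = nc_set_var_chunk_cache(ncid, varid, cache_size,
                                     cache_nelems, cache_preemption)))
    ERR(retval);
```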
Wow, this is really counter-intuitive. Normally one expects larger caches to be faster. Does anyone have any insight into why the larger cache is slower?
We're now seeing multiple reports; I expect we will be reverting the related changes to restore the old behavior, while we figure out why this is occurring and how to best take a second run at it.