netcdf4-python
Very slow slicing with libnetcdf 4.7.4
Array slicing with libnetcdf 4.7.4 is very slow. The same slicing with 4.7.3 is fast.
To reproduce this issue, first download this file:
(I can't generate a minimal working example (MWE) file.)
And run this code:
from netCDF4 import Dataset
import time

nf = Dataset('woa18_A5B7_s16_04.nc', 'r')
t0 = time.time()
for i in range(100):
    for j in range(100):
        q = nf['s_an'][0, :, i, j]
t = time.time() - t0
print('it took: ' + str(t))
nf.close()
With this environment:
# Name Version Build Channel
libnetcdf 4.7.3 nompi_hc957ea6_101 conda-forge
netcdf4 1.5.3 nompi_py37h0154fc0_102 conda-forge
the code runs in ~3s.
With this environment:
# Name Version Build Channel
libnetcdf 4.7.4 nompi_hc957ea6_101 conda-forge
netcdf4 1.5.3 nompi_py38h5d7d79e_103 conda-forge
the code runs for so long that I have to kill the process before it finishes.
The release notes for libnetcdf 4.7.4 are here. I don't see anything in there that would obviously affect the read speed for slices.
I have confirmed that it is very slow with 4.7.4. There are 10,000 calls from Python to the C routine nc_get_vara involved (one per (i, j) pair), so the more surprising thing to me is that it ran in only 3 seconds with version 4.7.3.
@WardF , do you know of any changes in 4.7.4 that could account for this?
Wow, not off the top of my head, but let me take a look and see if I can narrow this down. @dennisheimbigner does anything leap out at you?
I've created an issue on the C project page linking back to this issue.
There was an issue about strides > 1. Is this the case here?
Strides are all 1 here (nc_get_vara is used).
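For illustration, a minimal sketch of the distinction (the step-2 slice is hypothetical and not part of the test; it assumes the same file):

from netCDF4 import Dataset

nf = Dataset('woa18_A5B7_s16_04.nc', 'r')
q_unit = nf['s_an'][0, :, 0, 0]       # stride 1 along depth: nc_get_vara underneath
q_strided = nf['s_an'][0, ::2, 0, 0]  # stride 2: would use the strided read path instead
nf.close()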
I suspect it has something to do with the chunksizes and/or the zlib compression. If I rewrite the file in netCDF classic 64-bit format (nccopy -6), the test script runs in 6 seconds. Keeping the file netCDF-4 but converting the variable to contiguous storage (without compression) also reduces the run time from >1000 seconds to 6 seconds.
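For reference, a minimal sketch of one way to produce such a contiguous, uncompressed copy with netCDF4-python (the output file name is made up; attributes are not copied, and contiguous storage requires fixed-size dimensions and no compression):

from netCDF4 import Dataset

src = Dataset('woa18_A5B7_s16_04.nc', 'r')
dst = Dataset('woa18_contiguous.nc', 'w', format='NETCDF4')

# Contiguous variables cannot use unlimited dimensions, so create all
# dimensions with fixed sizes in the copy.
for name, dim in src.dimensions.items():
    dst.createDimension(name, len(dim))

v = src['s_an']
out = dst.createVariable('s_an', v.dtype, v.dimensions,
                         zlib=False, contiguous=True)
out[:] = v[:]
dst.close()
src.close()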
Is it possible for you to avoid the loop? If I use
q = nf['s_an'][0, :, 0:100, 0:100]
the test runs in less than 0.1 seconds.
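A sketch of the loop-free equivalent, assuming the per-point profiles are still needed: read the block once, then index the in-memory array.

from netCDF4 import Dataset

nf = Dataset('woa18_A5B7_s16_04.nc', 'r')
block = nf['s_an'][0, :, 0:100, 0:100]  # one read instead of 10,000
for i in range(100):
    for j in range(100):
        q = block[:, i, j]              # pure in-memory indexing, no file I/O
nf.close()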
I'm unsure; I made the example from a much larger script that I inherited. I'll take a look.
Yes, the loop is avoidable, but to me it seems like a short-term solution (with the caveat that I know nothing about the internal workings of netCDF).
It's a performance regression in the C library. I've posted a C version of your test program to demonstrate this at Unidata/netcdf-c#1757.
There's been some more discussion over at the issue linked above, but we believe we have found the root cause: changes to default cache-related values. These can be worked around by modifying them at configure time in netcdf-c. We will evaluate reverting to the old defaults once I have refreshed my memory of the PR that increased them in the first place.
Is this slowdown observed when reading similar data, across the board? Or is this particular file we're testing against unusually slow, but other files remain relatively fast?
My speculation would have been that the slowness is caused by whatever algorithm is used to find a match in the cache. But that should get slower with the number of cache elements, not the total size.
I'm following the discussion on the C side.
Tomorrow I'll test with a few other files.
I have tried using the Variable.set_var_chunk_cache method to reset the chunk cache for the variable before slicing it, but this has no effect. Unfortunately, there is currently no hook for nc_set_chunk_cache to reset the values globally in the Python interface. I did try adding one, and it does indeed fix the problem. The test script now looks like this:
from netCDF4 import Dataset, set_chunk_cache
import time

set_chunk_cache(4194304, 1009)
nf = Dataset('woa18_A5B7_s16_04.nc', 'r')
t0 = time.time()
for i in range(100):
    for j in range(100):
        q = nf['s_an'][0, :, i, j]
t = time.time() - t0
print('it took: ' + str(t))
nf.close()
and it runs in a few seconds with both library versions 4.7.3 and 4.7.4.
PR #1019 introduces set_chunk_cache/get_chunk_cache module functions.
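Assuming the signatures mirror the underlying nc_set_chunk_cache (size in bytes, number of cache slots, preemption), usage would look roughly like this:

from netCDF4 import set_chunk_cache, get_chunk_cache

# Restore the pre-4.7.4 cache parameters before opening any files.
set_chunk_cache(size=4194304, nelems=1009)
print(get_chunk_cache())  # (size, nelems, preemption) currently in effect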
Thanks @WardF for debugging this so quickly!
I tested three other files. The test runs equally well with both library versions for woa18_decav_t01_01.nc. There's a difference for woa18_decav_t05_04.nc. There's no difference for the gmt4 file from:
https://www.ngdc.noaa.gov/mgg/global/relief/ETOPO1/data/ice_surface/grid_registered/netcdf/
The lesson, I think, is that chunk parameters are highly data dependent. It's better for the C library to be conservative with its default values, allowing broadly acceptable (if unoptimized) access speeds, than to use parameters that work great in some circumstances but are unworkable in others. I expect we will have a maintenance release of the C library in short order that reverts these parameters, barring some other solution presenting itself.
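As an illustration of that data dependence, a diagnostic sketch (the logic and output format are my own, not part of this thread) that compares a variable's chunk footprint to the process-wide chunk cache:

from netCDF4 import Dataset, get_chunk_cache
import numpy as np

nf = Dataset('woa18_A5B7_s16_04.nc', 'r')
v = nf['s_an']
chunks = v.chunking()  # per-dimension chunk lengths, or 'contiguous'
if chunks != 'contiguous':
    chunk_bytes = int(np.prod(chunks)) * v.dtype.itemsize
    size, nelems, preemption = get_chunk_cache()
    print(f'one chunk: {chunk_bytes} bytes; cache: {size} bytes in {nelems} slots')
nf.close()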
I must confess to being confused by your last comment. The issue, I thought, was the cache parameters and not the chunksizes themselves. Correct?