netcdf4-python icon indicating copy to clipboard operation
netcdf4-python copied to clipboard

OPeNDAP array over a certain size is completely full of zeros and not masked

Open smwCEH opened this issue 6 years ago • 7 comments

I'm using netCDF4 version 1.4.0 with Python 3.6.5 on Windows 10 to get data from an OPeNDAP endpoint from a THREDDS Data Server.

I've noticed that above a certain size the NumPy masked array returned when slicing the netCDF variable is completely full of zeros (0.0). The mask for the array is completely empty/blank, i.e. all cells in the mask are False.

The netCDF variable has the following properties: <class 'netCDF4._netCDF4.Variable'> float64 rainfall_amount(time, y, x) _FillValue: -999.0 standard_name: rainfall_amount units: kg m-2 scale_factor: 0.1 valid_min: -20.0 valid_max: 10000.0 long_name: CEH Gridded Estimates of Areal Rainfall comment: The estimated rainfall amount (kg m-2) is equivalent to the rainfall depth (mm). It is estimated where the closest raingauge is within 100km, otherwise not estimated and value set to -2. According to the British Standards Institution (BS 7843-4:2012), UK convention is for calculations of areal rainfall to be based on total measured precipitation, where measured precipitation includes both rainfall and the water equivalent of new snow. unlimited dimensions: current shape = (365, 1251, 701) filling off

For array of moderate size (e.g. (200, 200, 200), (250, 250, 250), (300, 300, 300)) I'm able to slice the netCDF variable to create a NumPy masked array with the expected data and mask but for larger arrays (e.g. (350, 350, 350), (365, 365, 365)) I get a NumPy masked array completely full of zeros (0.0) without any masked cells!

Is this a known issue when trying to slice netCDF variables from OPeNDAP endpoints? Does it relate to the masking and scaling defined for the variable or could it be related to the length of time taken by the OPeNDAP request/response?

I'm getting the same behaviour when I attempt the same slicing of the OPeNDAP netCDF variable when using xarray and dask arrays.

Many thanks, Simon.

smwCEH avatar Jul 10 '18 11:07 smwCEH

This is not a known problem, and from your description I cannot tell if it is an issue with the server, the C library or the python module.

If it's a problem with the C library then using nccopy to extract the data would give the same problem.

If it's an issue with the masking and scaling in the python module, then setting Dataset.set_auto_maskandscale(False) would make it go away.

dask and xarray use netcdf4-python under the hood, so it's not surprising the issue shows up there also.

Could you post a test script we can use to reproduce the issue?

jswhit avatar Jul 13 '18 01:07 jswhit

Here's some code to demonstrate my issue.

It creates netCDF4 Datasets and variables from a single netCDF file located on a SAN and the same single netCDF file exposed via a THREDDS OPeNDAP endpoint. You will need to provide your own netCDF file and THREDDS endpoints to test as obvioulsy you won't be able to access the SAN netCDF I'm using.

For larger array sizes the data returned from the OPeNDAP endpoint is an 3D array of zeros (0.0) :-(

Here's my code:

import os import datetime import gc import math

import time

import netCDF4 as nc print('\n\nnetCDF4.version:\t{0}'.format(nc.version))

import numpy as np import numpy.ma as ma print('\n\nnumpy.version:\t{0}'.format(np.version))

datetime_format = '%H:%M:%S %Y-%m-%d'

NODATA_VALUE = -9999.0

SCALE_FACTOR = 0.1

def summarise_array(array): print('\t\t\t\tdata.shape:\t\t\t\t\t{0}'.format(array.shape)) print('\t\t\t\tdata.size:\t\t\t\t\t{0}'.format(array.size)) print('\t\t\t\tdata.dtype:\t\t\t\t\t{0}'.format(array.dtype)) print('\t\t\t\tdata.nbytes:\t\t\t\t{0}'.format(array.nbytes)) print('\t\t\t\tma.count(data):\t\t\t\t{0}'.format(ma.count(array))) print('\t\t\t\tma.count_masked(data):\t\t{0}'.format(ma.count_masked(array))) print('\t\t\t\tma.min(data, axis=None):\t{0}'.format(ma.min(array, axis=None))) print('\t\t\t\tma.max(data, axis=None):\t{0}'.format(ma.max(array, axis=None)))

def netcdf4_single_netcdf_file(netcdf_file, netcdf_variable, z, y, x): t0 = time.clock() try: print('\t\t\tCreating netCDF4 Dataset...') dataset = nc.Dataset(filename=netcdf_file, mode='r') print('\t\t\t\ttype(dataset):\t\t\t\t{0}'.format(type(dataset))) print('\t\t\t\tdataset.data_model:\t{0}'.format(dataset.data_model)) # print('\t\t\tGetting NumPy MaskedArray...') data = dataset.variables[netcdf_variable][z[0]:z[1], y[0]:y[1], x[0]:x[1]] print('\t\t\t\ttype(data):\t\t\t\t\t{0}'.format(type(data))) summarise_array(data) # unique_values = np.unique(data) if len(unique_values) == 1 and unique_values[0] == 0.0: print('\t\t\t\t{0} All masked array values are equal to {1}! {0}'.format('*' * 5, unique_values[0])) # print('\t\t\tCalculating array mean...') mean = ma.mean(data, axis=None) print('\t\t\t\ttype(mean):\t\t\t\t\t{0}'.format(type(mean))) if data.any(): print('\t\t\t\tma.mean(data):\t{0:>12.4f}'.format(mean)) else: print('\t\t\t\tma.mean(data):\t{0}'.format(np.nan)) # del mean del data dataset.close() del dataset success = True except Exception as e: print('Exception:\t"{0}"'.format(e)) success = False t1 = time.clock() gc.collect() return (success, t1 - t0)

def netcdf4_single_opendap_file(opendap_url, netcdf_variable, z, y, x): t0 = time.clock() try: print('\t\t\tCreating netCDF4 Dataset...') dataset = nc.Dataset(filename=opendap_url, mode='r') print('\t\t\t\ttype(dataset):\t\t\t\t{0}'.format(type(dataset))) print('\t\t\t\tdataset.data_model:\t{0}'.format(dataset.data_model)) # print('\t\t\tGetting NumPy masked array...') data = dataset.variables[netcdf_variable][z[0]:z[1], y[0]:y[1], x[0]:x[1]] print('\t\t\t\ttype(data):\t\t\t\t\t{0}'.format(type(data))) summarise_array(data) # unique_values = np.unique(data) if len(unique_values) == 1 and unique_values[0] == 0.0: print('\t\t\t\t{0} All masked array values are equal to {1}! {0}'.format('*' * 5, unique_values[0])) # print('\t\t\tCalculating array mean...') mean = ma.mean(data, axis=None) print('\t\t\t\ttype(mean):\t\t\t\t\t{0}'.format(type(mean))) if data.any(): print('\t\t\t\tma.mean(data):\t{0:>12.4f}'.format(mean)) else: print('\t\t\t\tma.mean(data):\t{0}'.format(np.nan)) # del mean del data dataset.close() del dataset success = True except Exception as e: print('Exception:\t"{0}"'.format(e)) success = False t1 = time.clock() gc.collect() return (success, t1 - t0)

if name == "main":

start = datetime.datetime.now()
print('\n\nStarted at:\t\t{0}'.format(start.strftime(datetime_format)))

netcdf_single_file = r'Z:\\eidchub\\33604ea0-c238-4488-813d-0ad9ab7c51ca\\GB\\daily\\CEH_GEAR_daily_GB_2015.nc'
print('\n\nnetcdf_single_file:\t{0}'.format(netcdf_single_file))
print('netCDF file exists:\t{0}'.format(os.path.exists(netcdf_single_file)))

opendap_single_file = 'http://gds-ldb.nerc-lancaster.ac.uk/thredds/dodsC/public-CEHGEARGBDailyRainfallDetail1890-2015/CEH_GEAR_daily_GB_2015.nc'
print('\n\nopendap_single_file:\t{0}'.format(opendap_single_file))

netcdf_variable = 'rainfall_amount'
print('\n\nnetcdf_variable:\t{0}'.format(netcdf_variable))

dataset = nc.Dataset(filename=netcdf_single_file, mode='r')
# print(dataset[netcdf_variable].shape)
zs = range(dataset[netcdf_variable].shape[0])
ys = range(dataset[netcdf_variable].shape[1])
xs = range(dataset[netcdf_variable].shape[2])
dataset.close()

def get_bounds(minimum, maximum, value, size):
	lower_half = math.ceil(size * 0.5)
	upper_half = math.floor(size * 0.5)
	if value - lower_half < minimum:
		return (minimum, minimum + size)
	elif value + upper_half > maximum:
		return (maximum - size + 1, maximum + 1)
	else:
		return (value - lower_half, value + upper_half)

repeat = 1
print('repeat:\t{0}'.format(repeat))

z_mid = int((364 / 2) - 0)
y_mid = int((1250 / 2) - 1)
x_mid = int((700 / 2) - 1)
print('\n\nz_mid:\t{0}'.format(z_mid))
print('y_mid:\t{0}'.format(y_mid))
print('x_mid:\t{0}'.format(x_mid))

size_list = [200, 250, 300, 350, 364]

results_dict = {}

column = 0

column_names = []

for size in size_list:
	#
	print('\n\nsize:\t{0}'.format(size))
	#
	z_size, y_size, x_size = size, size, size
	print('\tz_size:\t{0}\n\ty_size:\t{1}\n\tx_size:\t{2}'.format(z_size, y_size, x_size))
	#
	timeit_list_netCDF4_single_netcdf = []
	timeit_list_netCDF4_single_opendap = []
	#
	for loop in range(repeat):
		#
		print('\tloop:\t{0} / {1}'.format(loop + 1, repeat))
		#
		z_bounds = (z_mid - int(size / 2), z_mid + int(size / 2))
		y_bounds = (y_mid - int(size / 2), y_mid + int(size / 2))
		x_bounds = (x_mid - int(size / 2), x_mid + int(size / 2))
		print('\t\tz_bounds:\t{0}'.format(z_bounds))
		print('\t\ty_bounds:\t{0}'.format(y_bounds))
		print('\t\tx_bounds:\t{0}'.format(x_bounds))
		#
		print('\t\tnetcdf4_single_netcdf_file():')
		t = netcdf4_single_netcdf_file(netcdf_single_file,
		                               netcdf_variable,
		                               z_bounds,
		                               y_bounds,
		                               x_bounds)
		timeit_list_netCDF4_single_netcdf.append(t[1] if t[0] else NODATA_VALUE)
		#
		print('\t\tnetcdf4_single_opendap_file():')
		t = netcdf4_single_opendap_file(opendap_url=opendap_single_file,
		                                netcdf_variable=netcdf_variable,
		                                z=z_bounds,
		                                y=y_bounds,
		                                x=x_bounds)
		timeit_list_netCDF4_single_opendap.append(t[1] if t[0] else NODATA_VALUE)
		#

print('\n\nDone.')
# Get finish time
finish = datetime.datetime.now()
print('\n\nFinished at:\t\t{0}'.format(finish.strftime(datetime_format)))
# Calculate and display elapsed time (time for script to complete)
elapsed = finish - start
hours, remainder = divmod(elapsed.total_seconds(), 3600)
minutes, seconds = divmod(remainder, 60)
print('\n\nElapsed time:\t\t{0}:{1}:{2:04.1f} seconds\n'.format(str(int(hours)).zfill(2), str(int(minutes)).zfill(2), seconds))

And here's the output:

C:\Continuum\anaconda3\envs\netcdf-performance2\python.exe G:/endows/python-netcdf-libraries-performance/single_netcdf_test_netCDF4.py

netCDF4.version: 1.4.0

numpy.version: 1.14.3

Started at: 11:43:15 2018-07-17

netcdf_single_file: Z:\eidchub\33604ea0-c238-4488-813d-0ad9ab7c51ca\GB\daily\CEH_GEAR_daily_GB_2015.nc netCDF file exists: True

opendap_single_file: http://gds-ldb.nerc-lancaster.ac.uk/thredds/dodsC/public-CEHGEARGBDailyRainfallDetail1890-2015/CEH_GEAR_daily_GB_2015.nc

netcdf_variable: rainfall_amount repeat: 1

z_mid: 182 y_mid: 624 x_mid: 349

size: 200 z_size: 200 y_size: 200 x_size: 200 loop: 1 / 1 z_bounds: (82, 282) y_bounds: (524, 724) x_bounds: (249, 449) netcdf4_single_netcdf_file(): Creating netCDF4 Dataset... type(dataset): <class 'netCDF4._netCDF4.Dataset'> dataset.data_model: NETCDF4 Getting NumPy MaskedArray... type(data): <class 'numpy.ma.core.MaskedArray'> data.shape: (200, 200, 200) data.size: 8000000 data.dtype: float64 data.nbytes: 64000000 ma.count(data): 5930800 ma.count_masked(data): 2069200 ma.min(data, axis=None): 0.0 ma.max(data, axis=None): 77.7124616329 Calculating array mean... type(mean): <class 'numpy.float64'> ma.mean(data): 2.6925 netcdf4_single_opendap_file(): Creating netCDF4 Dataset... type(dataset): <class 'netCDF4._netCDF4.Dataset'> dataset.data_model: NETCDF3_CLASSIC Getting NumPy masked array... type(data): <class 'numpy.ma.core.MaskedArray'> data.shape: (200, 200, 200) data.size: 8000000 data.dtype: float64 data.nbytes: 64000000 ma.count(data): 5930800 ma.count_masked(data): 2069200 ma.min(data, axis=None): 0.0 ma.max(data, axis=None): 77.7124616329 Calculating array mean... type(mean): <class 'numpy.float64'> ma.mean(data): 2.6925

size: 250 z_size: 250 y_size: 250 x_size: 250 loop: 1 / 1 z_bounds: (57, 307) y_bounds: (499, 749) x_bounds: (224, 474) netcdf4_single_netcdf_file(): Creating netCDF4 Dataset... type(dataset): <class 'netCDF4._netCDF4.Dataset'> dataset.data_model: NETCDF4 Getting NumPy MaskedArray... type(data): <class 'numpy.ma.core.MaskedArray'> data.shape: (250, 250, 250) data.size: 15625000 data.dtype: float64 data.nbytes: 125000000 ma.count(data): 10464500 ma.count_masked(data): 5160500 ma.min(data, axis=None): 0.0 ma.max(data, axis=None): 95.1648848684 Calculating array mean... type(mean): <class 'numpy.float64'> ma.mean(data): 2.9421 netcdf4_single_opendap_file(): Creating netCDF4 Dataset... type(dataset): <class 'netCDF4._netCDF4.Dataset'> dataset.data_model: NETCDF3_CLASSIC Getting NumPy masked array... type(data): <class 'numpy.ma.core.MaskedArray'> data.shape: (250, 250, 250) data.size: 15625000 data.dtype: float64 data.nbytes: 125000000 ma.count(data): 10464500 ma.count_masked(data): 5160500 ma.min(data, axis=None): 0.0 ma.max(data, axis=None): 95.1648848684 Calculating array mean... type(mean): <class 'numpy.float64'> ma.mean(data): 2.9421

size: 300 z_size: 300 y_size: 300 x_size: 300 loop: 1 / 1 z_bounds: (32, 332) y_bounds: (474, 774) x_bounds: (199, 499) netcdf4_single_netcdf_file(): Creating netCDF4 Dataset... type(dataset): <class 'netCDF4._netCDF4.Dataset'> dataset.data_model: NETCDF4 Getting NumPy MaskedArray... type(data): <class 'numpy.ma.core.MaskedArray'> data.shape: (300, 300, 300) data.size: 27000000 data.dtype: float64 data.nbytes: 216000000 ma.count(data): 16563900 ma.count_masked(data): 10436100 ma.min(data, axis=None): 0.0 ma.max(data, axis=None): 288.553941909 Calculating array mean... type(mean): <class 'numpy.float64'> ma.mean(data): 3.6284 netcdf4_single_opendap_file(): Creating netCDF4 Dataset... type(dataset): <class 'netCDF4._netCDF4.Dataset'> dataset.data_model: NETCDF3_CLASSIC Getting NumPy masked array... type(data): <class 'numpy.ma.core.MaskedArray'> data.shape: (300, 300, 300) data.size: 27000000 data.dtype: float64 data.nbytes: 216000000 ma.count(data): 27000000 ma.count_masked(data): 0 ma.min(data, axis=None): 0.0 ma.max(data, axis=None): 0.0 ***** All masked array values are equal to 0.0! ***** Calculating array mean... type(mean): <class 'numpy.float64'> ma.mean(data): nan

size: 350 z_size: 350 y_size: 350 x_size: 350 loop: 1 / 1 z_bounds: (7, 357) y_bounds: (449, 799) x_bounds: (174, 524) netcdf4_single_netcdf_file(): Creating netCDF4 Dataset... type(dataset): <class 'netCDF4._netCDF4.Dataset'> dataset.data_model: NETCDF4 Getting NumPy MaskedArray... type(data): <class 'numpy.ma.core.MaskedArray'> data.shape: (350, 350, 350) data.size: 42875000 data.dtype: float64 data.nbytes: 343000000 ma.count(data): 24022600 ma.count_masked(data): 18852400 ma.min(data, axis=None): 0.0 ma.max(data, axis=None): 319.23187251 Calculating array mean... type(mean): <class 'numpy.float64'> ma.mean(data): 4.3744 netcdf4_single_opendap_file(): Creating netCDF4 Dataset... type(dataset): <class 'netCDF4._netCDF4.Dataset'> dataset.data_model: NETCDF3_CLASSIC Getting NumPy masked array... type(data): <class 'numpy.ma.core.MaskedArray'> data.shape: (350, 350, 350) data.size: 42875000 data.dtype: float64 data.nbytes: 343000000 ma.count(data): 42875000 ma.count_masked(data): 0 ma.min(data, axis=None): 0.0 ma.max(data, axis=None): 0.0 ***** All masked array values are equal to 0.0! ***** Calculating array mean... type(mean): <class 'numpy.float64'> ma.mean(data): nan

size: 364 z_size: 364 y_size: 364 x_size: 364 loop: 1 / 1 z_bounds: (0, 364) y_bounds: (442, 806) x_bounds: (167, 531) netcdf4_single_netcdf_file(): Creating netCDF4 Dataset... type(dataset): <class 'netCDF4._netCDF4.Dataset'> dataset.data_model: NETCDF4 Getting NumPy MaskedArray... type(data): <class 'numpy.ma.core.MaskedArray'> data.shape: (364, 364, 364) data.size: 48228544 data.dtype: float64 data.nbytes: 385828352 ma.count(data): 26372164 ma.count_masked(data): 21856380 ma.min(data, axis=None): 0.0 ma.max(data, axis=None): 319.23187251 Calculating array mean... type(mean): <class 'numpy.float64'> ma.mean(data): 4.6069 netcdf4_single_opendap_file(): Creating netCDF4 Dataset... type(dataset): <class 'netCDF4._netCDF4.Dataset'> dataset.data_model: NETCDF3_CLASSIC Getting NumPy masked array... type(data): <class 'numpy.ma.core.MaskedArray'> data.shape: (364, 364, 364) data.size: 48228544 data.dtype: float64 data.nbytes: 385828352 ma.count(data): 48228544 ma.count_masked(data): 0 ma.min(data, axis=None): 0.0 ma.max(data, axis=None): 0.0 ***** All masked array values are equal to 0.0! ***** Calculating array mean... type(mean): <class 'numpy.float64'> ma.mean(data): nan

Done.

Finished at: 11:47:48 2018-07-17

Elapsed time: 00:04:33.5 seconds

Process finished with exit code 0

If I submit the similar requests (e.g. http://gds-ldb.nerc-lancaster.ac.uk/thredds/dodsC/public-CEHGEARGBDailyRainfallDetail1890-2015/CEH_GEAR_daily_GB_2015.nc?x[0:1:700],y[0:1:1250],time[0:1:364],rainfall_amount[0:1:0][0:1:0][0:1:0]) via the THREDDS OPeNDAP HTML interface the THREDDS server does return the data, so the request isn't exceeding the data volume configured for our THREDDS instance.

I'm guessing that this might be something to do with the length of time taken for the server to respond?

Many thanks for any help, best wishes, Simon.

smwCEH avatar Jul 17 '18 10:07 smwCEH

The opendap features of netcdf4-python are just a thin wrapper around the functionality of the C library. Basically, we pass either a filename or a URL to the C lib and all the work is done in the C library. Therefore, it's going to be very difficult to figure out whether this is an issue on the client side (in the netcdf C lib) or the server side. Either way, it's not an issue with the python interface.

jswhit avatar Jul 18 '18 17:07 jswhit

BTW, I experienced this and moving to an older libnetcdf usually "fixes" it. I'll try to come up with a small reproducible example to help us investigate this.

ocefpaf avatar Jul 18 '18 17:07 ocefpaf

For the record, I was trying to debug this issue (coming from https://github.com/pydap/pydap/issues/184) "the right way", by cygdb is so broken it's not even funny. I will try with good ol' print calls.

astrojuanlu avatar Jun 30 '19 18:06 astrojuanlu

My particular problem was originating in pydap, so probably not related to this issue.

astrojuanlu avatar Jun 30 '19 20:06 astrojuanlu

My particular problem was originating in pydap, so probably not related to this issue.

Probably not. The issue here is also not related to netCDF4. I recall we narrowed it down to netcdf-c low timeout values. We patched for that in conda-forge but I believe it was already fixed upstream.

B/c pydap does not depend on netcdf-c the problem is unrelated. (Buy you may check their timeout values too.)

ocefpaf avatar Jul 01 '19 16:07 ocefpaf