h5netcdf
very slow partial reading when saved with index shift
What happened:
When two groups of data are saved with a shifted index as in code fragment ***1, partial reading becomes very slow later on. Saving at the same index as in ***2 works normally.
Note: the sample only prints shapes to shorten the output; a normal read into a variable behaves the same way.
upd2: fixed sample and notes.
What you expected to happen:
fast reading for all cases
MCVE Code Sample
# ------------------------------ saving part ----------------------------- #
import numpy as np
import h5netcdf.legacyapi as netCDF4

ofname = r'_o_.nc'
h, w = 720, 1440
reshap3 = [w, h, 1]

def calc1(*args):
    return np.random.random((h, w))

def calc_data_group2(*args):
    return np.random.random((h, w)), np.random.random((h, w))

with netCDF4.Dataset(ofname, 'w', phony_dims='sort') as ds:
    x = ds.createDimension('x', w)
    y = ds.createDimension('y', h)
    q = ds.createDimension('q', None)
    vx = ds.createVariable('x', 'f4', ('x',))
    vy = ds.createVariable('y', 'f4', ('y',))
    v_q = ds.createVariable('q', 'f8', ('q',))
    v_data1 = ds.createVariable('data', float, ('x', 'y', 'q'))
    v_U = ds.createVariable('U', float, ('x', 'y', 'q'))
    v_V = ds.createVariable('V', float, ('x', 'y', 'q'))
    q = 0
    prev = None
    for dat in range(60):  # tried 300+, 100 and 30 fragments of data; even for 30 the reading is noticeably slow
        d1 = calc1(dat)  # something like this with my data
        v_data1[:, :, q] = d1.reshape(reshap3)  # saving the 1st variable
        if q - 2 >= 0:
            U, V = calc_data_group2(prev, d1)  # something like this with my data
            # saving the second group of data
            # ***1
            v_U[:, :, q - 2] = U[:, :].T.reshape(reshap3)  # saved at index q-2
            v_V[:, :, q - 2] = V[:, :].T.reshape(reshap3)  # (!) very slow reading when saved like here
            # ***2
            # v_U[:, :, q] = U[:, :].T.reshape(reshap3)  # saved at index q
            # v_V[:, :, q] = V[:, :].T.reshape(reshap3)  # normal fast reading when saving at index q
        prev = d1
        q += 1  # main index of our data

# %%
with netCDF4.Dataset(ofname, 'r', phony_dims='sort') as ds:
    # ...dimensions read skipped...
    v_data1 = ds.variables.get('data')
    v_U = ds.variables.get('U')
    v_V = ds.variables.get('V')
    print("v_U.shape", v_U.shape)
    print("v_data1", v_data1[:, :, 0].shape)   # this read is always OK
    print("v_data1", v_data1[:, :, 1].shape)   # this read is always OK
    print("v_data1", v_data1[:, :, -1].shape)  # this read is always OK
    # ***3
    print("v_U", v_U[:, :, 0].shape)  # very slow when saved with shifted indices
    print("v_U", v_U[:, :, 1].shape)  # very slow when saved with shifted indices
    for k in range(10):
        print(k, "v_U", v_U[:, :, k].shape)  # very slow when saved with shifted indices
Version
'1.0.2'
@master-nemo Thanks for raising this and thanks for the extensive example.
The root cause is the automatic padding here: https://github.com/h5netcdf/h5netcdf/blob/c6d20162adddeb9c3c55ff39e325d11d471a6822/h5netcdf/core.py#L299-L310
If a variable with unlimited dimensions isn't written completely (as in the above MCVE: 58 slices written, but the dimension size is 60), the data is padded with the underlying fill value before extraction.
This was implemented to mimic the netcdf-c/netcdf4-python behaviour. Unfortunately it is not as performant as it should be.
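To illustrate the performance problem, here is a minimal sketch of the padding pattern (a simplified illustration with NumPy, not the actual h5netcdf code): when the on-disk variable is shorter than the declared unlimited dimension, the whole array is read and padded before the requested slice is taken, so even a single-slice read costs a full-array read.

```python
import numpy as np

def read_with_padding(var, key, dim_size, fillvalue=np.nan):
    # If the variable is shorter than the unlimited dimension, pad it
    # to full size with the fill value *before* subsetting -- this
    # forces the whole array into memory.
    if var.shape[-1] < dim_size:
        pad_width = [(0, 0)] * (var.ndim - 1) + [(0, dim_size - var.shape[-1])]
        full = np.pad(var[...], pad_width, constant_values=fillvalue)
        return full[key]
    return var[key]  # fast path: only the requested slice is touched

data = np.arange(24, dtype=float).reshape(2, 3, 4)  # 4 of 6 slots written
out = read_with_padding(data, (slice(None), slice(None), 0), dim_size=6)
print(out.shape)  # (2, 3)
```

With large grids such as the 720x1440 frames in the MCVE, that full read and copy dominates, which matches the observed slowdown.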
thanks. i will try to deal with it then)
Yeah, but we might improve the current code to only pad if really needed. As of now the padding is applied in any case, which slows down your processing.
The best solution to this problem would be to create the underlying dataset with the wanted netcdf-c/netcdf4-python fillvalues.
import netCDF4
netCDF4.default_fillvals
{'S1': '\x00',
'i1': -127,
'u1': 255,
'i2': -32767,
'u2': 65535,
'i4': -2147483647,
'u4': 4294967295,
'i8': -9223372036854775806,
'u8': 18446744073709551614,
'f4': 9.969209968386869e+36,
'f8': 9.969209968386869e+36}
The test for padding would need to be adapted slightly. Then padding will only be applied if data is requested from the uninitialized region beyond the dataset's size.
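The idea can be sketched with plain h5py (an illustration of the mechanism, not the actual h5netcdf change): if the dataset is created with the netCDF default fill value, HDF5 itself returns that value for any unwritten region, so no explicit padding pass over the whole array is needed.

```python
import h5py

# netCDF default fill value for f8, from the table above
NC_FILL_DOUBLE = 9.969209968386869e+36

with h5py.File("_fill_demo.h5", "w") as f:
    # Create the dataset with the netCDF fill value; HDF5 then serves it
    # for unwritten chunks without any padding step on read.
    dset = f.create_dataset("U", shape=(4, 3, 0), maxshape=(4, 3, None),
                            chunks=(4, 3, 1), dtype="f8",
                            fillvalue=NC_FILL_DOUBLE)
    dset.resize((4, 3, 6))
    dset[:, :, 0] = 1.0          # write only the first slice
    written = dset[0, 0, 0]
    unwritten = dset[0, 0, 5]    # uninitialized region behind the written data
print(written, unwritten)  # 1.0 9.969209968386869e+36
```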
thanks to all! I reserved space by saving a frame of NaNs, and it helps.
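That workaround can be sketched with plain h5py (the MCVE uses h5netcdf, but the mechanism is the same): on every step, grow the shifted variable up to the current index with a NaN placeholder, so its extent always matches the unlimited dimension and the padding path is never triggered.

```python
import h5py
import numpy as np

h, w = 4, 6  # tiny stand-in for the 720x1440 grids in the MCVE
with h5py.File("_workaround_demo.h5", "w") as f:
    v_U = f.create_dataset("U", shape=(w, h, 0), maxshape=(w, h, None),
                           chunks=(w, h, 1), dtype="f8")
    for q in range(5):
        # reserve index q right away with a NaN frame ...
        v_U.resize((w, h, q + 1))
        v_U[:, :, q] = np.nan
        if q - 2 >= 0:
            # ... and let the real (shifted) data overwrite index q-2 later
            v_U[:, :, q - 2] = np.random.random((h, w)).T
    extent = v_U.shape[2]
    tail_is_nan = np.isnan(v_U[:, :, 4]).all()
print(extent)  # 5: the variable extent matches the dimension size
```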
@master-nemo Glad you found a workaround for your use-case. Nevertheless it would be good to fix this within the package. I'll leave it open for now until fixed.
maybe this helps: I tried many sizes and noticed a nearly linear correlation with reading time (in the shifted case). It seems like it reads the whole array into memory.
Yes, that's actually the case. If the size of the dimension and the variable's extent along that dimension do not match, the whole array is read into memory before subsetting.
I think, I'll have a fix out in a few days. Thanks again, @master-nemo.
resolved by #183
Sorry for disturbing you again. The new version works pretty well, but there is one point where it fails: when I try to create an f2 variable, it tries to get f2 from default_fillvals... so:
Exception has occurred: KeyError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
'f2'
File "C:\dev\Python38\Lib\site-packages\h5netcdf\legacyapi.py", line 29, in _get_default_fillvalue
fillvalue = default_fillvals[f"{kind}{size}"]
File "C:\dev\Python38\Lib\site-packages\h5netcdf\core.py", line 734, in _create_child_variable
fillval = _get_default_fillvalue(dtype)
File "C:\dev\Python38\Lib\site-packages\h5netcdf\core.py", line 868, in create_variable
return group._create_child_variable(
File "C:\dev\Python38\Lib\site-packages\h5netcdf\legacyapi.py", line 224, in createVariable
return super(Group, self).create_variable(
...
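For illustration, a hedged sketch of a defensive lookup (a hypothetical variant, not the shipped fix): classic netCDF has no 16-bit float type, so "f2" is absent from default_fillvals, and the helper could fall back to None instead of raising KeyError.

```python
import numpy as np

# subset of netCDF4.default_fillvals; note there is no "f2" entry,
# because classic netCDF has no 16-bit float type
default_fillvals = {"f4": 9.969209968386869e+36, "f8": 9.969209968386869e+36}

def get_default_fillvalue_safe(dtype):
    # hypothetical defensive variant of _get_default_fillvalue:
    # return None for dtypes netCDF-c does not define instead of raising
    dt = np.dtype(dtype)
    return default_fillvals.get(f"{dt.kind}{dt.itemsize}")

print(get_default_fillvalue_safe("f8"))  # 9.969209968386869e+36
print(get_default_fillvalue_safe("f2"))  # None instead of KeyError
```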