
Very slow partial reading when data is saved with an index shift

Open master-nemo opened this issue 3 years ago • 8 comments

What happened:

When two groups of data are saved with a shifted index, as in code fragment ***1, later partial reads become very slow. Saving at the same index, as in ***2, works normally. Note: the sample prints only the shape to keep the output short. A normal read into a variable behaves the same way.

upd2: fixed sample and notes.

What you expected to happen:

Fast reading in all cases.

MCVE Code Sample

 
# ------------------------------ saving part ----------------------------- #
import numpy as np
import h5netcdf.legacyapi as netCDF4
ofname=r'_o_.nc'
h,w=720,1440
reshap3=[w,h,1]

def calc1(*args): return np.random.random((h,w))
def calc_data_group2(*args): return np.random.random((h,w)),np.random.random((h,w))

with netCDF4.Dataset(ofname,'w',phony_dims='sort') as ds:
    
    x = ds.createDimension('x',w) 
    y = ds.createDimension('y',h) 
    q = ds.createDimension('q',None) 
    
    vx = ds.createVariable('x', 'f4', ('x',)) 
    vy = ds.createVariable('y', 'f4', ('y',)) 
    v_q = ds.createVariable('q', 'f8', ('q',)) 
    
    
    v_data1 = ds.createVariable('data', float, ('x','y','q') )

    v_U = ds.createVariable('U', float, ('x','y','q') )
    v_V = ds.createVariable('V', float, ('x','y','q') )
    
    q=0 
    prev=None
    for dat in range(60):   # tried 300+, 100 and 30 data fragments; even with 30, reading is noticeably slow
        d1=calc1(dat)   # something like this with my data
        v_data1[:,:,q] = d1.reshape(reshap3)    # saving 1st variable
            
        if q-2>=0: 
            U,V = calc_data_group2(prev,d1) #something like this with my data
            
            #                              saving second group of data
            # ***1
            v_U[:,:,q-2] = U[:,:].T.reshape(reshap3)  # saved at index q-2
            v_V[:,:,q-2] = V[:,:].T.reshape(reshap3)  # (!) very slow reading when saved like here
            # ***2
            # v_U[:,:,q] = U[:,:].T.reshape(reshap3)  # saved at index q
            # v_V[:,:,q] = V[:,:].T.reshape(reshap3)  # normal fast reading when saving at index q
        prev=d1
        q+=1    # main index of the output data


# %%
with netCDF4.Dataset((ofname),'r',phony_dims='sort') as ds:
    # ... reading of dimensions skipped ...

    v_data1 = ds.variables.get('data')
    v_U      = ds.variables.get('U')  
    v_V      = ds.variables.get('V')  
    
    print("v_U.shape",v_U.shape)   
    print("v_data1",v_data1[:,:,0].shape)   # This reading always OK  
    print("v_data1",v_data1[:,:,1].shape)   # This reading always OK  
    print("v_data1",v_data1[:,:,-1].shape)  # This reading always OK  
    # ***3
    print("v_U",v_U[:,:,0].shape)           # This reading is very slow for saving with shifted indexes
    print("v_U",v_U[:,:,1].shape)           # This reading is very slow for saving with shifted indexes
    for k in range(10): 
        print(k,"v_U",v_U[:,:,k].shape)     # This reading is very slow for saving with shifted indexes        
        

Version

'1.0.2'

master-nemo • Sep 18 '22 09:09

@master-nemo Thanks for raising this and thanks for the extensive example.

The root cause is the automatic padding here: https://github.com/h5netcdf/h5netcdf/blob/c6d20162adddeb9c3c55ff39e325d11d471a6822/h5netcdf/core.py#L299-L310

If a variable with unlimited dimensions isn't written completely (in the MCVE above, U and V have 58 slices written, but the unlimited dimension q has size 60), the data is padded with the underlying fillvalue before extraction.

This was implemented to mimic the netcdf-c/netcdf4-python behaviour. Unfortunately it is not as performant as it should be.
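To make the cost concrete, here is a minimal NumPy-only sketch of the pad-before-extract pattern described above. It is not h5netcdf's actual code: `read_with_padding`, `stored` and `dim_size` are hypothetical names, and an in-memory array stands in for the HDF5 dataset.

```python
import numpy as np

# netCDF default fill value for f8 (the same value appears in netCDF4.default_fillvals)
FILL = 9.969209968386869e36


def read_with_padding(stored, dim_size, key, fill=FILL):
    """Pad `stored` along its last axis up to `dim_size`, then apply `key`.

    The whole array is materialised and padded even when `key` selects
    a single slice, so the cost grows with the full array size.
    """
    missing = dim_size - stored.shape[-1]
    pad_width = [(0, 0)] * (stored.ndim - 1) + [(0, missing)]
    padded = np.pad(stored, pad_width, mode="constant", constant_values=fill)
    return padded[key]


# Small stand-in for the MCVE: 58 slices written, dimension size 60.
stored = np.zeros((4, 3, 58))
first = read_with_padding(stored, 60, (slice(None), slice(None), 0))
last = read_with_padding(stored, 60, (slice(None), slice(None), -1))
print(first.shape, last.shape)
```

In this sketch the read of index 0 pays for padding all 60 slices, although the requested slice was fully written.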

kmuehlbauer • Sep 19 '22 08:09

Thanks. I will try to deal with it then.

master-nemo • Sep 19 '22 08:09

Yeah, but we might improve the current code to only pad if really needed. As of now the padding is applied in any case, which slows down your processing.

kmuehlbauer • Sep 19 '22 08:09

The best solution to this problem would be to create the underlying dataset with the wanted netcdf-c/netcdf4-python fillvalues.

import netCDF4
netCDF4.default_fillvals
{'S1': '\x00',
 'i1': -127,
 'u1': 255,
 'i2': -32767,
 'u2': 65535,
 'i4': -2147483647,
 'u4': 4294967295,
 'i8': -9223372036854775806,
 'u8': 18446744073709551614,
 'f4': 9.969209968386869e+36,
 'f8': 9.969209968386869e+36}

The test for padding would need to be adapted slightly. Padding would then only be applied if data is requested from the uninitialized region that lies beyond the dataset's size.
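The adapted test could look roughly like the sketch below. This is an illustration under assumptions, not the code that ended up in h5netcdf: `needs_padding`, `stored_len` and `dim_size` are hypothetical names, and only the key along the unlimited axis is considered.

```python
def needs_padding(key, stored_len, dim_size):
    """Return True if `key` (an int or slice along the unlimited axis)
    touches the unwritten tail, i.e. any index in [stored_len, dim_size)."""
    if isinstance(key, slice):
        # Resolve the slice against the dimension size, as the caller sees it.
        touched = range(*key.indices(dim_size))
        return len(touched) > 0 and max(touched[0], touched[-1]) >= stored_len
    index = key + dim_size if key < 0 else key
    if not 0 <= index < dim_size:
        raise IndexError(f"index {key} out of range for size {dim_size}")
    return index >= stored_len


# MCVE numbers: 58 slices stored, dimension size 60.
print(needs_padding(0, 58, 60))             # written region
print(needs_padding(-1, 58, 60))            # unwritten tail
print(needs_padding(slice(0, 10), 58, 60))  # written region
```

With a check like this, reads such as `v_U[:,:,0]` could go straight to the stored data, and only reads that reach index 58 or 59 would take the padding path.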

kmuehlbauer • Sep 19 '22 11:09

Thanks to all! I reserved space by saving a frame of NaNs, and it helps.

master-nemo • Sep 20 '22 10:09

@master-nemo Glad you found a workaround for your use-case. Nevertheless it would be good to fix this within the package. I'll leave it open for now until fixed.

kmuehlbauer • Sep 20 '22 11:09

Maybe this helps: I tried many sizes and noticed a nearly linear correlation between array size and reading time (in the shifted case). It seems to be reading the whole array into memory.

master-nemo • Sep 20 '22 12:09

Yes, that's actually the case. If the size of the dimension and the size of the variable's dimension do not match, the whole array is read into memory before subsetting.

I think I'll have a fix out in a few days. Thanks again, @master-nemo.
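The near-linear scaling reported above follows from simple arithmetic. The sketch below estimates how many bytes each read strategy pulls into memory for the MCVE's `U` variable (1440 x 720 x 58 stored values of 8-byte floats). `bytes_read` is a hypothetical helper, and the estimate ignores chunking, compression and the extra padded slices.

```python
import math

import numpy as np


def bytes_read(shape, dtype, whole_array):
    """Estimate bytes loaded when reading one slice along the last axis.

    whole_array=True models "read everything, then subset";
    whole_array=False models reading only the requested slice.
    """
    itemsize = np.dtype(dtype).itemsize
    frame = math.prod(shape[:-1]) * itemsize  # one (x, y) slice
    return frame * (shape[-1] if whole_array else 1)


shape = (1440, 720, 58)  # stored size of U in the MCVE
one_slice = bytes_read(shape, float, whole_array=False)
everything = bytes_read(shape, float, whole_array=True)
print(f"single slice: {one_slice / 2**20:.1f} MiB")
print(f"whole array:  {everything / 2**20:.1f} MiB")
```

Under these assumptions, every single-slice read in the shifted case costs as much as loading all stored slices, so the read time grows with the number of frames written.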

kmuehlbauer • Sep 20 '22 12:09

resolved by #183

kmuehlbauer • Nov 23 '22 06:11

Sorry to disturb you again. The new version works well, but it fails in one case: when I try to create an f2 variable, it tries to look up f2 in default_fillvals, so:


Exception has occurred: KeyError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
'f2'
  File "C:\dev\Python38\Lib\site-packages\h5netcdf\legacyapi.py", line 29, in _get_default_fillvalue
    fillvalue = default_fillvals[f"{kind}{size}"]
  File "C:\dev\Python38\Lib\site-packages\h5netcdf\core.py", line 734, in _create_child_variable
    fillval = _get_default_fillvalue(dtype)
  File "C:\dev\Python38\Lib\site-packages\h5netcdf\core.py", line 868, in create_variable
    return group._create_child_variable(
  File "C:\dev\Python38\Lib\site-packages\h5netcdf\legacyapi.py", line 224, in createVariable
    return super(Group, self).create_variable(	
...
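The failure can be reproduced without h5netcdf. The sketch below copies the fill-value table posted earlier in this thread and mirrors the lookup shown in the traceback. It assumes the key is built from the NumPy dtype's `kind` and `itemsize`; `lookup_fillvalue` is a hypothetical helper that returns `None` for a missing key and is not the library's actual fix.

```python
import numpy as np

# Copied from the netCDF4.default_fillvals listing earlier in this thread.
default_fillvals = {
    "S1": "\x00",
    "i1": -127,
    "u1": 255,
    "i2": -32767,
    "u2": 65535,
    "i4": -2147483647,
    "u4": 4294967295,
    "i8": -9223372036854775806,
    "u8": 18446744073709551614,
    "f4": 9.969209968386869e36,
    "f8": 9.969209968386869e36,
}


def lookup_fillvalue(dtype, table=default_fillvals):
    """Return the default fill value for `dtype`, or None if the table
    has no entry (as for float16, whose key would be 'f2')."""
    dtype = np.dtype(dtype)
    return table.get(f"{dtype.kind}{dtype.itemsize}")


print(lookup_fillvalue("f8"))
print(lookup_fillvalue("f2"))
```

Using `dict.get` here only shows where the `KeyError` comes from. What a caller should do when no default exists (skip the fill value, or raise a clearer error) is a separate design decision.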

master-nemo • Jan 15 '23 17:01