syncopy Using h5 region references to index trials

Using h5 region references to index trials

Open tensionhead opened this issue 2 years ago • 0 comments

Current Situation

At the moment we have an overloaded time axis: holding both the actual samples of a trial which actually has a time axis, and the trials even for data which does not have a time axis:

import syncopy as spy
import numpy as np

mockup = np.ones((15, 2))
ad = spy.AnalogData([mockup, mockup])
spec = spy.freqanalysis(ad)

Now looking at the shapes we get:

>>> ad.data.shape
(20, 2)
>>> spec.data.shape
(2, 1, 6, 2)

The time axis of the spec object is not a real time axis, but just holds the trials, whereas for ad it does both. This leads to an overkill of index gymnastics and therefore mental load which leads to error prone and not very readable/maintainable code. This is also reflected in our bug history, where the confusion of sample indices and trials led to numerous bugs, for example: #180, #207, #239 and #240

Proposed Solution

Using h5 Region References allows for clearly and elegantly decouple the time axis from the trials, yet still allows for the same stacking and overlapping in the backing hdf5 dataset:

import h5py

myfile = h5py.File("test", driver='core', mode='w')                                                     
myds = myfile.create_dataset('dset', (5, 2))                                                            
# define pseudo trials                                                                                  
myds[:3] = 1                                                                                            
myds[2:] = 2                                                                                            
# indicate overlapping region                                                                           
myds[2:3] = 12                                                                                          
# define (overlapping) trials via region references                                                                   
trl1 = myds.regionref[:3]                                                                               
trl2 = myds.regionref[2:]

Indexing is now straightforward:

>>> myds[trl1]
array([[ 1.,  1.],
       [ 1.,  1.],
       [12., 12.]], dtype=float32)
>>> myds[trl2]
array([[12., 12.],
      [ 2.,  2.],
      [ 2.,  2.]], dtype=float32)

Implementing this in our dataclasses would decrease code opaqueness, and vastly improve maintanability and the speed of feature additions. Also downstream from the pure dataclass implementations, streaming trials to the actual computations would be now almost trivial.

Implementation

Rather than creating these references only once, and then storing them along the actual data within the hdf containers, I suggest creating them on the fly from the trialdefinition provided by any Syncopy dataset. Something like this is anyways also happening in the current implementation to populate attributes like .trials. This would allow for a seamless integration into the current code base, and does not break compatibility to Syncopy data which is already saved somewhere.

I know that one argument for the current state is, that also without Syncopy the (raw) data is still somewhat directly accessible, IF ppl are able to parse the trialdefinition provided in the .info json file. I would argue that this isn't changing, as we still would provide the very same trialdefinitions in the respective .info files, and the actual arrays holding the data would not change a bit (pun intended) on disc compared to the current implementation :)

Mar 28 '22 12:03 tensionhead

syncopy syncopy copied to clipboard

Using h5 region references to index trials

Current Situation

Proposed Solution

Implementation

syncopy
syncopy copied to clipboard