syncopy
syncopy copied to clipboard
Using h5 region references to index trials
Current Situation
At the moment we have an overloaded time
axis: holding both the actual samples of a trial which actually has a time axis, and the trials even for data which does not have a time axis:
import syncopy as spy
import numpy as np
mockup = np.ones((15, 2))
ad = spy.AnalogData([mockup, mockup])
spec = spy.freqanalysis(ad)
Now looking at the shapes we get:
>>> ad.data.shape
(20, 2)
>>> spec.data.shape
(2, 1, 6, 2)
The time
axis of the spec
object is not a real time axis, but just holds the trials, whereas for ad
it does both.
This leads to an overkill of index gymnastics and therefore mental load which leads to error prone and not very readable/maintainable code. This is also reflected in our bug history, where the confusion of sample indices and trials led to numerous bugs, for example: #180, #207, #239 and #240
Proposed Solution
Using h5 Region References allows for clearly and elegantly decouple the time axis from the trials, yet still allows for the same stacking and overlapping in the backing hdf5 dataset:
import h5py
myfile = h5py.File("test", driver='core', mode='w')
myds = myfile.create_dataset('dset', (5, 2))
# define pseudo trials
myds[:3] = 1
myds[2:] = 2
# indicate overlapping region
myds[2:3] = 12
# define (overlapping) trials via region references
trl1 = myds.regionref[:3]
trl2 = myds.regionref[2:]
Indexing is now straightforward:
>>> myds[trl1]
array([[ 1., 1.],
[ 1., 1.],
[12., 12.]], dtype=float32)
>>> myds[trl2]
array([[12., 12.],
[ 2., 2.],
[ 2., 2.]], dtype=float32)
Implementing this in our dataclasses would decrease code opaqueness, and vastly improve maintanability and the speed of feature additions. Also downstream from the pure dataclass implementations, streaming trials to the actual computations would be now almost trivial.
Implementation
Rather than creating these references only once, and then storing them along the actual data within the hdf containers, I suggest creating them on the fly from the trialdefinition
provided by any Syncopy dataset. Something like this is anyways also happening in the current implementation to populate attributes like .trials
. This would allow for a seamless integration into the current code base, and does not break compatibility to Syncopy data which is already saved somewhere.
I know that one argument for the current state is, that also without Syncopy the (raw) data is still somewhat directly accessible, IF ppl are able to parse the trialdefinition provided in the .info
json file. I would argue that this isn't changing, as we still would provide the very same trialdefinitions
in the respective .info
files, and the actual arrays holding the data would not change a bit (pun intended) on disc compared to the current implementation :)