Saving Event Based Data
Describe the functionality you would like to see.
This is something that I've talked about offline a couple of times, but it would be good to have a more serious discussion and get ahead of it before it is too late :). It would be nice to develop a standard for saving event-based data. This is usually stored as something like four columns: x, y, intensity, time, but there are potentially other ways it can be saved.
For example, you could have a counted dataset where you have a grouping of [x, y] positions and some associated times, etc. Note that this would be equivalent to the ragged-array storage that we currently have implemented.
If you want to create an image or a spectrum, you just integrate the events over some time period.
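As a minimal sketch of that integration step (assuming the event columns are NumPy arrays sorted by time; the function name is hypothetical):

```python
import numpy as np

def integrate_events(x, y, t, intensity, t_start, t_end, shape):
    """Accumulate events falling in [t_start, t_end) into a 2D image."""
    # Events are sorted by time, so a binary search finds the window bounds.
    lo, hi = np.searchsorted(t, [t_start, t_end])
    image = np.zeros(shape, dtype=np.float64)
    # np.add.at correctly accumulates repeated (y, x) positions.
    np.add.at(image, (y[lo:hi], x[lo:hi]), intensity[lo:hi])
    return image
```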
Describe the context
This type of data doesn't conform very well to formats like HDF5 or Zarr. We could segment it into something like 1-second chunks or groups of 1,000,000 events (4 columns × 64 bits × 1,000,000 events → ~32 MB), which would make recreation much faster. For compression's sake, you probably want some sort of chunk index that records the start time and end time of each chunk. This is how Parquet handles larger column-based arrays.
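A rough sketch of what that layout could look like in HDF5, with a per-chunk (start_time, end_time) index; the dataset names and chunk size here are assumptions, not an agreed format:

```python
import h5py
import numpy as np

CHUNK_EVENTS = 1_000_000  # ~32 MB per chunk at 4 x 64-bit columns

def write_events(path, x, y, t, intensity):
    """Write event columns chunked by event count, plus a time index per chunk."""
    n = len(t)
    with h5py.File(path, "w") as f:
        for name, col in [("x", x), ("y", y), ("time", t), ("intensity", intensity)]:
            f.create_dataset(name, data=col, chunks=(min(CHUNK_EVENTS, n),))
        # One (start_time, end_time) pair per chunk for fast window lookup.
        bounds = [(t[i], t[min(i + CHUNK_EVENTS, n) - 1])
                  for i in range(0, n, CHUNK_EVENTS)]
        f.create_dataset("chunk_index", data=np.asarray(bounds, dtype=t.dtype))
```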
As far as recreation goes, this is a pretty fun lazy-computation problem. Each integrated image can be created lazily after you slice in time. You can reslice without penalty because the full dataset is never materialized.
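For instance, with Dask the per-window integration could be deferred like this (reusing the hypothetical `integrate_events` helper from above; nothing is computed until a frame is actually pulled):

```python
import dask
import dask.array as da

def lazy_frame_stack(x, y, t, intensity, edges, shape):
    """Build a lazy stack of integrated frames, one per time bin in `edges`."""
    frames = [
        da.from_delayed(
            dask.delayed(integrate_events)(x, y, t, intensity, t0, t1, shape),
            shape=shape,
            dtype=float,
        )
        for t0, t1 in zip(edges[:-1], edges[1:])
    ]
    return da.stack(frames)  # reslicing just builds a new graph, no I/O yet
```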
Additional information
A couple of key points:
- The data is sorted by time, which means a binary search can locate a time window in O(log N); a naive linear scan is O(N), which is probably too slow for lots of points, so we want to pre-index the data somehow (see above for saving a time index or constant-size chunks, and the lookup sketch after this list).
- This problem won't be embarrassingly parallel (unless we dictate something like a frame rate for the events), but it should be close, with each chunk only having to be read at most twice.
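As referenced in the first point above, a saved chunk index makes the window lookup cheap. A sketch, assuming the `(start_time, end_time)` index layout from the HDF5 example:

```python
import numpy as np

def chunks_for_window(chunk_index, t_start, t_end):
    """Return the range of chunk ids overlapping [t_start, t_end).

    chunk_index: (n_chunks, 2) array of (start_time, end_time) rows,
    sorted by start_time, as written by the HDF5 sketch above.
    """
    # First chunk whose last event is at or after the window start ...
    lo = np.searchsorted(chunk_index[:, 1], t_start, side="left")
    # ... up to (but excluding) the first chunk starting at/after the window end.
    hi = np.searchsorted(chunk_index[:, 0], t_end, side="left")
    return range(lo, hi)
```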
I feel like this might make more sense in rosettasciio rather than elsewhere (such as hyperspy, etc.), as it would be good to handle the integration (or lack thereof) at load time rather than later. There is also the question of saving and operating on the data: do we keep it as event-based data, or is the dataset converted to an integrated dataset at some point during the analysis?
Actionable Items:
- Is HDF5 or Zarr sufficiently capable of storing this type of data? (What other options are there?)
- Is prioritizing compression important for this kind of data? (What size of data can we expect? Are there substantial speed benefits from compression?)
- How should integration be handled? (Using something like scipy.sparse, converting to a dask array, etc.? See the sketch after this list.)
- How should saving be handled? Should we only load files, or save them as well?
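For the integration question above, one possible route via `scipy.sparse` could look like the following (a sketch only; the helper name is made up):

```python
from scipy import sparse

def events_to_sparse_frame(x, y, intensity, shape):
    """Sum events at repeated (y, x) positions into one sparse frame."""
    frame = sparse.coo_matrix((intensity, (y, x)), shape=shape)
    return frame.tocsr()  # conversion sums duplicate coordinates
```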
@ericpre @francisco-dlp @sk1p @uellue @cophus @bbammes