Speeding up binary data reading and offering more general support
I think that this is a very good step forward in centralizing many of the file readers that exist in different packages. That being said, I think that reading the data could be greatly generalized.
These problems are more acute with 4D-STEM data simply because of its size, but I do think we should have a way to read binary data and metadata that is consistent and easy to follow. On top of that, all binary datasets should at the very least load properly using memory mapping, and maybe also provide alternatives.
Dealing with Metadata
Let's start with the example of loading metadata from a binary file. I've been playing around with defining my metadata as a dictionary of the form {"metadatakey": {"pos": 530, "dtype": "u4"}, ...}, but this could also be a JSON or XML file which directly describes where in the file each piece of metadata is located.
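For instance, a minimal sketch of loading such a mapping from a JSON sidecar file (the file name here is hypothetical):

import json

# Hypothetical sidecar file describing where each metadata item lives
with open("my_format_map.json") as f:
    mapping_dict = json.load(f)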
This can then be read by:
import numpy as np

def seek_read(file, dtype, pos):
    # Jump to the entry's byte offset and read a single value of the given dtype
    file.seek(pos)
    return np.squeeze(np.fromfile(file, dtype, count=1))

metadata = {m: seek_read(f, mapping_dict[m]["dtype"], mapping_dict[m]["pos"])
            for m in mapping_dict}
The metadata can then be read with one simple function; even more complex entries such as arrays can be handled by defining the numpy dtype accordingly. Ultimately that makes defining the metadata quite easy, and the raw values can then be mapped into a cleaner structure for use further on.
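As a sketch of what that cleanup step might look like (the keys, offsets, and target structure below are assumptions, loosely following a HyperSpy-style metadata tree):

# Hypothetical raw mapping; offsets and dtypes are made up for illustration
mapping_dict = {"ht": {"pos": 530, "dtype": "<u4"},        # high tension, volts
                "exposure": {"pos": 534, "dtype": "<f4"}}  # exposure, seconds

with open("my_data.bin", "rb") as f:  # hypothetical file; seek_read as above
    raw = {m: seek_read(f, mapping_dict[m]["dtype"], mapping_dict[m]["pos"])
           for m in mapping_dict}

# Map the raw values onto a cleaner, nested metadata structure
metadata = {"Acquisition_instrument":
                {"TEM": {"beam_energy": float(raw["ht"]) / 1e3,  # V -> kV
                         "exposure": float(raw["exposure"])}}}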
Dealing with the Data
For reading in the data I think that a similar approach can be used as well. If each signal in the dataset is defined, then it can easily be memory mapped.
An example of this is:
import numpy as np

dtype_list = [("Array", np.int16, (256, 128)), ("sec", "<u4"),
              ("ms", "<u2"), ("mis", "<u2"), ("Empty", bytes, 120)]

def read_binary(file, dtypes, offset, navigation_shape=None):
    # Memory-map the file as an array of records, one per frame; nothing
    # is copied until a field is actually accessed.
    keys = [d[0] for d in dtypes]
    mapped = np.memmap(file, offset=offset, dtype=dtypes,
                       shape=navigation_shape, mode="r")
    return {k: mapped[k] for k in keys}
In this case the trailing bytes are handled efficiently, and it is generally clear what format the binary data is in. It is also fast and efficient at accessing the data in chunks, one field per signal.
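For example, the mapped frame stack can be handed straight to dask for lazy, chunked access; a minimal sketch, where the file name, offset, and chunking are assumptions:

import dask.array as da

data = read_binary("my_data.bin", dtype_list, offset=1024,
                   navigation_shape=(64, 64))

# The memmap view is lazy already; wrapping it in dask adds chunked,
# parallel access without reading the whole file into memory.
frames = da.from_array(data["Array"], chunks=(8, 8, 256, 128))
timestamps = data["sec"]  # small per-frame metadata, cheap to read eagerly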
Additional information:
I don't know if this ends up being the fastest way to read data (ultimately that depends more on the system and on whether you are rechunking, etc.), but there are some cases, like reading the EMPAD detector, where we call memmap and then reshape the data with dask, which is fairly inefficient.
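If I remember the EMPAD layout correctly (128x128 float32 frames, each followed by two rows of per-frame metadata), a structured dtype would cover that case without any reshaping; a sketch under that assumption:

# Assumed EMPAD-style record: a frame plus two metadata rows (hedged)
empad_dtype = [("Array", "<f4", (128, 128)),
               ("metadata", "<f4", (2, 128))]

data = read_binary("scan.raw", empad_dtype, offset=0,
                   navigation_shape=(64, 64))
frames = data["Array"]  # shape (64, 64, 128, 128), no dask reshape needed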
I would love it if @sk1p or @uellue would chime in here as well. I think that we could maybe generalize some of the other loading schemes they have for binary data using different hardware or streaming data. Hopefully, with just a general set of loading methods, it would be easy to call the file reader function with different backend readers and really optimize performance.
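As a rough sketch of what that dispatch might look like (the registry and load function here are hypothetical, not an existing rosettasciio API):

# Hypothetical backend registry; read_binary as defined above
READERS = {"memmap": read_binary}

def load(file, dtypes, offset, navigation_shape=None, backend="memmap"):
    # New loading schemes (direct I/O, streaming, GPU staging, ...) could be
    # registered here without changing the reader's public interface.
    return READERS[backend](file, dtypes, offset, navigation_shape)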
It would also make adding new formats easier and faster, with a focus on maintaining speed and the flexibility to try new loading schemes as file storage changes or adapts.