Support more features of the mib reader
Support for mib files (quantum detector) has been added in #174 and this is a list of features that could be added:
- [x] infer navigation shape from flyback - reading exposure time is implemented but test files are needed - #235.
- [x] add support for reading ROI - #289.
- [ ] add support for reading RAW data
- [ ] Support single acquisition with a number of frames to skip per line - see #270?
I'm very interested in all of these; if anyone needs test files, we can definitely supply them. Maybe also a pull request if someone has time.
As it says, the first one should be very easy; the other two would need a volunteer!
For the last point, we have (MIT-licensed) decoders for most raw MIB formats, which should be re-usable, for example in the LiberTEM-live project.
That includes decoding files taken with quad arrangements, but not yet arbitrary arrangements (1x4 etc.). What's also missing is support for 24-bit raw files, where two consecutive frames are sent for the high and low bits (IIRC).
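If the two-frame scheme works as described above (the exact bit layout is an assumption here, matching the "IIRC"), reassembling the 24-bit frame is a simple bitwise merge. `combine_24bit_frames` and the 12-bit split are hypothetical; check the actual raw MIB format before relying on this:

```python
import numpy as np

def combine_24bit_frames(low, high):
    """Merge two consecutive 12-bit raw frames into one 24-bit frame.

    Sketch only: assumes the detector sends the low 12 bits in the
    first frame and the high 12 bits in the second.
    """
    return low.astype(np.uint32) | (high.astype(np.uint32) << 12)

low = np.array([[0xABC]], dtype=np.uint16)
high = np.array([[0x123]], dtype=np.uint16)
frame = combine_24bit_frames(low, high)  # 0x123ABC
```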
Feel free to take these, or take inspiration as needed.
Many thanks for adding this! @ericpre
Just to note that you seem to have two methods to read mib files now, one file_reader and one load_mib_data. The file_reader applies an up-down flip on the diffraction patterns here. It would be good to add this to the metadata.
Also just wondering about the plans for adding the cross gap for quad configurations. Will this be re-added to pyxem?
@M0hsend We can definitely add this back to pyxem! I just need to understand what a good workflow would be and what information is/isn't available. I don't have/use a Merlin detector so supporting some of these features is a bit difficult without a bit more direction :).
Is the cross gap information stored in the metadata? Is the cross gap constant, or could you arrange the detectors any way you want? I think we would want something like a map_to_detector method for pyxem, which would add pixels with np.nan values at the fraction of the dataset where the cross is. We could think about creating a Detector class which defines a layout for a certain detector and then maps pixels from the original to the transformed dataset. That gives us much more freedom to start thinking about how to handle any possible detector shape.
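A minimal numpy sketch of that Detector idea, assuming a 2x2 quad of 256x256 chips and a 3-pixel gap (the class name, geometry, and gap width are illustrative, not an existing pyxem API):

```python
import numpy as np

class Detector:
    """Hypothetical layout: maps raw chip pixels onto the physical
    detector, with inactive gap pixels set to NaN."""

    def __init__(self, chip_shape=(256, 256), geometry=(2, 2), gap=3):
        self.chip_shape = chip_shape
        self.geometry = geometry
        self.gap = gap

    def map_to_detector(self, frame):
        cy, cx = self.chip_shape
        ny, nx = self.geometry
        # Full frame size includes one gap between each pair of chips.
        out = np.full((ny * cy + (ny - 1) * self.gap,
                       nx * cx + (nx - 1) * self.gap), np.nan)
        for i in range(ny):
            for j in range(nx):
                y0 = i * (cy + self.gap)
                x0 = j * (cx + self.gap)
                out[y0:y0 + cy, x0:x0 + cx] = frame[i * cy:(i + 1) * cy,
                                                    j * cx:(j + 1) * cx]
        return out

det = Detector()
mapped = det.map_to_detector(np.ones((512, 512)))
# mapped has shape (515, 515) with a 3-pixel-wide NaN cross.
```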
Hi @CSSFrancis, the detector configuration is stored in the header section. When you load quad Merlin data using data = file_reader(path, lazy=False), for the quad case you get data[0]['original_metadata']['mib_properties']['detector_geometry'] as 2x2. There is a physical gap between the chips (3 pixels wide) that is not actively reading data, so if this gap is not included, the relative distances/angles in the diffraction pattern are not quite right. So I wonder if it is better suited to be included here, as I doubt anyone would want to read wrong data.
The pyxem method could be to somehow fill that cross gap with various interpolation algorithms?
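One possible interpolation scheme, as a sketch: fill the NaN gap pixels row by row with numpy's linear interpolation (this is one of many options, not an existing pyxem method):

```python
import numpy as np

def fill_gap_rows(frame):
    """Fill NaN gap pixels along each row by linear interpolation."""
    out = frame.copy()
    x = np.arange(out.shape[1])
    for row in out:  # each row is a view, so writes go into `out`
        nan = np.isnan(row)
        if nan.any() and (~nan).any():
            row[nan] = np.interp(x[nan], x[~nan], row[~nan])
    return out

frame = np.array([[1.0, np.nan, np.nan, 4.0]])
filled = fill_gap_rows(frame)  # gap pixels become 2.0 and 3.0
```

The same pass along columns (or a 2D scheme) would handle the vertical arm of the cross.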
According to the manual, there is a setting on the acquisition computer, the Layout parameter in Config.ini, and the string can be one of [2x2, Nx1, 2x2G, Nx1G], where G indicates that the gap has been added.
I checked some data with the gap added and the G is not there... which means that we can't rely on this metadata! @matkraj, is it correct/expected?
> The file_reader is applying an up-down flip on the diffraction patterns here. Would be good to add this to the metadata.
@M0hsend, why do you think this would be good? Are different image orientations possible?
One motivation to do that in pyxem is that adding the gap messes up the data chunking, and I would expect that it makes sense to do it at the same time as the correction - but I haven't checked whether this is correct or how bad the performance penalty is.
From @matkraj in https://github.com/pyxem/pyxem/issues/634
Hi all, is there any reason hyperspy/rosettasciio doesn't load Merlin data lazily by default? Merlin data is easily memory mapped, so it should not be a problem and is, in my opinion, the preferred behaviour. The only case I see where memory mapping will not work is if the data is binary RAW - 1-bit packed.
@matkraj I don't know if we want to set multiple defaults for whether a signal is lazy or not, as it starts to get confusing. Memory mapping is, in my opinion, easy to do, but you can fairly easily run into problems. You can see the discussion here https://github.com/hyperspy/rosettasciio/pull/174#pullrequestreview-1686535220, but the gist is that memory mapping works best when reading linearly, and as you increase the size of the dataset you run into an increasing number of problems.
Numpy likes to keep things as memory-mapped arrays until it can't, and then it will pull the data into memory, which is hard to predict and requires an in-depth understanding of numpy. For lazy data I'd much rather just transform it to a dask array and have explicit control over some of these operations. In reality, no one should be using a format like .mib to actually do data processing; the first thing you should do is convert the data to .zspy (and then probably throw away the raw data, or compress it if you feel sentimental).
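To make the contrast concrete, here is a small self-contained sketch (the file, shape, and dtype are made up): numpy will happily memory-map a raw file, and wrapping that memmap in a dask array makes the chunked reads explicit rather than leaving numpy to decide when to materialize data:

```python
import tempfile

import numpy as np
import dask.array as da

# Write a small fake raw file to memory-map.
tmp = tempfile.NamedTemporaryFile(suffix=".raw", delete=False)
tmp.close()
np.arange(64, dtype=np.uint16).tofile(tmp.name)

# Plain numpy memmap: reads happen implicitly on access.
mm = np.memmap(tmp.name, dtype=np.uint16, mode="r", shape=(8, 8))

# Dask wrapper: reads are explicit, chunked, and happen at compute().
arr = da.from_array(mm, chunks=(4, 8))
total = int(arr.sum().compute())  # sum of 0..63 == 2016
```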
Just to give an extreme example, I often have .mrc files that are something like 400x256x256x128x128 (time, x, y, kx, ky), in which case reading along the time axis is painful because you have to jump around so much on the disk :). In that case it would be faster to create a dask array chunked in (y, kx, ky), save the data in a compressed format, load it and rechunk it along the time axis, resave it, and then operate along the time axis. I realize that sounds crazy, but there isn't really a way around it :)
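The rechunking step in that workflow looks roughly like this (a toy-sized sketch; in practice you would save to a compressed format between the two chunkings):

```python
import dask.array as da

# Toy stand-in for a (time, y, kx, ky) stack, chunked frame-by-frame
# as it would come off the detector.
stack = da.zeros((40, 8, 16, 16), chunks=(1, 8, 16, 16))

# Rechunk so each chunk spans the full time axis: one chunk read now
# yields the whole time series for a block of detector pixels.
per_pixel = stack.rechunk((40, 8, 4, 4))
```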
Hi @CSSFrancis, thanks for pointing this out. When I was working on GPU processing in the past, I preferred any geometry adjustments to be done after the data was processed, but I recognise this depends on the type of processing needed. For template matching this is trivial, but ptychography might be heavily impacted.
I was wondering if the default behaviour should be able to capture more usage variants, i.e. not having enough RAM, being able to visualise the data quickly with .plot() etc.
> Hi @CSSFrancis thanks for pointing this out for me. When I was working on GPU processing in past I preferred any geometry adjustments to be done after the data was processed but I recognise this will be impacted by the type of processing needed. For template matching this is trivial but ptychography might be heavily impacted by this.
Yeah, that's an important point, and I think I've struggled to disseminate this idea in the community! In general, better education about lazy processing is something the community needs. Honestly there are no great answers to questions such as:
- What is the ideal chunking for a 4-D STEM dataset?
- How should data be efficiently saved/loaded?
- How should data be processed efficiently?
Honestly, the best solution is to just save things using zarr, as compression fixes so many of these questions. Just keep buying more and more hard drives, set up an array, and at some point you get I/O speeds of 1-10 GB/s times whatever compression factor you get. At that point I/O is no longer your bottleneck :). Hard drives are cheap; 4D STEM datasets are large.
> I was wondering if the default behaviour should be able to capture more usage variants, i.e. not having enough RAM, being able to visualise the data quickly with .plot() etc.
I feel like this goes with the above. I'd default to non-lazy operations, and then if people want lazy they can pass lazy=True. We could be better about how the lazy data is chunked, and you can add distributed loading if you want.
> Hi all, Is there any reason hyperspy/rosettasciio doesn't load Merlin data lazily as a default? Merlin data is easily memory mapped so it should not be problem and in my opinion a preferred behaviour. The only one I see the memory mapping will not work is if the data was binary and RAW - 1-bit packed.
If you want to access the memory map directly, there is a lower-level function, https://hyperspy.org/rosettasciio/user_guide/supported_formats/quantumdetector.html#rsciio.quantumdetector.load_mib_data, that returns it.
> Hi @CSSFrancis thanks for pointing this out for me. When I was working on GPU processing in past I preferred any geometry adjustments to be done after the data was processed but I recognise this will be impacted by the type of processing needed. For template matching this is trivial but ptychography might be heavily impacted by this.
Just to note that the navigation_shape parameter can be used to specify the navigation shape and therefore load the data as a stack of images using navigation_shape=(number_of_frames,). It is a bit annoying because you need to know the number of frames when reading the data, but maybe this can be fixed by not doing any data reshape for a given value of navigation_shape. I can't really think of what a good value for that would be, maybe -1?
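The reshape that navigation_shape drives is essentially the following numpy operation (sizes here are made up for illustration):

```python
import numpy as np

# A flat stack of 12 frames of 4x4 pixels, as you'd get from
# navigation_shape=(12,) when the scan shape is unknown.
frames = np.arange(12 * 4 * 4).reshape(12, 4, 4)

# Once the scan shape is known (say 3x4), restoring it is a cheap
# reshape over the same buffer:
scan = frames.reshape(3, 4, 4, 4)
```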