
Synergies with LiberTEM, UDFs vs. map_blocks

Open din14970 opened this issue 3 years ago • 8 comments

I'm posting this mainly as a point of discussion.

In my experience, indexing real and large datasets (whichever method is used) is still quite slow, even with whatever I implemented in #673. These things should not take more than a few minutes on a cluster if the functions could make use of GPUs, which is something I'd like to look into, e.g. with CuPy. More than that, however, I feel the functionality we write here to perform operations on datasets using dask.array.map_blocks requires a lot of boilerplate: first write a function that operates on a frame, then a function that loops over all the frames in a chunk, then wrap that in yet another layer. After a discussion with Dieter Weber I recently started looking into LiberTEM's UDF API, which seems like a much more natural, granular and scalable approach to operating on large datasets. I'm experimenting with this to see if I can implement "fast" template matching as a LiberTEM UDF, hopefully also with CUDA support, so we could run it on our GPU cluster.
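To make the comparison concrete, a minimal LiberTEM UDF looks roughly like this (a sketch only, not the template-matching implementation; the class name, dataset path and buffer name are placeholders):

import numpy as np
from libertem.api import Context
from libertem.udf import UDF

class MaxFrameUDF(UDF):  # placeholder UDF: per-frame maximum intensity
    def get_result_buffers(self):
        # one value per scan position ("nav" buffer)
        return {"maxval": self.buffer(kind="nav", dtype=np.float32)}

    def process_frame(self, frame):
        # called once per diffraction pattern; no chunk loop or wrapper needed
        self.results.maxval[:] = frame.max()

ctx = Context()
ds = ctx.load("auto", path="scan_data.raw")  # placeholder path
result = ctx.run_udf(dataset=ds, udf=MaxFrameUDF())
max_map = result["maxval"].data  # numpy array with the shape of the scan

LiberTEM takes care of splitting the dataset into partitions, scheduling the work and merging the per-partition results, which is exactly the boilerplate that map_blocks pushes onto us.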

I'm wondering if there are any longer term ideas about using the LiberTEM API inside Pyxem or porting the Pyxem functionality to LiberTEM. Hence opening the issue here for a discussion.

din14970 avatar Mar 02 '21 16:03 din14970

Doesn't the dask_tools._process_dask_array already do this? Example:

import hyperspy.api as hs
from scipy.ndimage import gaussian_filter
from pyxem.utils.dask_tools import _process_dask_array
s = hs.load("test_data.hspy", lazy=True) # 4-dimensional dataset
data_gaussian = _process_dask_array(s.data, gaussian_filter, sigma=20) # Get dask array as output and retain chunk sizes

It can easily be extended to .map-like functionality, for example via the https://github.com/hyperspy/hyperspy/pull/2617 pull request.

One advantage is that data_gaussian is a dask array with the chunks retained, so we can easily combine it with further processing by "layering" additional processing steps on top:

import pyxem as pxm
s_gaussian = pxm.signals.Diffraction2D(data_gaussian).as_lazy()
s_gaussian.unit = "2th_deg"
s_gaussian.set_ai()
s_ai = s_gaussian.get_azimuthal_integral1d(npt=100) # Still lazy signal
s_ai.save("test_azimuthal.hspy")

magnunor avatar Mar 02 '21 17:03 magnunor

There isn't really a short answer to these questions, but the last time Dieter and I spoke, the conclusion we reached was that there are a lot of unexploited synergies between the work being done in LiberTEM, hyperspy and pyxem. I think in the short-to-medium term they (LiberTEM) were hoping to improve hyperspy's lazy signal handling. As I've recently discovered, @CSSFrancis also has pull requests open with hyperspy to a similar effect (upstreaming some of our dask processing code), and that seems to be the direction things are going.

In terms of the use case @din14970 is describing, if the functionality was rewritten into a LiberTEM UDF I would be very happy to merge it into pyxem. I have no objections to depending on LiberTEM (but similarly, there is little enthusiasm from either us or LiberTEM to "merge" the two packages into one) and it may well be faster than using dask for most users (although I've not personally run experiments to confirm this).

pc494 avatar Mar 02 '21 18:03 pc494

@magnunor I actually wasn't aware of dask_tools._process_dask_array, so that's a nice tip. However, in my specific case I suspect your implementation would be slower, because the _process_chunk function relies on a Python for-loop that holds the GIL when you call compute, whereas my "apply to chunk" function is njit-ed and should be able to bypass the GIL entirely, though I must admit I have not benchmarked it. (A rough sketch of what I mean is shown after the list below.) The main points I am thinking about are:

  • scalability to multi-CPU, multi-GPU and multi-node systems, which LiberTEM was designed to exploit from the start. I know dask is also made for this with distributed, and you could pass the dask array you created to a dask Client to compute it. As far as I understand, LiberTEM also uses dask.distributed under the hood. I'm wondering how vanilla dask, using an array generated with dask_tools, would measure up against LiberTEM.
  • analysis of streaming data. While I don't know how it works in practice, the LiberTEM guys assured me LiberTEM can analyze data as it is being written to disk at the microscope. I was thinking that would be an interesting platform to build live orientation indexing on, for example.
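To make the njit-ed chunk function concrete, here is a rough sketch of the kind of thing I mean (not the actual pyxem code; the function and variable names are made up, and the correlation is deliberately simplistic):

import numba
import numpy as np
import dask.array as da

@numba.njit(nogil=True, cache=True)
def _best_template_chunk(chunk, templates):
    # chunk: (scan_y, scan_x, det_y, det_x); templates: (n_templates, det_y, det_x)
    # nogil=True lets several dask threads run this chunk function concurrently
    ny, nx = chunk.shape[0], chunk.shape[1]
    out = np.empty((ny, nx), dtype=np.int32)
    for i in range(ny):
        for j in range(nx):
            best_score = -1e30
            best_index = 0
            for t in range(templates.shape[0]):
                score = np.sum(chunk[i, j] * templates[t])
                if score > best_score:
                    best_score = score
                    best_index = t
            out[i, j] = best_index
    return out

# data: 4D dask array chunked over the navigation axes only
# best = da.map_blocks(_best_template_chunk, data, templates=templates,
#                      drop_axis=(2, 3), dtype=np.int32)
# best.compute()  # or hand the graph to a distributed Client on a cluster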

@pc494

but similarly, there is little enthusiasm from either us or LiberTEM to "merge" the two packages into one

As far as I understood, LiberTEM sees itself less and less as a 4D-STEM analysis package: they want to focus on the fast-computation side and push the application-specific functionality out to Hyperspy/Pyxem.

din14970 avatar Mar 02 '21 18:03 din14970

@pc494

but similarly, there is little enthusiasm from either us or LiberTEM to "merge" the two packages into one

As far as I understood, LiberTEM sees itself less and less as a 4D-STEM analysis package: they want to focus on the fast-computation side and push the application-specific functionality out to Hyperspy/Pyxem.

Yes, that's certainly my impression, and it seems like an excellent way of allocating the skills we currently have across the communities.

pc494 avatar Mar 02 '21 19:03 pc494

We have plans to try out LiberTEM's UDFs in kikuchipy as well (https://github.com/pyxem/kikuchipy/issues/321), so I'm interested in seeing how it goes, @din14970!

As Magnus mentions, the "interactivity" of dask array signals is really nice and practical. I'm thinking that using a combination of dask and LiberTEM might be good, in that LiberTEM can do some specialized computations, while dask does all the rest.

hakonanes avatar Mar 03 '21 08:03 hakonanes

With regards to _process_dask_array, it is currently implemented in a fairly "general" way, so that any function which processes a single 2D image can be used. However, there are probably several optimizations that can be made. For example, many functions can handle any number of dimensions, so the for loop could be skipped in many cases by slightly rewriting the _process_chunk function (roughly along the lines of the sketch below). This should be pretty quick to implement.
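As a rough illustration of that idea (a hypothetical variant, not the current pyxem implementation; the function name is made up), the chunk handler could apply the function to the whole chunk when it supports n-dimensional input, and only fall back to the per-frame loop otherwise:

import numpy as np
from scipy.ndimage import gaussian_filter

def _process_chunk_nd(chunk, func, frame_wise=True, **kwargs):
    # If func handles n-dimensional arrays, call it once on the whole chunk
    if not frame_wise:
        return func(chunk, **kwargs)
    # Otherwise fall back to looping over the navigation positions
    out = np.empty_like(chunk)
    for idx in np.ndindex(chunk.shape[:-2]):
        out[idx] = func(chunk[idx], **kwargs)
    return out

# gaussian_filter can smooth only the signal axes of a 4D chunk in one call:
# _process_chunk_nd(chunk, gaussian_filter, frame_wise=False, sigma=(0, 0, 20, 20))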

However, in my specific case I suspect your implementation would be slower, because the _process_chunk function relies on a Python for-loop that holds the GIL when you call compute, whereas my "apply to chunk" function is njit-ed and should be able to bypass the GIL entirely, though I must admit I have not benchmarked it.

Give it a shot! I'm very curious to see how it handles it.

One thing to keep in mind when benchmarking very big datasets is to time the "full" processing, from the hard drive to the end result. It is easy to over-optimize one part of the processing while the real bottleneck is somewhere else.

analysis of streaming data. While I don't know how it works in practice, the LiberTEM guys assured me LiberTEM can analyze data as it is being written to disk at the microscope. I was thinking that would be an interesting platform to build live orientation indexing on, for example.

I did a little work on this earlier, with fpd_live_imaging. However, I haven't really had time to maintain it... https://fast_pixelated_detectors.gitlab.io/fpd_live_imaging/

magnunor avatar Mar 03 '21 16:03 magnunor

  • scalability to multi-CPU, multi-GPU and multi-node systems, which LiberTEM was designed to exploit from the start. I know dask is also made for this with distributed, and you could pass the dask array you created to a dask Client to compute it. As far as I understand, LiberTEM also uses dask.distributed under the hood. I'm wondering how vanilla dask, using an array generated with dask_tools, would measure up against LiberTEM.

It's been a little while since I've looked at LiberTEM, but they do seem to have a much better handle on dealing with large datasets than hyperspy does (partly because I don't think hyperspy has really had to deal with them until recently). For running on multiple cores across a cluster, I've recently been playing around with dask-jobqueue, which might be of interest if you want to schedule jobs on a cluster through Dask.
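For anyone curious, a minimal dask-jobqueue setup looks roughly like this (the queue name, core count, memory and walltime are placeholders that depend on the cluster):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Request workers from a SLURM scheduler; the parameters are cluster-specific
cluster = SLURMCluster(queue="normal", cores=16, memory="64GB", walltime="02:00:00")
cluster.scale(jobs=4)  # launch 4 worker jobs
client = Client(cluster)

# Any lazy dask array (e.g. built with dask_tools or map_blocks) now computes on the workers:
# result = data_gaussian.compute()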

I wouldn't say that https://github.com/hyperspy/hyperspy/pull/2617 is going to completely fix the map function in hyperspy, but I think it is a good start. I'll try to benchmark its ability to run in parallel, hopefully later today.

CSSFrancis avatar Mar 03 '21 17:03 CSSFrancis

On this topic, dask seems to play very well with RAPIDS and the UCX framework: https://medium.com/rapids-ai/high-performance-python-communication-with-ucx-py-221ac9623a6a

Also, NEP 35 will be very convenient for writing code that works for cupy, numpy, dask/numpy or dask/cupy arrays. I have started experimenting with a branch of hyperspy which supports cupy arrays, and it works well without changing much of the code.
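As a small sketch of what NEP 35 enables (an illustrative function, not hyperspy code; it assumes NumPy >= 1.20 and a backend that implements the like= dispatch, such as cupy or dask):

import numpy as np

def centre_of_mass(frames):
    # like= creates the coordinate vectors with the same array type as frames,
    # so one code path runs on numpy, cupy, dask/numpy or dask/cupy
    y = np.arange(frames.shape[-2], like=frames)
    x = np.arange(frames.shape[-1], like=frames)
    total = frames.sum(axis=(-2, -1))
    com_y = (frames * y[:, None]).sum(axis=(-2, -1)) / total
    com_x = (frames * x[None, :]).sum(axis=(-2, -1)) / total
    return com_y, com_x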

In my experience, indexing real and large datasets (whichever method is used) is still quite slow, even with whatever I implemented in #673. These things should not take more than a few minutes on a cluster if the functions could make use of GPUs, which is something I'd like to look into, e.g. with CuPy. More than that, however, I feel the functionality we write here to perform operations on datasets using dask.array.map_blocks requires a lot of boilerplate:

As @CSSFrancis said, this shouldn't be necessary if the hyperspy map function were improved.
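For comparison, a sketch of the kind of usage an improved lazy map would allow (based on the current map API; the filenames are placeholders, following the example above):

import hyperspy.api as hs
from scipy.ndimage import gaussian_filter

s = hs.load("test_data.hspy", lazy=True)  # 4-dimensional dataset
# map applies the frame-wise function over the navigation axes and stays lazy
s_filtered = s.map(gaussian_filter, sigma=20, inplace=False)
s_filtered.save("test_filtered.hspy")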

ericpre avatar Mar 04 '21 23:03 ericpre

Although this was never resolved, I am closing as old+discussion.

pc494 avatar Jan 16 '23 14:01 pc494