
Lazy and Dask Improvements

Open CSSFrancis opened this issue 1 year ago • 2 comments

There are a couple of things that I wanted to bring up, and possibly offer solutions for. @hakonanes maybe you can tell me if these are reasonable or not.

## Overlap causing RAM Spikes

There are a couple of instances where the `dask.overlap` function is used. In my personal experience this function suffers heavily from the "curse of dimensionality": with an overlap depth in every dimension, each output chunk depends on 3^n input chunks, where n is the number of chunked dimensions. In the worst-case scenario of a 4D dataset chunked in every dimension, each chunk also loads its 3^4 − 1 = 80 neighbouring chunks! This basically means that:

  1. Your entire dataset is being loaded into RAM
  2. You are spending much of your time transferring chunks from worker to worker (which is bad).

Even with only the x and y dimensions chunked, each chunk loads 3^2 = 9 chunks. While this is better, and dask has recently improved how it handles this situation, you will more than likely still see a large spike in memory usage. I would suggest adding a warning to every function that uses overlap, to the effect of:

"Warning: For optimal performance and RAM usage only one dimension should be chunked. Best practice is to call s.rechunk(nav_chunks=("auto",-1)). Saving and loading the data can also help as that optimizes the chunks saved on the disk"

## Loading Data Lazily

I would highly recommend adding a zarr format to kikuchipy. It is probably the easiest way to speed things up a bit, and it would also let you load data and use dask.distributed. Otherwise, you can read binary data in a way that can be applied to distributed computing.

## End-to-End Lazy Computations

It would be convenient not to have to call the compute function at the end of lazy workflows. The idea here is that you could create a lazy orix `CrystalMap` object that you could slice and compute only part of. That might be difficult, but could potentially yield pretty large improvements.
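In generic dask terms (this is not the orix API, just the underlying idea), the win comes from slicing before computing, so only the touched chunks are ever read:

```python
import dask.array as da

# A hypothetical lazy CrystalMap could wrap per-point arrays like this
# one, e.g. one quaternion per map point.
rotations = da.random.random((100, 100, 4), chunks=(20, 20, 4))

subset = rotations[:20, :20]  # slicing stays lazy, no data touched
small = subset.compute()      # only the chunks under the slice load
```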

— CSSFrancis, Mar 21 '23 15:03