
Add dask array support

Open • thewtex opened this issue 6 years ago • 9 comments

Stages:

  1. [x] Basic support for dask.array
  2. [ ] Stream dask.array
  3. [ ] Use dask.delayed or dask.futures to help downsample images.

CC: @mrocklin @dani-lbnl xref #43 https://github.com/scisprints/2018_05_sklearn_skimage_dask/issues/12

thewtex commented May 25 '18 23:05

The coarsen function might be relevant here

http://dask.pydata.org/en/latest/array-api.html#dask.array.coarsen
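
For reference, a minimal sketch of what coarsen-based block averaging might look like (the shape, chunks, and shrink factors below are illustrative, with chunk sizes chosen to be divisible by the factors):

```python
import numpy as np
import dask.array as da

# Illustrative volume; chunk sizes are multiples of the shrink factor.
x = da.random.random((400, 400, 200), chunks=(200, 200, 100))

# Average each 10x10x10 block, shrinking every axis by a factor of 10.
y = da.coarsen(np.mean, x, {0: 10, 1: 10, 2: 10})

small = y.compute()  # NumPy array of shape (40, 40, 20)
```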

mrocklin commented May 26 '18 00:05

@mrocklin good suggestion regarding coarsen! :+1:

Some experiments are summarized here:

https://gist.github.com/thewtex/079c59ea71a1da59bea5e14cb6b73d1f

For large image support, it looks like we should not convert NumPy arrays, ITK Images, etc. to Dask just for downsampling, because of the extra memory required, the time spent chunking, and the downsampling performance. However, I will look into adding first-class Dask array support by using Dask arrays directly (with coarsen) for large images. That avoids the extra memory and the time needed to allocate and chunk it.
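
A rough sketch of what that first-class support could look like (the helper name is hypothetical, and downscale_local_mean merely stands in for the existing in-memory ITK/NumPy path):

```python
import numpy as np
import dask.array as da
from skimage.transform import downscale_local_mean

def downsample_for_view(image, factors=(10, 10, 10)):
    """Hypothetical helper: only data that is already a dask array goes through dask."""
    if isinstance(image, da.Array):
        # Already chunked / possibly larger than memory: coarsen directly,
        # with no NumPy-to-dask conversion and no extra allocation.
        # (Chunk sizes divisible by the factors behave most predictably.)
        return da.coarsen(np.mean, image, dict(enumerate(factors)), trim_excess=True)
    # In-memory NumPy (or ITK-converted) data keeps an in-memory path;
    # downscale_local_mean is only a stand-in for the optimized ITK shrink.
    return downscale_local_mean(np.asarray(image), factors)
```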

CC: @jakirkham

thewtex commented Oct 29 '18 04:10

In general I agree that the main advantage here is probably for larger-than-memory data. However, there are likely several things you can do to accelerate things on the Dask side here. For example, you might pass name=False to from_array to avoid that cost (it's hashing the data to find a deterministic key that you don't care about).
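
For instance (the array here is just a stand-in; the relevant part is name=False):

```python
import numpy as np
import dask.array as da

arr = np.random.random((256, 256, 256))  # stand-in for the full volume

# name=False skips hashing the array contents to derive a deterministic key,
# which can dominate from_array time for large in-memory arrays.
darr = da.from_array(arr, chunks=(128, 128, 128), name=False)
```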

@hmaarrfk might have some other suggestions.

I'm not sure if I'm reading the notebook correctly, but it seems like you have three approaches that perform as follows. Is this right?

  • Numpy 40s
  • Dask array 3s
  • Some ITK specialized code 1.5s

If this is correct then I'm curious, what does the ITK specialized code do differently than Numpy?

mrocklin commented Oct 29 '18 11:10

It seems to me you have a 6.7 GB dataset you are working on. I feel your pain. ITK is going at about 4 GB/s. You are shrinking by 10, 10, 10, is that correct?

I've found numpy's code to be pretty slow and inefficient for large images. Slowdowns typically come from a few places:

  • Generous calls to np.ascontiguousarray, which do wonders for small data (copying it is almost free) but are very expensive for larger nd arrays.
  • np.pad and np.block were optimized for small arrays, where repeated concatenation is actually much faster than building up the metadata required for single-copy padding/blocking. Please chime in here https://github.com/numpy/numpy/pull/11358 to express support, as some believe that this is a "micro optimization". As a note, an np.block algorithm for large arrays just got in, so the performance of that one should be much better in 1.16 for arrays larger than 512x512.

The reason this is important is that scikit-image calls np.pad to pad the array before downscale_local_means. @lagru seems to have written _fast_pad for scikit-image, which implements the algorithm above, but only for pad widths of 1. I would support any PR that extends that functionality to width > 1, but I would much prefer that numpy actually update their algorithm with @lagru's work.
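
For illustration, the single-copy idea is roughly the following (a sketch of the algorithm being described, not the actual _fast_pad implementation, and it only handles constant padding):

```python
import numpy as np

def pad_constant_single_copy(arr, pad_width, constant=0):
    # Allocate the padded output exactly once...
    out_shape = tuple(s + before + after
                      for s, (before, after) in zip(arr.shape, pad_width))
    out = np.full(out_shape, constant, dtype=arr.dtype)
    # ...and copy the input into its interior, instead of repeatedly
    # concatenating along each axis (one full copy per padded axis).
    interior = tuple(slice(before, before + s)
                     for s, (before, _) in zip(arr.shape, pad_width))
    out[interior] = arr
    return out
```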

Since you are downsampling by 10, and some of the dimensions are 2048, it will pad to get those extra 2 points, and thus copy the array twice, once for each dimension it needs to pad.

Finally, the downscale local means function uses, in my mind, a very inefficient algorithm.

Eventually it calls this function https://github.com/scikit-image/scikit-image/blob/972ce191ba6361356de17274bcf7ef2bb4d989e8/skimage/measure/block.py#L77 which, for your 3D image, creates a 6D view where the strides are arranged so that dims 0:3 equal your final dimensions and dims 3:6 are what you want to average over. But it makes 3 calls to np.mean instead of 1, and thus allocates 3 large arrays in the process. I'm not even sure if np.mean supports non-contiguous arrays without copying; you'd better hope it doesn't copy, but it really might, so you have to check.
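
For illustration, the reshaped-view approach with a single reduction over the block axes looks roughly like this (sizes are illustrative and divisible by the factors):

```python
import numpy as np
from skimage.util import view_as_blocks

image = np.random.random((200, 200, 160)).astype(np.float32)  # small stand-in volume
factors = (10, 10, 10)

# view_as_blocks returns a 6D view: dims 0:3 are the downsampled shape,
# dims 3:6 index the voxels inside each 10x10x10 block.
blocks = view_as_blocks(image, factors)

# A single reduction over the block axes gives the block-mean downsampled volume.
downsampled = blocks.mean(axis=(3, 4, 5))
print(downsampled.shape)  # (20, 20, 16)
```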

See this PR: https://github.com/scikit-image/scikit-image/pull/3522. It fails the median test, which was almost surely wrong. Help writing a benchmark would be appreciated!

dask provides a trim_excess parameter that seems like an interesting way to deal with this excess; it could probably be integrated into scikit-image.

Summarizing the places where I see slowdowns:

  1. You are downsampling 2048, 2048, 1600 by 10, 10, 10, which causes the array to get padded. The input array should be contiguous and its shape should be divisible by the shrink factors (see the trimming sketch just after this list).
  2. Try my dev branch of skimage.
  3. If you are on Linux, try the numpy dev branch to make use of kernel optimizations for large memory allocations.
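
As a sketch of item 1, trimming each dimension down to a multiple of its shrink factor (rather than padding, much like dask's trim_excess) could look like:

```python
import numpy as np

factors = (10, 10, 10)
volume = np.empty((2048, 2048, 16), dtype=np.uint8)  # small stand-in for the real data

# Crop each axis to a multiple of its shrink factor so no padding
# (and no padding copies) is needed before block averaging.
trimmed = volume[tuple(slice(0, (s // f) * f)
                       for s, f in zip(volume.shape, factors))]
print(trimmed.shape)  # (2040, 2040, 10)

# Note: trimmed is a view; downstream code that requires a contiguous
# array will still trigger a copy.
```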

PS: you can probably speed up your large array creation by following some of the (not really hacky) suggestions here: https://github.com/numpy/numpy/issues/11919 though these shouldn't be as necessary in numpy 1.16.

Dask seems to use a similar strategy to numpy: reshape the array (free), then call mean once over the new shape. Your dask code could be optimized in a few places:

  1. I got hung up on the name parameter too. See discussion: https://github.com/dask/dask/issues/3946
  2. use persist. It is unlikely that you care about creating a final contiguous matrix (unless you do).
  3. Remove the call to astype. This is pretty contentious in scikit-image. I've tried to see if using smaller dtypes would help; it just depends on how the compiler can optimize them. It isn't obvious that Cython is the right tool to ensure that fast operations occur on uint8.

Finally, if you really want to hack, provide a dtype to coarsen and to local_means and feed in float32.
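
Putting those suggestions together, the dask side might look roughly like this (illustrative shapes, with chunk sizes divisible by the factors; the dtype keyword is forwarded by coarsen to the reduction, as suggested above):

```python
import numpy as np
import dask.array as da

volume = np.random.random((1000, 1000, 80)).astype(np.float32)  # stand-in data

# name=False: skip hashing the data to build a deterministic key.
d = da.from_array(volume, chunks=(500, 500, 80), name=False)

# Block-mean downsample by 10 along each axis; dtype goes to np.mean,
# so the reduction accumulates in float32 instead of promoting to float64.
small = da.coarsen(np.mean, d, {0: 10, 1: 10, 2: 10}, dtype=np.float32)

# persist: materialize the result but keep it as chunked blocks rather than
# concatenating everything into one contiguous NumPy array (as compute() would).
small = small.persist()
```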

hmaarrfk commented Oct 29 '18 12:10

How is dask shrinking 256x256 by 10x10? Is it trimming each chunk, or just the whole array?

hmaarrfk commented Oct 29 '18 12:10

In general I agree that the main advantage here is probably for larger-than-memory data.

Yes, the prospect of eventually handling larger-than-memory data with a distributed system is exciting :exclamation:

However, there are likely several things you can do to accelerate things on the Dask side here. For example, you might pass name=False to from_array to avoid that cost (it's hashing the data to find a deterministic key that you don't care about).

Ah, that is good to know! Yes, adding name=False makes the distribution into a Dask Array much, much faster (10 s to 4.2 ms).

If this is correct then I'm curious, what does the ITK specialized code do differently than Numpy?

Right, this is a compile-time, optimized-to-the-task implementation. More details are here (PDF).

thewtex commented Oct 30 '18 04:10

It seems to me you have a 6.7 GB dataset you are working on. I feel your pain. ITK is going at about 4 GB/s. You are shrinking by 10, 10, 10, is that correct?

Yes, this was a rough preview to understand whether we can get interactive performance when dynamically downsampling for visualization (first implementation in #98).

thewtex commented Oct 30 '18 04:10

The reason this is important is that scikit-image calls np.pad to pad the array before downscale_local_means. @lagru seems to have written _fast_pad for scikit-image, which implements the algorithm above, but only for pad widths of 1. I would support any PR that extends that functionality to width > 1, but I would much prefer that numpy actually update their algorithm with @lagru's work.

Since you are downsampling by 10, and some of the dimensions are 2048, it will pad to get those extra 2 points, and thus copy the array twice, once for each dimension it needs to pad.

Good points, @hmaarrfk .

dask provides a trim_excess parameter that seems like an interesting way to deal with this excess; it could probably be integrated into scikit-image.

:+1:

The padding also generates "artifacts" on the border for this use case.

thewtex commented Oct 30 '18 04:10

you are downsampling 2048, 2048, 1600 by 10, 10, 10 which causes the array to get padded. The input array should be contiguous and its shape should be divisible by the shrink factors.

For this use case, we have to accept all data as it comes. All data is welcome :-).

PS: you can probably speed up your large array creation by following some of the (not really hacky) suggestions here: numpy/numpy#11919 though these shouldn't be as necessary in numpy 1.16.

Nicely done!

2. use `persist`. It is unlikely that you care about creating a final contiguous matrix (unless you do).

3. Remove the call to `astype`. This is pretty contentious in scikit-image. I've tried to see if using smaller dtypes would help; it just depends on how the compiler can optimize them. It isn't obvious that Cython is the right tool to ensure that fast operations occur on uint8.

Finally, if you really want to hack, provide a dtype to coarsen and to local_means and feed in float32.

Good ideas, but these were unfortunately constraints from the use case.

Thanks for the reviews @hmaarrfk @mrocklin !

thewtex commented Oct 30 '18 05:10