Kernel crashes when processing image in tiles using dask (parallelization issue)

Open haesleinhuepf opened this issue 3 years ago • 8 comments
Dear future-self,

I'm experiencing kernel crashes when processing a big image in tiles. It is likely related to dask's multi-threading / parallel computing and OpenCL.

The issue can be reproduced on an Apple Mac M1 Max when executing this notebook on the full image size (comment out this one line in the second block): https://github.com/haesleinhuepf/BioImageAnalysisNotebooks/blob/main/docs/32_tiled_image_processing/tiled_nuclei_counting.ipynb

A workaround that makes the tiled image processing work is to call the following line, which deactivates asynchronous execution of OpenCL kernels:

cle.set_wait_for_kernel_finish(True)

I'm not sure if fixing this is easily possible. It might be easier to recommend that users do parallel computing with separate OpenCL contexts. See also the related pull request: https://github.com/clEsperanto/pyclesperanto_prototype/pull/129

CC @StRigaud (no action item here Stephane, just to let you know)

Cheers, past-self

haesleinhuepf avatar Feb 02 '22 11:02 haesleinhuepf
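
A minimal sketch of the separate-contexts idea raised in the opening post, using only the Python standard library. `StandInContext`, `get_context`, `process_tile`, and `process_all` are hypothetical names, and the stand-in does no GPU work; in a real setup each thread would presumably create its own OpenCL context and command queue instead. The sketch only illustrates the pattern of one context per worker thread, not how pyclesperanto_prototype manages its device internally.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StandInContext:
    """Hypothetical placeholder for a per-thread OpenCL context and queue."""

    def __init__(self):
        # Remember which thread created this context.
        self.owner = threading.get_ident()

    def process(self, tile):
        # Refuse to run if a different thread tries to use this context.
        if self.owner != threading.get_ident():
            raise RuntimeError("context used from a foreign thread")
        # Stand-in for real GPU work on the tile.
        return sum(tile)


# Thread-local storage: each worker thread sees its own attributes.
_local = threading.local()


def get_context():
    # Lazily create one context per worker thread; never share it.
    if not hasattr(_local, "ctx"):
        _local.ctx = StandInContext()
    return _local.ctx


def process_tile(tile):
    return get_context().process(tile)


def process_all(tiles, workers=4):
    # pool.map preserves the input order of the tiles.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_tile, tiles))


tiles = [[i, i + 1, i + 2] for i in range(16)]
results = process_all(tiles)
```

Because every thread only ever touches the context it created itself, no two threads enqueue work on the same queue. Whether that avoids the crash described here is untested in this thread.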

See also #144

haesleinhuepf avatar Feb 06 '22 15:02 haesleinhuepf

So with the cell commented out: [image] I should get a crash? Doing Run All works fine and results seem correct... I've uncommented it and everything is also fine.

I suspect this may be related to your emulated environment; mine's arm64 native (I've also used a fair bit of dask with pyclesperanto and stardist and pyopencl and it's been fine...).

psobolewskiPhD avatar Feb 20 '22 19:02 psobolewskiPhD

Hey @psobolewskiPhD ,

it's possible that this only happens on GPUs that are capable of parallel execution of kernels. I know that Intel's and AMD's integrated GPUs are not affected. NVidia dedicated GPUs are affected. Not sure about Apple M1 though.

Best, Robert

haesleinhuepf avatar Feb 20 '22 19:02 haesleinhuepf

Well, we both have M1, right? The difference is that you previously mentioned your env was in emulation, and mine is native.

psobolewskiPhD avatar Feb 20 '22 19:02 psobolewskiPhD

Ok, I just ran the notebook on the large image with kernel synchronization off. The kernel crashed. So M1 is also affected, at least in the x64-compatibility mode.

haesleinhuepf avatar Feb 20 '22 19:02 haesleinhuepf

Which image? uncropped or do I need to download something else? I'm quite curious regarding the differences between native and emulated (accounting for the fact mine is just vanilla M1 and yours is Max).

BTW: maybe you can try running the bench from here: https://github.com/clEsperanto/pyclesperanto_prototype/issues/136

psobolewskiPhD avatar Feb 20 '22 19:02 psobolewskiPhD

Which image? uncropped or do I need to download something else?

The notebook linked above should work on all computers. When you comment out these two lines, you start running into trouble:

image = image[1000:1500, 1000:1500]
cle.set_wait_for_kernel_finish(True)

How hard is it, btw, to set up an arm-based environment? I have never tried...

haesleinhuepf avatar Feb 20 '22 20:02 haesleinhuepf

Gonna run the notebook now.

Re: native env, it's easy—just use conda-forge miniforge (this is what Apple recommends too) or equivalently mambaforge. The question is how to maintain x86 and arm64. I did some looking into that, but never followed through—need to go back to my notes.

Edit: I changed chunks to 1000, 1000 and with or without the kernel_finish it's fine. Will try smaller chunks. Edit3: Meaning: [image]

Edit2: smaller chunks are fine too, just slow. 250, 250 makes the tile_map.compute() step take 50s—all on GPU tho, so everything is functional and fine. Quite incredible honestly.

Edit4: here's what was in my notes regarding x86 and arm64 coexisting: https://stackoverflow.com/questions/65415996/how-to-specify-the-architecture-or-platform-for-a-new-conda-environment-apple Sorry, not much. Like I said, never got around to making x86 python. Not like you can emulate CUDA anyways 🤣 Happy to use other channels or whatnot for the x86/arm64 discussion.

Edit5: what's interesting is that the set_wait_for_kernel_finish flag doesn't change the speed of execution. And the print statements seem to reflect some degree of async regardless.

psobolewskiPhD avatar Feb 20 '22 20:02 psobolewskiPhD
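
The chunk sizes mentioned above (1000, 1000 versus 250, 250) change how many tiles have to be processed. A small sketch of that arithmetic follows. The image shape is a hypothetical value chosen for illustration, since the thread does not state the real one, and `tile_count` is a hypothetical helper, not part of dask or pyclesperanto_prototype.

```python
import math


def tile_count(shape, chunks):
    # Number of tiles = product over axes of ceil(axis_length / chunk_length).
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


shape = (4000, 4000)  # hypothetical image shape, not taken from the thread

coarse = tile_count(shape, (1000, 1000))  # 4 * 4 = 16 tiles
fine = tile_count(shape, (250, 250))      # 16 * 16 = 256 tiles
```

Quartering the chunk edge length in 2D multiplies the tile count by 16. If each tile carries a fixed scheduling and transfer overhead, that is one plausible reason why the smaller chunks ran noticeably slower.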