
some simple optimizations and swing image display speedup (setRGB is synchronized so use the int[] raster, much faster)

Open automenta opened this issue 7 years ago • 8 comments

:+1:
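
A minimal sketch of the raster idea from the title, assuming the image is created as `BufferedImage.TYPE_INT_RGB` so that its data buffer is a `DataBufferInt`. The class and method names (`RasterWriteSketch`, `pixelsOf`, `render`) are made up for illustration and are not taken from the repository. Instead of calling `setRGB` once per cell, the code grabs the backing `int[]` once and stores pixel values into it directly.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferInt;

final class RasterWriteSketch {

    /** Returns the int[] that backs an int-typed image; stores into it change the image directly. */
    static int[] pixelsOf(BufferedImage image) {
        // Assumption: the image type is int-backed (TYPE_INT_RGB or TYPE_INT_ARGB).
        // For other image types this cast throws ClassCastException.
        return ((DataBufferInt) image.getRaster().getDataBuffer()).getData();
    }

    /** Paints live cells white and dead cells black, one array store per cell. */
    static void render(boolean[] cells, int[] pixels) {
        for (int i = 0; i < cells.length; i++) {
            pixels[i] = cells[i] ? 0xFFFFFF : 0x000000;
        }
    }

    public static void main(String[] args) {
        int width = 4, height = 2;
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        int[] pixels = pixelsOf(image);   // fetched once, reused for every frame

        boolean[] cells = new boolean[width * height];
        cells[1 * width + 1] = true;      // live cell at x=1, y=1
        render(cells, pixels);

        // Reading back through the normal API shows the write reached the image.
        System.out.println(Integer.toHexString(image.getRGB(1, 1) & 0xFFFFFF));
    }
}
```

One trade-off to keep in mind: fetching the array with `getData()` may prevent some implementations from caching the image in video memory, so it is worth measuring on the target machine.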

automenta avatar Aug 14 '17 13:08 automenta

Thanks for the effort. I would accept a small change around 'imgData'.

OlegMazurov avatar Aug 16 '17 03:08 OlegMazurov

yes that is the most important part

btw it might be interesting to compare it with this

https://github.com/automenta/aparapicellular/blob/master/src/main/java/com/aparapi/cellular/ConwayLifeKernel.java#L51

from

https://github.com/automenta/aparapicellular#aparapi---cellular-automata

automenta avatar Aug 16 '17 10:08 automenta

In my taxonomy, the aparapi approach falls into the "synchronous parallel" category. While GPU hardware parallelism provides a significant performance boost, the need to synchronize on a barrier between generations is still there, and my point is that eventually it kills scalability. My implementation is really just a proof-of-concept. I'm not positioning it as a faster way to run Life (there are better approaches to that), and that's why I'm not interested in trading off core algorithm optimizations for even more complexity. However, from the scalability perspective it already looks promising (though to demonstrate that, one needs more CPUs/cores than a typical laptop has).
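
To make the "synchronous parallel" category concrete, here is a minimal CPU-side sketch, assuming a toroidal grid stored row-major in a `boolean[]`. It is neither the aparapi kernel nor my implementation, and the names (`SynchronousLifeSketch`, `step`) are hypothetical. Rows are computed in parallel, but the parallel loop has to finish before the next generation can start, and that wait is the barrier.

```java
import java.util.stream.IntStream;

final class SynchronousLifeSketch {

    /** Computes one Life generation on a w-by-h torus stored row-major in cur. */
    static boolean[] step(boolean[] cur, int w, int h) {
        boolean[] next = new boolean[cur.length];
        // Rows run in parallel. forEach returns only after every row is finished,
        // so each call to step() ends with an implicit barrier between generations.
        IntStream.range(0, h).parallel().forEach(y -> {
            for (int x = 0; x < w; x++) {
                int neighbours = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        if (dx == 0 && dy == 0) continue;   // skip the cell itself
                        int yy = (y + dy + h) % h;          // wrap around vertically
                        int xx = (x + dx + w) % w;          // wrap around horizontally
                        if (cur[yy * w + xx]) neighbours++;
                    }
                }
                boolean alive = cur[y * w + x];
                next[y * w + x] = alive ? (neighbours == 2 || neighbours == 3)
                                        : (neighbours == 3);
            }
        });
        return next;
    }

    public static void main(String[] args) {
        int w = 5, h = 5;
        boolean[] grid = new boolean[w * h];
        grid[2 * w + 1] = grid[2 * w + 2] = grid[2 * w + 3] = true; // horizontal blinker
        for (int gen = 0; gen < 2; gen++) {
            grid = step(grid, w, h);   // generation gen+1 cannot start before gen is complete
        }
        System.out.println(grid[2 * w + 1] && grid[2 * w + 2] && grid[2 * w + 3]);
    }
}
```

However many workers share the rows, all of them wait at the end of each `step` call for the slowest one.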

OlegMazurov avatar Aug 16 '17 16:08 OlegMazurov

you may be right, I'm not sure if OpenCL fully supports async compute. in OpenCL this might be the closest possible: http://aparapi.com/documentation/explicit-buffer-handling.html

but a friend told me that Vulkan is designed for async cases. I'll have to look into that

automenta avatar Aug 16 '17 18:08 automenta

"Asynchronous" is an overloaded term. I explain what I mean by it in the context of Life implementations. In the GPU context, my expectation is that asynchronous computation means that submitting tasks to GPU is disentangled from their execution but the developer will have to synchronize tasks if there is data dependency, which means there will have to be a barrier between generations anyway. A more complicated implementation might be possible but the straightforward one won't benefit from asynchronicity.

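As a CPU-side analogy only (this is not OpenCL, Vulkan, or aparapi code), the sketch below uses `CompletableFuture` to show submission decoupled from execution. The toy `step` rule on a 1D ring stands in for a Life generation, and all names are hypothetical. The whole chain is queued up front and `submit` returns immediately, yet each stage still waits for the previous one because of the data dependency.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

final class AsyncChainSketch {

    /** Toy stand-in for one generation: each cell becomes the XOR of its two ring neighbours. */
    static int[] step(int[] cur) {
        int n = cur.length;
        int[] next = new int[n];
        for (int i = 0; i < n; i++) {
            next[i] = cur[(i - 1 + n) % n] ^ cur[(i + 1) % n];
        }
        return next;
    }

    /** Queues all generations without blocking; the returned future completes with the last one. */
    static CompletableFuture<int[]> submit(int[] start, int generations) {
        CompletableFuture<int[]> chain = CompletableFuture.completedFuture(start);
        for (int g = 0; g < generations; g++) {
            // Submission is asynchronous, but generation g+1 consumes the output of
            // generation g, so the stages still execute strictly one after another.
            chain = chain.thenApplyAsync(AsyncChainSketch::step);
        }
        return chain;
    }

    public static void main(String[] args) {
        int[] start = {0, 0, 1, 0, 0};
        CompletableFuture<int[]> result = submit(start, 2);  // returns before the work is done
        System.out.println(Arrays.toString(result.join()));  // join() waits for the whole chain
    }
}
```

The asynchrony here only frees the submitting thread; the generations themselves gain no extra parallelism from it.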
I slightly changed aparapicellular to measure sustained fps when switching between GPU and CPU. On my Mac with 4 CPU cores and 384 GPU cores I'm getting ~90 fps in CPU mode and ~610 fps in GPU mode. Let's say that's approximately 610/(90/4) ≈ 27 times faster than single-threaded execution. I understand it's not exactly apples-to-apples, but I observed a speedup of ~300x for my implementation on some serious hardware (256 cores + HT). It processed 1850 generations per second, although the grid had 1.6 times fewer cells.
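
The back-of-the-envelope numbers above can be reproduced with the sketch below. It rests on two assumptions: that CPU mode scales linearly over the 4 cores, and (my addition for the last line, not something I measured) that throughput scales inversely with the number of cells. The class and method names are hypothetical.

```java
final class SpeedupEstimate {

    /** Estimated single-thread rate, assuming perfectly linear scaling across cores. */
    static double perCoreFps(double fps, int cores) {
        return fps / cores;
    }

    /** Ratio of a measured rate to a baseline rate. */
    static double speedup(double fps, double baselineFps) {
        return fps / baselineFps;
    }

    public static void main(String[] args) {
        double singleThread = perCoreFps(90, 4);            // 90 fps on 4 CPU cores
        double gpuSpeedup = speedup(610, singleThread);     // 610 fps in GPU mode
        System.out.printf("single-thread estimate: %.1f fps%n", singleThread);
        System.out.printf("GPU vs single thread:   %.1fx%n", gpuSpeedup);

        // Assumption: rate scales inversely with cell count, so 1850 generations/s
        // on a 1.6x smaller grid is rescaled to the larger grid for comparison.
        double rescaled = 1850 / 1.6;
        System.out.printf("256-core run, rescaled: %.0f generations/s%n", rescaled);
    }
}
```

The rescaled figure is only a rough comparison aid, since the two runs used different hardware and different implementations.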

OlegMazurov avatar Aug 16 '17 19:08 OlegMazurov

could you try running aparapi cellular in CPU mode (native OpenCL, not JVM) on the 256 core hardware?

I can't say for sure whether my alife implementations in aparapi cellular are anywhere near optimal, since it was my first experiment with aparapi. so there are potentially better ways of doing it, as well as potential improvements in aparapi and drivers.

automenta avatar Aug 16 '17 19:08 automenta

That was a headless server-class machine, not even Intel, with no GPU (but running Java).

OlegMazurov avatar Aug 16 '17 23:08 OlegMazurov

it could probably still work if there exist cpu opencl drivers for the architecture

for example POCL has cpu-only support and looks relatively portable http://pocl.sourceforge.net/docs/html/install.html#requirements

automenta avatar Aug 17 '17 01:08 automenta