Need better heuristic for default value of LAZYFLOW_THREADS
Lou Scheffer was experiencing poor interactive performance during pixel classification on a large 2D image. But on my machine, I get good performance, even using the same project file. The most notable difference between our two systems is the number of CPUs: Lou has 48 (presumably, half are hyper-threads).
We timed (with a stopwatch) how long it took to completely predict all tiles of a 12577x750 image using various settings of `LAZYFLOW_THREADS`. We found that the benefit of using more CPUs disappears rather quickly: maximum performance is achieved around 4-8 CPUs, and using all 48 CPUs is nearly as bad as using just a single thread.
I believe that our upcoming switch to 512px tiles will improve the situation, but we should still come up with a better heuristic for how to set `LAZYFLOW_THREADS` if the user hasn't set it themselves. Otherwise, users with beefy workstations are likely to experience worse performance than users with modern laptops.
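For discussion, here is a minimal sketch of what such a default could look like. The function name, the hyper-threading guess, and the cap of 8 are illustrative assumptions based on the timings below, not ilastik's actual logic:

```python
import os

def suggest_default_threads(cap=8):
    """Illustrative default for LAZYFLOW_THREADS when the user hasn't set it.

    Assumption (not ilastik's current behavior): use a guess at the physical
    core count, but never exceed a small cap, since the timings below show
    throughput degrading beyond ~8 threads.
    """
    env_value = os.environ.get("LAZYFLOW_THREADS")
    if env_value:
        return int(env_value)              # an explicit user setting always wins
    logical = os.cpu_count() or 1
    physical_guess = max(1, logical // 2)  # crude: assume hyper-threading doubles the count
    return min(physical_guess, cap)
```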
| LAZYFLOW_THREADS | Time (seconds) |
| --- | --- |
| 1 | 95 |
| 2 | 63 |
| 4 | 37 |
| 8 | 35 |
| 16 | 41 |
| 32 | 60 |
| 48 (max) | 72 |
- Image Size: (12577, 750)
- Features: All
- Prediction classes: 2
- Viewer tile size: 256x256
- OS: Fedora 20
- ilastik version: 1.2.0
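The numbers above were taken with a stopwatch in the GUI. A rough sketch of how the same sweep could be reproduced with ilastik's headless mode is below; the project and image paths are placeholders, and older releases may need a slightly different invocation:

```python
import os
import subprocess
import time

# Placeholder paths -- substitute your own project file and raw data.
PROJECT = "pixel_classification.ilp"
IMAGE = "large_2d_image.png"

for n_threads in (1, 2, 4, 8, 16, 32, 48):
    env = dict(os.environ, LAZYFLOW_THREADS=str(n_threads))
    start = time.time()
    subprocess.run(
        ["./run_ilastik.sh", "--headless", f"--project={PROJECT}", IMAGE],
        env=env,
        check=True,
    )
    print(f"{n_threads:3d} threads: {time.time() - start:6.1f} s")
```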
There is some hope that the situation will magically improve when we upgrade to Python 3, since it uses the "New GIL" implementation:
http://www.dabeaz.com/python/NewGIL.pdf
But it's difficult to say for sure.
I'm writing this here since it might be related and the old issue is now closed. Our switch to 512x512 tiles for 2D gave a very noticeable speed-up there but, as I just found, also a very noticeable slow-down for 3D data. As a first step, we should make the tile size conditional on the dataset dimensions (see the sketch below), but it would also be good to understand this behaviour in principle.
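A hypothetical sketch of what "tile size conditional on the dataset dimensions" could mean; the function, the z==1 test, and the 3D block shape are assumptions for illustration, only the 512x512 value comes from this thread:

```python
def choose_tile_shape(dataset_shape_zyx):
    """Pick a tile/block shape based on dataset dimensionality.

    Illustrative policy: 2D data (a single z slice) gets the large 512x512
    tiles that sped things up, while true 3D volumes fall back to smaller
    tiles to keep per-request volumes (and halos) manageable.
    """
    z, y, x = dataset_shape_zyx
    if z == 1:                  # effectively 2D
        return (1, 512, 512)
    return (32, 256, 256)       # made-up 3D block; the real choice needs benchmarking

print(choose_tile_shape((1, 750, 12577)))   # the 2D image from this issue -> (1, 512, 512)
```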
It looks like the default behavior has only gotten worse in the new version. Here is a screenshot of my machine doing autocontext: the whole thing feels slow, and most of the time it is using fewer than 8 cores, each at less than 100%.
So I did some benchmarking on a machine with 20 cores / 40 threads. The task was pixel classification on a CREMI sample dataset. In addition to varying `n_threads`, I also varied the amount of RAM for lazyflow, with some interesting results:
In short, I could reproduce the behavior that @stuarteberg showed in the first post. However, with more RAM, ilastik could make use of more and more threads without slowing down.
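For reference, such a two-parameter sweep can be scripted roughly as follows, extending the single-parameter loop above to also vary `LAZYFLOW_TOTAL_RAM_MB`; the paths and grid values are placeholders and this is not the exact setup used for the numbers above:

```python
import itertools
import os
import subprocess
import time

PROJECT = "cremi_pixel_classification.ilp"   # placeholder project file
VOLUME = "cremi_sample_raw.h5"               # placeholder raw data

for n_threads, ram_mb in itertools.product((4, 8, 16, 32, 40), (4000, 8000, 16000, 32000)):
    env = dict(os.environ,
               LAZYFLOW_THREADS=str(n_threads),
               LAZYFLOW_TOTAL_RAM_MB=str(ram_mb))
    start = time.time()
    subprocess.run(["./run_ilastik.sh", "--headless", f"--project={PROJECT}", VOLUME],
                   env=env, check=True)
    print(f"threads={n_threads:2d}  ram={ram_mb:5d} MB  ->  {time.time() - start:6.1f} s")
```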
I guess we see three effects:
- With small amounts of RAM and many threads, we get very small block sizes, so the halo overhead slows us down (see the rough arithmetic after this list).
- With a small number of threads and plenty of RAM, threads start concurrently and finish at approximately the same time; when writing occurs, the other threads are also more or less finished and simply wait for the write to complete.
- At some point (all threads more or less running all the time, one thread writing all the time) we hit saturation, which can only be overcome by parallel writing.
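To make the first effect concrete, here is a back-of-the-envelope calculation for a square 2D block. The 32-pixel halo is an assumption for illustration; the real halo depends on the selected features:

```python
def halo_overhead(block_edge, halo=32):
    """Fraction of computed pixels that belong to the halo of a square 2D block.

    `halo` is an illustrative value, not ilastik's actual per-feature halo.
    """
    inner = block_edge ** 2
    outer = (block_edge + 2 * halo) ** 2
    return 1.0 - inner / outer

for edge in (512, 256, 128, 64):
    print(f"{edge:4d}px block: {halo_overhead(edge):5.1%} of the work is halo")
```

With these assumptions, the halo share grows from about a fifth of the work at 512px blocks to three quarters at 64px blocks.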
So in the end, we should find a better heuristic that sets both `LAZYFLOW_THREADS` and `LAZYFLOW_TOTAL_RAM_MB` jointly.
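A hedged sketch of what such a joint default could look like: give each worker a minimum RAM budget and derive the thread count from that. The 1 GiB-per-thread floor and the cap are made-up numbers for illustration, not a proposal for the actual values:

```python
import os

def suggest_defaults(total_ram_mb, min_ram_per_thread_mb=1024, thread_cap=8):
    """Jointly pick defaults for LAZYFLOW_THREADS and LAZYFLOW_TOTAL_RAM_MB.

    Illustrative policy (not ilastik's current behavior): never run more
    threads than the RAM budget can feed with reasonably large blocks,
    and never exceed a small cap, per the timings earlier in this thread.
    """
    logical = os.cpu_count() or 1
    by_cpu = max(1, logical // 2)                        # guess at physical cores
    by_ram = max(1, total_ram_mb // min_ram_per_thread_mb)
    n_threads = min(by_cpu, by_ram, thread_cap)
    return n_threads, total_ram_mb

print(suggest_defaults(total_ram_mb=16000))   # e.g. (8, 16000) on a large machine
```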
This issue has been mentioned on Image.sc Forum. There might be relevant details there:
https://forum.image.sc/t/notable-memory-usage-difference-when-running-ilastik-in-headless-mode-on-different-machines/41144/2
This issue has been mentioned on Image.sc Forum. There might be relevant details there:
https://forum.image.sc/t/cpu-and-ram-core-limit-for-ilastik/52428/2
This issue has been mentioned on Image.sc Forum. There might be relevant details there:
https://forum.image.sc/t/multiphase-segmentation-and-other-questions/78696/4