Increasing Cellpose throughput
Dear Cellpose team, thanks for your great tool. We love the quality of segmentation we obtain by using Cellpose. Currently, we are trying to figure out whether we can increase throughput to saturate our GPUs, without having to run multiple instances of Cellpose.

We use roughly the following code:
```python
from cellpose import io, models

model = models.Cellpose(gpu=True, model_type='nuclei')
images = [io.imread(x) for x in paths_of_2D_images]
_ = model.eval(images, ..., batch_size=10000)
```
Apparently, the `batch_size` parameter does not change throughput (and for `10000` I'd actually expect OOM errors). If I understand Cellpose's execution flow correctly (e.g., here), a batch is processed by recursively calling `model.eval()` on each image individually (the `batch_size` variable is not used in the recursion anchor).
- Is it possible to perform segmentation on whole minibatches simultaneously, instead of iterating through individual images? If so, this should result in significantly better performance.
- Orthogonally, in other parts of the code, I believe performance can be improved by simpler means. For example, `UnetModel._run_tiled()` iterates over tiles of images that are extracted by fancy indexing, which creates copies of the data instead of views. The following should result in slightly faster execution and a lower memory footprint (`slice` instead of `np.arange`):

  ```python
  for k in range(niter):
      irange = slice(batch_size*k, batch_size*k+batch_size)
      y0, style = self.network(IMG[irange], return_conv=return_conv)
      y[irange] = y0
  ```
(I can create a PR for this, if there is interest).
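The copy-vs-view distinction behind proposal (2) can be checked directly with NumPy; this is a standalone sketch where `IMG`, `batch_size`, and `k` merely mimic the variables in `_run_tiled`:

```python
import numpy as np

# Stand-in for Cellpose's stack of image tiles.
IMG = np.zeros((64, 3, 224, 224), dtype=np.float32)
batch_size, k = 8, 2

# Fancy indexing with an integer array allocates and copies the tiles...
fancy = IMG[np.arange(batch_size * k, batch_size * k + batch_size)]
# ...while a slice returns a view into the same underlying buffer.
sliced = IMG[slice(batch_size * k, batch_size * k + batch_size)]

print(np.shares_memory(IMG, fancy))   # False: a fresh copy
print(np.shares_memory(IMG, sliced))  # True: no copy made
```

Since the network only reads each minibatch, the view is safe here and avoids one copy of every tile per iteration.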
@VarIr that's great! Have you quantified the performance difference with the simple change (2) that you proposed?
Not yet, this just caught my eye while reading the code. I'll have a closer look and actually implement & benchmark this.
On a closer look, `UnetModel.eval()` seems broken. For example, there are undefined variables used here (`channel_axis`, `z_axis`) and here (`nolist`). That is, the function raises a `NameError`. I suppose vanilla `UnetModel` is currently used nowhere in cellpose, only `CellposeModel`, which inherits from `UnetModel` and overrides `eval()`.

I'm wondering whether `UnetModel.eval()` could be made an abstract method, since it's apparently unused?
@VarIr You are correct, I hadn't noticed that before. Not sure if there would be any benefit to it, but perhaps. Does this get in the way of the optimizations you described earlier?
IMO, the benefit would be reduced maintenance effort in cellpose development. Optimization proposal (2) would then be obsolete; users wouldn't see any difference anyway. Proposal (1) about segmentation of minibatches is unrelated to that, and I think it is also more important and could give substantial speed-ups.
thanks @VarIr if you see a speed-up please make a PR
I'm experiencing the same issue with the batch size. I have a GPU with a lot of memory and I would love to load batches rather than going one by one.
closing due to inactivity
Are there any plans to reopen work on this issue?
We are also planning on using cellpose for segmenting large time-course datasets (small image tiles, but lots of tiles) and see very slow performance because we are not saturating our GPU.
I quickly tested (2) and it gave me a 5X increase in speed on the same dataset. In case you want help implementing @VarIr I'd be happy to help.
thanks for testing that fix in (2)! Would be happy to accept a pull request for it. For (1) we'll have to think about how best to do this. I agree it would be faster, but it would require some refactoring.
Sorry I never followed up on (2), because at the time it seemed the code was unreachable. If you already have an implementation @sophiamaedler, feel free to open the PR.
Regarding (1), back then I also realized it would require quite some restructuring, introducing Datasets and DataLoaders. Still, I think this would be worthwhile: not only to better saturate single GPUs, but also as a good starting point for introducing (Distributed)DataParallel for multi-GPU training/inference as a subsequent step.
I'll probably not find the time to implement everything myself, but would love to help, if this effort could be shared.
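For concreteness, one possible shape for (1) using PyTorch's `Dataset`/`DataLoader`, sketched under assumptions: `ImageDataset` and `run_batched` are hypothetical names, not part of Cellpose's API, and all images are assumed pre-processed to the same shape:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ImageDataset(Dataset):
    """Hypothetical dataset wrapping pre-loaded, equally-shaped 2D images."""
    def __init__(self, images):
        self.images = images

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # (C, H, W) float tensor; real code would normalize/tile here.
        return torch.as_tensor(self.images[idx], dtype=torch.float32)

def run_batched(net, images, batch_size=8, device="cpu"):
    """Run the network on whole minibatches instead of one image at a time."""
    loader = DataLoader(ImageDataset(images), batch_size=batch_size,
                        num_workers=0, pin_memory=(device != "cpu"))
    outputs = []
    with torch.no_grad():
        for batch in loader:                      # batch: (B, C, H, W)
            outputs.append(net(batch.to(device)).cpu())
    return torch.cat(outputs)
```

A `DataLoader` also gives prefetching via `num_workers` for free, which helps keep the GPU fed while the CPU loads the next batch; the flow-to-mask post-processing would still run per image afterwards.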
Was (1) ever implemented? @carsen-stringer