
Increasing Cellpose throughput

VarIr opened this issue 3 years ago · 13 comments

Dear Cellpose team, thanks for your great tool.

We love the quality of segmentation we obtain by using Cellpose. Currently, we are trying to figure out whether we can increase throughput to saturate our GPUs, without having to run multiple instances of Cellpose.

We use roughly the following code:

from cellpose import io, models
model = models.Cellpose(gpu=True, model_type='nuclei')
images = [io.imread(x) for x in paths_of_2D_images]
_ = model.eval(images, ..., batch_size=10000)

Apparently, the batch_size parameter does not change throughput (and for 10000 I'd actually expect OOM errors). If I understand Cellpose's execution flow correctly (e.g., here), a batch is processed by calling model.eval() recursively, and each image in the batch is processed individually (the batch_size variable is not used in the recursion anchor).
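To make the pattern concrete, here's a simplified sketch of the recursion described above (illustrative only, not Cellpose's actual code; segment_one stands in for the per-image segmentation step):

```python
# Illustrative sketch: a list input is unrolled image by image,
# so batch_size has no effect on how many images reach the GPU at once.
def eval_images(segment_one, x, batch_size=8):
    if isinstance(x, list):
        # Recursion anchor: batch_size is passed along but never used
        # to group images into a single forward pass.
        return [eval_images(segment_one, xi, batch_size=batch_size) for xi in x]
    return segment_one(x)
```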

  1. Is it possible to perform segmentation on whole minibatches simultaneously, instead of iterating through individual images? If so, this should result in significantly better performance.

  2. Orthogonally, in other parts of the code, I believe performance can be improved by simpler means. For example, UnetModel._run_tiled() iterates over tiles of images that are extracted by fancy indexing, which creates copies of the data instead of views. The following should result in slightly faster execution and a lower memory footprint (slice instead of np.arange):

for k in range(niter):
    irange = slice(batch_size*k, batch_size*k+batch_size)
    y0, style = self.network(IMG[irange], return_conv=return_conv)
    y[irange] = y0

(I can create a PR for this, if there is interest).
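The copy-vs-view difference can be verified with a minimal NumPy check (the array shape and names below are made up for illustration, not Cellpose's actual values):

```python
import numpy as np

# Hypothetical stand-in for the tile stack in _run_tiled().
IMG = np.zeros((100, 2, 224, 224), dtype=np.float32)
batch_size, k = 8, 3

# Fancy indexing with np.arange copies the selected tiles...
idx = np.arange(batch_size * k, batch_size * k + batch_size)
copy = IMG[idx]
assert not np.shares_memory(copy, IMG)

# ...while a slice returns a view into the same buffer.
sl = slice(batch_size * k, batch_size * k + batch_size)
view = IMG[sl]
assert np.shares_memory(view, IMG)
```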

VarIr avatar Feb 21 '22 11:02 VarIr

@VarIr that's great! Have you quantified the performance difference with the simple change (2) that you proposed?

kevinjohncutler avatar Feb 22 '22 10:02 kevinjohncutler

Not yet, this just caught my eye while reading the code. I'll have a closer look and actually implement & benchmark this.

VarIr avatar Feb 22 '22 11:02 VarIr

On a closer look, UnetModel.eval() seems broken. For example, undefined variables are used here (channel_axis, z_axis) and here (nolist), so the function raises a NameError. I suppose vanilla UnetModel is currently used nowhere in cellpose, only CellposeModel, which inherits from UnetModel and overrides eval().

I'm wondering whether UnetModel.eval() could be made an abstract method, since it's apparently unused?

VarIr avatar Feb 22 '22 13:02 VarIr

@VarIr You are correct, I hadn't noticed that before. Not sure if there would be any benefit to it, but perhaps. Does this get in the way of the optimizations you described earlier?

kevinjohncutler avatar Mar 11 '22 23:03 kevinjohncutler

IMO, the benefit would be reduced maintenance effort in cellpose development. Optimization proposal (2) would then be obsolete; users would see no difference anyway. Proposal (1) about segmentation of minibatches is unrelated to that, and I think it is also more important and could give substantial speed-ups.

VarIr avatar Mar 12 '22 09:03 VarIr

thanks @VarIr if you see a speed-up please make a PR

carsen-stringer avatar Apr 06 '22 20:04 carsen-stringer

I'm experiencing the same issue with the batch size. I have a GPU with a lot of memory and I would love to load batches rather than going one by one.

greenmossball avatar Feb 03 '23 18:02 greenmossball

closing due to inactivity

carsen-stringer avatar May 09 '23 19:05 carsen-stringer

Are there any plans to reopen work on this issue?

We are also planning on using cellpose for segmenting large time course datasets (small image tiles but lots of tiles) and have very slow performance since we are not saturating our GPU.

sophiamaedler avatar May 16 '23 13:05 sophiamaedler


I quickly tested (2) and it gave me a 5X speed increase on the same dataset. In case you want help implementing it, @VarIr, I'd be happy to help.

sophiamaedler avatar May 16 '23 16:05 sophiamaedler

Thanks for testing that fix in (2)! Would be happy to accept a pull request for it. For number (1) we'll have to think about how best to do this. I agree it would be faster, but it would require some refactoring.

carsen-stringer avatar May 16 '23 16:05 carsen-stringer

Sorry I never followed up on (2), because at the time it seemed the code was unreachable. If you already have an implementation @sophiamaedler, feel free to open the PR.

Regarding (1), back then I also realized it would require quite some restructuring, introducing Datasets and DataLoaders. Still, I think this would be worthwhile: not only would it better saturate single GPUs, it would also be a good starting point for introducing (Distributed)DataParallel for multi-GPU training/inference as a subsequent step.
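A minimal sketch of that Dataset/DataLoader approach, assuming PyTorch and equally-sized preprocessed tiles (TileDataset, run_batched, and the net argument are all hypothetical names, not Cellpose's API):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TileDataset(Dataset):
    """Hypothetical dataset wrapping preprocessed, equally-sized tiles."""
    def __init__(self, tiles):
        self.tiles = tiles  # sequence of (C, H, W) float32 tiles

    def __len__(self):
        return len(self.tiles)

    def __getitem__(self, i):
        return torch.as_tensor(self.tiles[i])

def run_batched(net, tiles, batch_size=64, device="cuda"):
    """Feed whole minibatches to the network instead of one tile at a time."""
    loader = DataLoader(TileDataset(tiles), batch_size=batch_size)
    outputs = []
    net.eval()
    with torch.no_grad():
        for batch in loader:  # batch has shape (B, C, H, W)
            outputs.append(net(batch.to(device)).cpu())
    return torch.cat(outputs)
```

Wrapping the data this way keeps batching logic out of the model code, and a DistributedSampler could later be dropped into the DataLoader for multi-GPU work.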

I'll probably not find the time to implement everything myself, but would love to help, if this effort could be shared.

VarIr avatar May 17 '23 08:05 VarIr

Was (1) ever implemented? @carsen-stringer

Nespresso2000 avatar Apr 20 '24 10:04 Nespresso2000