Seeing no performance gains with larger batch_size on 3D images
Hello,
I am using the distributed cellpose module to run inference on a terabyte-scale 3D image (~5000x10000x10000 ZYX). I would like to use as much RAM + VRAM as possible to reduce the inference time. I have a mix of GPU nodes on an SGE cluster: some L40s with 40GB of VRAM and some A100s with 80GB. I am using a blocksize of 768^3, which gives about 500 blocks for my dataset after empty blocks are filtered out with a foreground mask.
I have tried different values for batch_size, but they all result in the same runtime per block. I do see that larger batch_size values use more VRAM, so I am confused about why the per-block inference time does not change.
I tested this on a smaller-than-RAM 3D image (256x128x128 ZYX) and observed the same thing. There was no difference in runtime using batch_size 1, 4, 8, 16, 32, or 64.
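A simplified sketch of the timing loop I used for that test (the synthetic volume and the model_type below are just placeholders, not my exact data or model):

```python
import time
import numpy as np
from cellpose import models

# Synthetic stand-in for the small 256x128x128 ZYX test volume.
vol = np.random.rand(256, 128, 128).astype(np.float32)

model = models.CellposeModel(gpu=True, model_type="cyto3")

for bs in (1, 4, 8, 16, 32, 64):
    t0 = time.perf_counter()
    masks, flows, styles = model.eval(
        vol,
        do_3D=True,
        batch_size=bs,   # only changes how many slices go through the net per forward pass
        channels=[0, 0],
    )
    print(f"batch_size={bs}: {time.perf_counter() - t0:.1f} s")
```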
Is this expected behavior, and am I misunderstanding how batch_size works?
Thanks!
Hi @vbrow29, increasing batch_size only puts more data through the network at once; the overall runtime is often limited by the post-processing steps (following the flows to reconstruct masks), which run on the whole slice/volume and do not speed up with a larger batch_size.
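One way to see where the time goes is to compare a network-only pass with a full pass. A rough sketch, assuming a recent cellpose version where eval accepts compute_masks (the synthetic volume and model_type are placeholders):

```python
import time
import numpy as np
from cellpose import models

vol = np.random.rand(256, 128, 128).astype(np.float32)
model = models.CellposeModel(gpu=True, model_type="cyto3")

for bs in (1, 8, 64):
    # Network only: skip the flow-to-mask post-processing.
    t0 = time.perf_counter()
    model.eval(vol, do_3D=True, batch_size=bs, channels=[0, 0], compute_masks=False)
    t_net = time.perf_counter() - t0

    # Full pipeline: network pass plus dynamics/mask reconstruction.
    t0 = time.perf_counter()
    model.eval(vol, do_3D=True, batch_size=bs, channels=[0, 0], compute_masks=True)
    t_full = time.perf_counter() - t0

    print(f"batch_size={bs}: net-only ~{t_net:.1f} s, full eval ~{t_full:.1f} s")
```

If the net-only time drops with larger batch_size but the full eval time stays flat, the post-processing is what's dominating your per-block runtime.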
We're continuing to work on improving memory usage and runtime, so there may still be some bugs.