Seeing no performance gains with larger batch_size on 3D images
Hello,
I am using the distributed cellpose module to run inference on a terabyte-scale 3D image (~5000x10000x10000 ZYX). I would like to use as much RAM + VRAM as possible to reduce the inference time. I have a mix of GPU nodes on an SGE cluster: some L40s with 40GB of VRAM and some A100s with 80GB. I am using a blocksize of 768^3, which gives about 500 blocks for my dataset after empty blocks are filtered out with a foreground mask.
I have tried different values for batch_size, but they all result in the same runtime per block. I do see that larger batch_size values use more VRAM, so I am confused about why the per-block inference time does not change.
I tested this on a smaller-than-RAM 3D image (256x128x128 ZYX) and observed the same thing. There was no difference in runtime using batch_size 1, 4, 8, 16, 32, or 64.
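A simplified sketch of the timing loop I used for that test (the synthetic volume and the model_type below are just placeholders, not my exact data or model):

```python
import time
import numpy as np
from cellpose import models

# Synthetic stand-in for the small 256x128x128 ZYX test volume.
vol = np.random.rand(256, 128, 128).astype(np.float32)

model = models.CellposeModel(gpu=True, model_type="cyto3")

for bs in (1, 4, 8, 16, 32, 64):
    t0 = time.perf_counter()
    masks, flows, styles = model.eval(
        vol,
        do_3D=True,
        batch_size=bs,   # only changes how many slices go through the net per forward pass
        channels=[0, 0],
    )
    print(f"batch_size={bs}: {time.perf_counter() - t0:.1f} s")
```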
Is this expected behavior, and am I misunderstanding how batch_size works?
Thanks!
Hi @vbrow29, increasing batch_size only puts more data through the network at once; the overall runtime is often limited by the post-processing steps (following the flows to reconstruct masks), which run on the whole slice/volume and do not speed up with a larger batch_size.
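One way to see where the time goes is to compare a network-only pass with a full pass. A rough sketch, assuming a recent cellpose version where eval accepts compute_masks (the synthetic volume and model_type are placeholders):

```python
import time
import numpy as np
from cellpose import models

vol = np.random.rand(256, 128, 128).astype(np.float32)
model = models.CellposeModel(gpu=True, model_type="cyto3")

for bs in (1, 8, 64):
    # Network only: skip the flow-to-mask post-processing.
    t0 = time.perf_counter()
    model.eval(vol, do_3D=True, batch_size=bs, channels=[0, 0], compute_masks=False)
    t_net = time.perf_counter() - t0

    # Full pipeline: network pass plus dynamics/mask reconstruction.
    t0 = time.perf_counter()
    model.eval(vol, do_3D=True, batch_size=bs, channels=[0, 0], compute_masks=True)
    t_full = time.perf_counter() - t0

    print(f"batch_size={bs}: net-only ~{t_net:.1f} s, full eval ~{t_full:.1f} s")
```

If the net-only time drops with larger batch_size but the full eval time stays flat, the post-processing is what's dominating your per-block runtime.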
We're continuing to work on improving memory usage and runtime, so there may still be some bugs.