Megatron-LM

How does micro-batch-size influence the throughput per GPU?

bugm opened this issue 1 year ago • 0 comments

Hi, I am testing how micro-batch-size influences the throughput per GPU with a constant global-batch-size. The results show that as the micro-batch-size increases, the throughput per GPU (TFLOP/s/GPU) also increases.

I ran some tests with a ~400M-parameter transformer-based model on 2 A40 GPUs, using only data parallelism. Here are the training arguments:

(screenshot of training arguments)

Across the tests I only change the micro-batch-size, training for 100 iterations with seq_len = 1024 and global-batch-size = 24. Here are the results with different micro-batch-sizes:

(screenshot of per-micro-batch-size results)

I print the log every 5 iterations and compute the averaged throughput per GPU. For each iteration the total computational work is the same, yet the throughput per GPU increases as the micro-batch-size increases. I suspect this may be related to GPU cache behaviour or arithmetic intensity, but I am not quite clear on it. Can anyone provide an in-depth explanation?
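To make the effect easier to reproduce outside Megatron, here is a minimal sketch (the hidden size, dtype, and shapes are my own assumptions, not taken from the logs above) that times a single FFN-style GEMM at several micro-batch sizes. Larger micro-batches feed the GPU bigger matrix multiplications, so kernel launch and memory-traffic overheads are amortized over more work and the achieved TFLOP/s rises, even though the FLOPs per sample stay the same:

```python
# Illustrative benchmark only; sizes are assumptions, not the issue's exact config.
import time
import torch

hidden = 1024      # assumed hidden size for a ~400M-parameter model
seq_len = 1024
device = "cuda"

def bench(micro_batch, iters=50):
    # One FFN-style GEMM: (micro_batch * seq_len, hidden) x (hidden, 4 * hidden)
    x = torch.randn(micro_batch * seq_len, hidden, device=device, dtype=torch.float16)
    w = torch.randn(hidden, 4 * hidden, device=device, dtype=torch.float16)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        y = x @ w
    torch.cuda.synchronize()
    elapsed = time.time() - t0
    flops = 2 * x.shape[0] * hidden * 4 * hidden * iters
    return flops / elapsed / 1e12      # achieved TFLOP/s

for mb in (1, 2, 4, 8, 12):
    print(f"micro-batch {mb}: {bench(mb):.1f} TFLOP/s")
```

On most GPUs this prints higher achieved TFLOP/s as the micro-batch grows, which mirrors the trend in the table above; the effect in a full training step is the same idea applied to every GEMM in the model.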

bugm · Oct 12 '24 08:10