Batch size changes with number of processes

Open pbontrager opened this issue 1 year ago • 0 comments

Currently our batch size is a local (per-GPU) batch size. This means that with bs=4, if you launch on 4 GPUs, each GPU gets 4 data points and your real (global) batch size is 16. This is problematic because the learning rate should be scaled when the global batch size changes. There are two options:

  1. batch_size -> local_batch_size: we leave the batch size math up to the user and just clearly indicate that it's a local (per-GPU) batch size.
  2. local_batch_size = batch_size // world_size: the configured batch size is treated as global and divided by the world size. This allows running recipes across different machine setups without updating config values, but you have to deal with rounding when the batch size isn't divisible by the world size (see the sketch after this list).
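A minimal sketch of what option 2 could look like, assuming the world size comes from `torch.distributed`; the helper name and the error-on-rounding policy here are illustrative assumptions, not the actual torchtune API:

```python
import torch.distributed as dist

def get_local_batch_size(global_batch_size: int, world_size: int) -> int:
    # Option 2: derive the per-GPU batch size from a global value in the config.
    # Rounding is the caveat: a global batch size that is not divisible by
    # world_size either has to error out (as here) or silently change the
    # effective global batch size.
    if global_batch_size % world_size != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"world size {world_size}"
        )
    return global_batch_size // world_size

# Example: the config says batch_size=16; on 4 GPUs each rank gets 4 samples,
# so the effective global batch size stays 16 regardless of machine setup.
world_size = dist.get_world_size() if dist.is_initialized() else 1
local_batch_size = get_local_batch_size(16, world_size)
```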

pbontrager · Jan 22 '24 20:01