
changing `devices` to `fabric.world_size` in the pretrain code

Open LamOne1 opened this issue 1 year ago • 1 comment

Hello,

According to our discussion here, I think `devices` should be changed to `fabric.world_size` in the pretraining code, since the batch size refers to the global batch size, while `devices` in the code equals the number of GPUs on a single node: `process_batch_size = batch_size // fabric.world_size`.

I believe the same goes for `max_iters = 600000  # num_epochs * (epoch_size // micro_batch_size) // devices`.
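For illustration, here is a minimal sketch of both proposed changes, assuming variable names like those in the lit-llama pretraining script (`batch_size`, `micro_batch_size`, `num_epochs`, `epoch_size`) and a `lightning.Fabric` instance; the concrete values are placeholders, not taken from the repo:

```python
import lightning as L

batch_size = 128        # global batch size across all processes (placeholder value)
micro_batch_size = 4
num_epochs = 1
epoch_size = 1_000_000  # illustrative dataset size in samples

# With 4 GPUs per node and 2 nodes, fabric.world_size == 8,
# while `devices` in the script would only be 4.
fabric = L.Fabric(accelerator="cuda", devices=4, num_nodes=2)
fabric.launch()

# Per-process batch size: divide the global batch size by the total number
# of processes (world size), not by the per-node device count.
process_batch_size = batch_size // fabric.world_size
gradient_accumulation_steps = process_batch_size // micro_batch_size

# Same reasoning applies to the iteration count.
max_iters = num_epochs * (epoch_size // micro_batch_size) // fabric.world_size
```

On a single node the two are equivalent, so the difference only shows up in multi-node runs.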

LamOne1 avatar Jun 08 '23 05:06 LamOne1

Hi @LamOne1, the suggestion sounds good to me for `process_batch_size = batch_size // fabric.world_size`. The reason it was not done for the Shakespeare script is that multi-machine training is not really needed for that amount of data, and since the RedPajama script was based on the same code, the pattern was carried over. In any case, using the world size is correct in the general case.

For `max_iters`, I honestly think it should be kept as "infinite" for practical reasons, but I'm fine with either if it doesn't complicate things.

awaelchli avatar Jun 08 '23 14:06 awaelchli