
GPU memory calculator

Open · awaelchli opened this issue 1 year ago · 1 comment

When scaling models to multiple GPUs, it can be difficult to estimate the memory requirements in advance, especially since they are a function of:

  • batch size
  • optimizer type
  • precision settings
  • number of parameters
  • model architecture / hyperparameters
  • FSDP settings
  • and more

We could provide a simple calculator tool that, given a model config from lit-gpt and a set of the parameters listed above, computes the per-GPU memory requirement, or at least a rough estimate of it. With that, the user could choose the appropriate machine type, number of GPUs, number of machines, etc. to launch their job.
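As a rough illustration of why this is hard to eyeball: assuming Adam with fp32 master weights and optimizer states under bf16 mixed precision, each parameter costs roughly 2 + 2 + 4 + 4 + 4 = 16 bytes (bf16 weight, bf16 gradient, fp32 master weight, and the two fp32 Adam moments), so a 7B-parameter model already needs on the order of 112 GB of model state before any activations. That is exactly the kind of back-of-envelope arithmetic the tool would automate, including how FSDP sharding divides it across GPUs.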

The memory usage is composed of:

  • parameters
  • activations
  • gradients
  • optimizer states

We need to compute these and sum them up to give the total estimated memory usage.
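A minimal sketch of that summation (not lit-gpt code; the function name, defaults, and the flat `activation_bytes` input are placeholders, and it assumes an Adam-style optimizer with FSDP fully sharding parameters, gradients, and optimizer states):

```python
def estimate_per_gpu_memory_gib(
    num_params: int,                      # total trainable parameters
    param_bytes: int = 2,                 # bf16/fp16 weights
    grad_bytes: int = 2,                  # gradients in the same precision
    optimizer_states_per_param: int = 2,  # Adam: exp_avg + exp_avg_sq
    optimizer_state_bytes: int = 4,       # fp32 optimizer states
    activation_bytes: int = 0,            # depends on batch size, seq len, architecture
    world_size: int = 1,                  # FSDP shards model state across ranks
) -> float:
    # Model state (params + grads + optimizer states) is sharded by FSDP.
    sharded = (
        num_params * param_bytes
        + num_params * grad_bytes
        + num_params * optimizer_states_per_param * optimizer_state_bytes
    ) / world_size
    # Activations are per-rank and are not sharded by FSDP.
    total_bytes = sharded + activation_bytes
    return total_bytes / 2**30


# Example: 7B parameters sharded over 8 GPUs, ignoring activations
print(f"{estimate_per_gpu_memory_gib(7_000_000_000, world_size=8):.1f} GiB per GPU")
```

In the actual tool, `activation_bytes` would itself have to be derived from batch size, sequence length, and the model architecture/hyperparameters (and whether activation checkpointing is enabled), which is where most of the estimation work lies.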

cc @lantiga @carmocca @rasbt

awaelchli avatar Feb 07 '24 23:02 awaelchli

This could be based on:

  • https://github.com/EleutherAI/cookbook/blob/main/calc/calc_transformer_mem.py
  • https://vram.asmirnov.xyz/

This could be run at the beginning of the training script or be a separate script that you call.

(from #920)

carmocca avatar Feb 07 '24 23:02 carmocca