PiPPy
How to use PiPPy for large models that won't fit on one GPU
Hello, I was wondering if someone could provide an example or some guidance on how to use PiPPy for models that will not fit on one GPU. I want to run pipeline parallelism with Llama2 70B on a node with multiple A100 GPUs. However, if I run the pippy_llama.py example, every process tries to load the whole model onto the GPU corresponding to its local rank, which causes a CUDA out-of-memory error.
Hi, that's indeed an important use case.
In the folder below, we have a CPU initialization example based on GPT2: https://github.com/pytorch/PiPPy/tree/main/examples/cpu_init PiPPy allows you to create the model on CPU, turn it into a pipeline, and move the different stages onto their corresponding GPUs.
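To illustrate the CPU-init flow, here is a minimal sketch in plain PyTorch (not PiPPy's actual API): a tiny stand-in model is built on CPU, split by hand into two stages, and each stage is moved to its own device, which is roughly what the pipeline does per rank.

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for a large LLM; the real flow with
# PiPPy is: build on CPU -> pipeline(...) -> move each stage to its GPU.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 8),
)

# Manually split into two "stages" to illustrate what PiPPy does for you.
layers = list(model.children())
stage0 = nn.Sequential(*layers[:2])
stage1 = nn.Sequential(*layers[2:])

# Each rank would move only its own stage, e.g. to f"cuda:{local_rank}".
device0 = "cuda:0" if torch.cuda.is_available() else "cpu"
device1 = "cuda:1" if torch.cuda.device_count() > 1 else device0
stage0.to(device0)
stage1.to(device1)

x = torch.randn(4, 16, device=device0)
h = stage0(x).to(device1)  # activations cross the stage boundary
y = stage1(h)
print(y.shape)  # torch.Size([4, 8])
```

In the real setup, each process holds only its own stage's parameters, so no single GPU ever sees the full model.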
We also have partial support for meta-device initialization. Today, one can create a pipeline from a meta model:
```python
with torch.device("meta"):
    model = Model(...)
pipe = pipeline(model, ...)
```
However, we are still working on loading weights into different pipeline stages on different processes (so as to turn the meta stages into materialized stages). We will update here when that's complete.
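The materialization step being worked on can be sketched with stock PyTorch primitives (this is only an illustration of the idea, not PiPPy's loading API): a stage built on the meta device allocates no memory, is later given empty storage on its target device via `to_empty`, and then loads just its own slice of the checkpoint.

```python
import torch
import torch.nn as nn

# Build the stage's module on the meta device: no memory is allocated.
with torch.device("meta"):
    stage = nn.Linear(16, 8)
assert stage.weight.is_meta

# Materialize this stage's parameters on the target device with
# uninitialized storage ("cpu" here; a real run would use a GPU device).
stage.to_empty(device="cpu")

# Stand-in for reading only this stage's weights from the checkpoint.
checkpoint = {"weight": torch.zeros(8, 16), "bias": torch.zeros(8)}
stage.load_state_dict(checkpoint)

y = stage(torch.randn(2, 16))
print(y.shape)  # torch.Size([2, 8])
```

Done per stage and per rank, this keeps peak memory bounded by the largest stage rather than the whole model.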
Hopefully, the CPU init example can unblock your use case for now. Though it may require some amount of CPU RAM on your machine.
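For a sense of scale, a rough back-of-the-envelope estimate of the CPU RAM needed to hold the full model before sharding (assuming half-precision weights only, ignoring activations and optimizer state):

```python
# Llama2 70B, fp16/bf16: 2 bytes per parameter.
params = 70e9
bytes_per_param = 2
gib = params * bytes_per_param / 2**30
print(f"~{gib:.0f} GiB")  # ~130 GiB
```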
Cc: @LucasLLC @wconstab
If you can share with us which type of checkpoint format you want support of, that would help prioritize things too.
Thank you for getting back to me; the CPU init example was indeed very helpful. I am using the Hugging Face Transformers library to load the Llama2 70B model via the .from_pretrained() method. The checkpoint format is:
- `hash.hash.lock` (lock file)
- `hash.hash.json` (configuration file)
- `hash.hash` (binary file containing the model's parameters)
Cc @LucasLLC @wconstab
We'll be integrating PiPPy into TorchTrain soon, and along with that we'll get meta-initialization or CPU-initialization working nicely as an example for folks to see. Are you unblocked for now?