PiPPy
Unexpected Memory Usage and Latency with PP
When running the examples/llama/pippy_llama.py script on two A800 GPUs, each rank is observed to consume memory equal to the full model size, rather than the weights being split across the two GPUs. Additionally, the latency differs from the expected values:
Rank: 0 Forward Latency: 0.7621021270751953s Peak memory: 26.341GiB
Rank: 1 Forward Latency: 0.8798844814300537s Peak memory: 26.204GiB
For comparison, when utilizing a single GPU, the performance metrics are as follows:
Rank: 0 Forward Latency: 0.45336079597473145s Peak memory: 26.252GiB
These results are measured with the following code:
import time
import torch

torch.cuda.reset_peak_memory_stats(device)
start = time.time()
if rank == 0:
    args = inputs["input_ids"]   # only the first stage is fed the input ids
else:
    args = None                  # later stages receive activations from the previous stage
output = schedule.step(args)
end = time.time()
peak_mem = torch.cuda.max_memory_allocated(device)
Upon trying the initialization settings from examples/cpu_init/gpt2_cpu_init.py, a RuntimeError occurs when the stage is placed on a CUDA device while the pipeline was created on the CPU device:
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
So I wonder whether this kind of memory consumption and latency is normal for PP?
PS: The issue persists across different versions of the software:
- the latest torchpippy installed from source with torch==2.4.0.dev.20240411
- torchpippy==0.2.0 from pip with torch==2.2.2
Hi, on latency, if you measure the first iteration, it will include the distributed initialization time (e.g. NCCL communicator initialization). You can try giving it some warm-up runs and then measure the latency.
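A minimal sketch of what that could look like, reusing schedule, args, rank, and device from the measurement code above (the warm-up count is arbitrary, and the explicit synchronization is general CUDA timing practice rather than something PiPPy requires):

for _ in range(3):
    schedule.step(args)                      # first iterations pay one-time NCCL/CUDA init costs

torch.cuda.synchronize(device)               # wait for outstanding GPU work
torch.cuda.reset_peak_memory_stats(device)   # exclude warm-up from the peak-memory reading

start = time.time()
output = schedule.step(args)
torch.cuda.synchronize(device)               # ensure GPU work is finished before stopping the timer
end = time.time()
peak_mem = torch.cuda.max_memory_allocated(device)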
On examples/cpu_init/gpt2_cpu_init.py, I couldn't repro the error, whether with 2 ranks or 4 ranks. Are you using the llama model with cpu init?
On memory consumption, it is expected to be high if you initialize the model on a real device. We are actively developing a technique to support creating the initial model on the meta device:
with torch.device("meta"):
model = Model()
pipe = pipeline(model, ...)
stage_mod = pipe.get_stage_module(stage_index)
stage_mod.load_state_dict(torch.load(PATH))
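One caveat worth noting (my assumption, not stated above): with this approach each rank should only materialize its own stage's share of the weights, so peak memory per rank should drop accordingly. Also, if stage_mod still holds meta tensors at the point of loading, load_state_dict may need assign=True (available in recent PyTorch) so the loaded real tensors replace the meta ones:

stage_mod.load_state_dict(torch.load(PATH), assign=True)   # assumption: swap meta tensors for the loaded real tensors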
Thanks for your reply! I've confirmed that the latency measured after several warm-up runs is normal.
> On examples/cpu_init/gpt2_cpu_init.py, I couldn't repro the error, whether with 2 ranks or 4 ranks. Are you using the llama model with cpu init?
Yes, since I noticed the issue, I've been using examples/llama/pippy_llama.py with the cpu_init method; however, it seems that some ops (maybe indexing?) do not work properly when the stage runs on a CUDA device while the pipeline was created on the CPU device.
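For reference, the error message suggests that some tensor involved in an indexing op (for example an embedding table or a position-ids buffer) is still on the CPU while the tensor indexing into it is on CUDA, or vice versa. A small check along these lines might help narrow down which tensor it is (a guess on my side, reusing stage_mod, inputs, and device from the snippets above):

# list every parameter/buffer of the stage module that is still on the CPU
for name, t in list(stage_mod.named_parameters()) + list(stage_mod.named_buffers()):
    if t.device.type == "cpu":
        print(f"{name} is still on CPU")

# make sure the inputs themselves are on the stage's device
inputs = {k: v.to(device) for k, v in inputs.items()}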