
Unexpected Memory Usage and Latency with PP

Lucius-THU opened this issue 10 months ago · 4 comments

When running the examples/llama/pippy_llama.py script on two A800 GPUs, each rank is observed to consume memory equal to the full model size, rather than splitting the weights across the two GPUs. Additionally, the latency differs from the expected values.

Rank: 0 Forward Latency: 0.7621021270751953s Peak memory: 26.341GiB
Rank: 1 Forward Latency: 0.8798844814300537s Peak memory: 26.204GiB

For comparison, when utilizing a single GPU, the performance metrics are as follows:

Rank: 0 Forward Latency: 0.45336079597473145s Peak memory: 26.252GiB

These results are measured with the following code:

import time
import torch

torch.cuda.reset_peak_memory_stats(device)
start = time.time()

# only the first stage receives the real input; the other ranks pass None
if rank == 0:
    args = inputs["input_ids"]
else:
    args = None
output = schedule.step(args)

end = time.time()
peak_mem = torch.cuda.max_memory_allocated(device)

When I try the initialization settings from examples/cpu_init/gpt2_cpu_init.py, a RuntimeError occurs when running the stage on the CUDA device after the pipeline was created on the CPU device:

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
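For what it's worth, this PyTorch error usually means an index tensor and the tensor being indexed live on different devices; a tiny standalone repro (not from the script, just illustrating the mismatch):

import torch

weight = torch.randn(10, 4)                    # indexed tensor stays on the CPU
idx = torch.tensor([1, 2, 3], device="cuda")   # indices live on the GPU
out = weight[idx]  # raises "indices should be either on cpu or on the same device as the indexed tensor (cpu)"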

So I wonder: is this kind of memory consumption and latency normal for PP?

PS: The issue persists across different versions of the software:

  1. the latest torchpippy installed from source and torch==2.4.0.dev.20240411
  2. torchpippy==0.2.0 from pip and torch==2.2.2

Lucius-THU · Apr 12 '24 06:04

Hi, on latency, if you measure the first iteration, it will include the distributed initialization time (e.g. NCCL communicator initialization). You can try giving it some warm-up runs and then measure the latency.
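For example, something along these lines (a rough sketch reusing the variables from your snippet; the warm-up count is arbitrary):

# a few warm-up iterations absorb one-time costs (NCCL communicator setup,
# CUDA context creation, allocator growth) before the measured run
for _ in range(3):
    schedule.step(args)
torch.cuda.synchronize(device)

torch.cuda.reset_peak_memory_stats(device)
start = time.time()
output = schedule.step(args)
torch.cuda.synchronize(device)  # wait for the GPU work to finish before stopping the clock
end = time.time()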

kwen2501 · Apr 12 '24 20:04

On examples/cpu_init/gpt2_cpu_init.py, I couldn't repro the error, whether with 2 ranks or 4 ranks. Are you using the llama model with cpu init?

kwen2501 · Apr 12 '24 20:04

On memory consumption, it is expected to be high if you initialize the model on a real device. We are actively developing techniques to support creating the initial model on the meta device:

with torch.device("meta"):
    model = Model()

pipe = pipeline(model, ...)
stage_mod = pipe.get_stage_module(stage_index)
stage_mod.load_state_dict(torch.load(PATH))
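With this approach, constructing the model allocates no real memory, and each rank only materializes the weights of its own stage when it loads that stage's state dict, so per-rank peak memory should come down to roughly the stage's share of the model plus activations.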

kwen2501 · Apr 12 '24 20:04

Thanks for your reply! I've confirmed that the latency measured after several warm-up runs is normal.

> On examples/cpu_init/gpt2_cpu_init.py, I couldn't repro the error, whether with 2 ranks or 4 ranks. Are you using the llama model with cpu init?

Yes, since I noticed the issue, I've been using examples/llama/pippy_llama.py with the cpu_init method; however, it seems that some ops (maybe index?) do not work properly when the stage on the CUDA device is created from a pipeline on the CPU device.

Lucius-THU · Apr 13 '24 00:04