lightning-thunder
Likely memory fragmentation for larger models
🐛 Bug
Running LLaMa2 13B with FSDP ZeRO2 on 8xH100
torchrun --nproc_per_node=8 --nnodes=1 benchmark_litgpt.py --model_name Llama-2-13b-hf --compile thunder_cudnn --distributed_mode fsdp --shard_mode zero2 --bucketing_mode none --micro_batch_size 1 --global_batch_size 8
Average iter time: 867.59 ms
The performance is worse than expected, and inspecting the timeline shows large stretches where the GPU is idle while many cudaMalloc and cudaFree operations are issued.
With expandable_segments enabled in PyTorch's caching allocator, performance improves significantly and the gaps in the timeline disappear.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --nproc_per_node=8 --nnodes=1 benchmark_litgpt.py --model_name=Llama-2-13b-hf --compile thunder_cudnn --distributed_mode=fsdp --shard_mode zero2 --bucketing_mode none --micro_batch_size 1 --global_batch_size 8
Average iter time: 729.44 ms
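For reference, the same allocator option can also be set from Python, as long as it happens before the first CUDA allocation. A minimal sketch (not part of benchmark_litgpt.py):
# Minimal sketch: enable expandable segments programmatically.
# The allocator reads PYTORCH_CUDA_ALLOC_CONF when it initializes, so this must
# run before the first CUDA allocation (simplest: before importing torch).
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402
x = torch.randn(1024, 1024, device="cuda")  # allocated from an expandable segment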
We should find out if this memory fragmentation is due to Thunder's interaction with the PyTorch memory allocator or something else.
@kshitij12345 - From our discussion, I remember you were looking into this. I believe this is what is causing the extra memory operations, but it needs further investigation.
cc @eqy re: fragmentation lunch discussion
Does TORCH_NCCL_AVOID_RECORD_STREAMS=1 help?
@eqy Yes, it does. Either of the two env variables gives the same performance benefit. Is it fair to call this a memory fragmentation issue, or do you think it is something else?
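For context, here is a small illustration (not the actual ProcessGroupNCCL code path) of the record_stream mechanism that this env var disables: the NCCL backend records collective inputs and outputs on the communication stream so the caching allocator does not reuse their memory until that stream catches up, and blocks held this way can delay reuse and contribute to fragmentation.
# Illustration only (requires a CUDA device): record_stream-based lifetime tracking.
import torch

s = torch.cuda.Stream()
x = torch.empty(1024, device="cuda")
with torch.cuda.stream(s):
    y = x * 2          # x is consumed by work enqueued on side stream s
x.record_stream(s)     # tell the allocator not to reuse x's block until s catches up
del x                  # the block is only reused once the recorded work on s completes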
As per an offline discussion with @ptrblck, we should enable TORCH_NCCL_AVOID_RECORD_STREAMS=1 by default in Thunder.
cc @IvanYashchuk @mruberry - Can we enable this env var by default in Thunder, or should we rely on the NVIDIA containers to enable it?
Ping @IvanYashchuk @ptrblck @eqy: any reason why we shouldn't enable this by default?
Let's turn this variable on by default for Thunder-generated functions.
Quoting Carilli from the PR description that added this env variable:
Because we're juggling razor blades here and it's hard to test, recordStream avoidance is off by default, and existing default (aka recordStream-based) behavior is unchanged.
Since it's hard to test, let's enable it in Thunder by default.
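A minimal sketch of what the default could look like on the Thunder side (hypothetical placement, not the actual change): default the variable at import time and let an explicit user setting win.
# Hypothetical sketch: default TORCH_NCCL_AVOID_RECORD_STREAMS to "1" unless the
# user has already set it. The flag is read when the NCCL process group is created,
# so this must run before torch.distributed.init_process_group().
import os
os.environ.setdefault("TORCH_NCCL_AVOID_RECORD_STREAMS", "1")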
Linking the relevant forum post here for curious readers: https://discuss.pytorch.org/t/cuda-allocation-lifetime-for-inputs-to-distributed-all-reduce/