OOM when training falcon-40b and vicuna-33b-v1.3 on 4 nodes
🐛 Bug
This might be related to an older OOM issue, but the models and the number of nodes are different, so I decided to create a separate issue.
We get an OOM error with --compile thunder, but torch.compile can run the models successfully.
To Reproduce
Please use:
4 nodes, each with 8 GPUs.
Image "INTERNAL_IMAGE:pjnl-20240930"
Training script:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
    --model_name falcon-40b \
    --distributed_mode fsdp \
    --shard_mode zero3 \
    --compile thunder \
    --checkpoint_activations False \
    --low_precision_mode none \
    --micro_batch_size 1
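The vicuna-33b-v1.3 case presumably reproduces with the same command and only the model name changed (assumption: the remaining flags match the falcon-40b run above):

python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
    --model_name vicuna-33b-v1.3 \
    --distributed_mode fsdp \
    --shard_mode zero3 \
    --compile thunder \
    --checkpoint_activations False \
    --low_precision_mode none \
    --micro_batch_size 1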
Expected behavior
The benchmarking script should run to completion without running out of memory.
Environment
system.device_product_name: DGXH100
system.gpu_driver_version: 535.129.03
libraries.cuda: 12.6.2.004
libraries.pip.lightning: 2.4.0.dev20240728
libraries.pip.lightning-thunder: 0.2.0.dev0
libraries.pip.lightning-utilities: 0.11.7
libraries.pip.litgpt: 0.4.11
libraries.pip.nvfuser: 0.2.13+git2cee59d
libraries.pip.pytorch-lightning: 2.4.0
libraries.pip.torch: 2.6.0a0+gitc4ae451
libraries.pip.torchmetrics: 1.4.2
libraries.pip.torchvision: 0.19.0a0+d23a6e1
falcon-40b seems to use parallel_residual=True, so I expect this model config to work once https://github.com/Lightning-AI/lightning-thunder/issues/1175 (and https://github.com/Lightning-AI/lightning-thunder/issues/246) are resolved.
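For reference, the parallel_residual setting can be confirmed directly from the litgpt config (a minimal sketch, assuming the litgpt version listed above exposes Config.from_name for this model name):

from litgpt import Config

# Inspect the falcon-40b config shipped with litgpt to confirm it enables
# parallel residual blocks, which is what links this report to the issues above.
cfg = Config.from_name("falcon-40b")
print(cfg.parallel_residual)  # expected to print True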
We still see OOM for these models. I'm not sure if this is relevant, but right now the OOM is preceded by
[ERROR | nvfuser ]: An error occurred while executing nvFuser FusionDefinition 8.
Yes, it's probably the same OOM error, just raised while nvFuser tries to allocate more memory than is available, i.e. in a slightly different place in the code.
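In case it helps with triage, here is a minimal sketch (a hypothetical wrapper around a training step, using only standard PyTorch APIs) to confirm that the nvFuser failure is just the CUDA caching allocator running out of memory:

import torch

def run_step_with_oom_report(step_fn, *args, **kwargs):
    # Hypothetical helper: execute one training step and, if a CUDA OOM escapes,
    # dump the standard allocator report (allocated vs. reserved memory,
    # fragmentation) before re-raising, so the failing run can be compared
    # against the torch.compile run's peak memory.
    try:
        return step_fn(*args, **kwargs)
    except torch.cuda.OutOfMemoryError:
        print(torch.cuda.memory_summary(device=torch.cuda.current_device()))
        raise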