OOM when training falcon-40b and vicuna-33b-v1.3 on 4 nodes
🐛 Bug
This might be related to an older OOM issue, but the models and the number of nodes are different, so I decided to create a separate issue.
We get an OOM error with --compile thunder, but torch.compile can run the models successfully.
To Reproduce
Please use:
4 nodes, each with 8 GPUs.
Image "INTERNAL_IMAGE:pjnl-20240930"
Training script:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
    --model_name falcon-40b \
    --distributed_mode fsdp \
    --shard_mode zero3 \
    --compile thunder \
    --checkpoint_activations False \
    --low_precision_mode none \
    --micro_batch_size 1
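The vicuna-33b-v1.3 case presumably reproduces with the same command and only the model name changed (assumption: the remaining flags match the falcon-40b run above):

python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
    --model_name vicuna-33b-v1.3 \
    --distributed_mode fsdp \
    --shard_mode zero3 \
    --compile thunder \
    --checkpoint_activations False \
    --low_precision_mode none \
    --micro_batch_size 1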
Expected behavior
The benchmarking script should run to completion without running out of memory.
Environment
system.device_product_name: DGXH100
system.gpu_driver_version: 535.129.03
libraries.cuda: 12.6.2.004
libraries.pip.lightning: 2.4.0.dev20240728
libraries.pip.lightning-thunder: 0.2.0.dev0
libraries.pip.lightning-utilities: 0.11.7
libraries.pip.litgpt: 0.4.11
libraries.pip.nvfuser: 0.2.13+git2cee59d
libraries.pip.pytorch-lightning: 2.4.0
libraries.pip.torch: 2.6.0a0+gitc4ae451
libraries.pip.torchmetrics: 1.4.2
libraries.pip.torchvision: 0.19.0a0+d23a6e1
falcon-40b seems to use parallel_residual=True, so I expect this model config to work once https://github.com/Lightning-AI/lightning-thunder/issues/1175 (and https://github.com/Lightning-AI/lightning-thunder/issues/246) are resolved.
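For reference, the parallel_residual setting can be confirmed directly from the litgpt config (a minimal sketch, assuming the litgpt version listed above exposes Config.from_name for this model name):

from litgpt import Config

# Inspect the falcon-40b config shipped with litgpt to confirm it enables
# parallel residual blocks, which is what links this report to the issues above.
cfg = Config.from_name("falcon-40b")
print(cfg.parallel_residual)  # expected to print True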
We still see OOM for these models. I'm not sure if this is relevant, but right now the OOM is preceded by
[ERROR | nvfuser ]: An error occurred while executing nvFuser FusionDefinition 8.
Yes, it's probably the same OOM error, just raised while nvFuser tries to allocate more memory than is available, i.e. in a slightly different place in the code.
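In case it helps with triage, here is a minimal sketch (a hypothetical wrapper around a training step, using only standard PyTorch APIs) to confirm that the nvFuser failure is just the CUDA caching allocator running out of memory:

import torch

def run_step_with_oom_report(step_fn, *args, **kwargs):
    # Hypothetical helper: execute one training step and, if a CUDA OOM escapes,
    # dump the standard allocator report (allocated vs. reserved memory,
    # fragmentation) before re-raising, so the failing run can be compared
    # against the torch.compile run's peak memory.
    try:
        return step_fn(*args, **kwargs)
    except torch.cuda.OutOfMemoryError:
        print(torch.cuda.memory_summary(device=torch.cuda.current_device()))
        raise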