
OOM for training on 4 nodes for falcon-40b and vicuna-33b-v1.3

Open mpatel31415 opened this issue 1 year ago • 1 comments

🐛 Bug

This might be related to an older OOM issue, but the models and the number of nodes are different, so I decided to create a separate one.

We get an OOM error with Thunder, but torch.compile can run the same models successfully.

To Reproduce

Please use: 4 node(s), each with 8 GPUs.
Image: "INTERNAL_IMAGE:pjnl-20240930"
Training script:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
  --model_name falcon-40b \
  --distributed_mode fsdp \
  --shard_mode zero3 \
  --compile thunder \
  --checkpoint_activations False \
  --low_precision_mode none \
  --micro_batch_size 1
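
The title names two models, while the reproduction command shows only falcon-40b. A minimal sketch of generating the same command line per model is given below. It assumes that the vicuna-33b-v1.3 run uses identical flags, which the issue does not state; `build_command` and `FLAGS` are hypothetical helper names, not part of the benchmark script.

```python
import shlex

# Script path copied from the reproduction command.
SCRIPT = "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py"

# Flags copied verbatim from the reproduction command (insertion order kept).
FLAGS = {
    "distributed_mode": "fsdp",
    "shard_mode": "zero3",
    "compile": "thunder",
    "checkpoint_activations": "False",
    "low_precision_mode": "none",
    "micro_batch_size": "1",
}


def build_command(model_name, compile_backend="thunder"):
    """Return the argv list for one benchmark run (hypothetical helper).

    Only the value "thunder" for --compile appears in this issue; any other
    value passed as compile_backend is the caller's assumption.
    """
    args = ["python", SCRIPT, "--model_name", model_name]
    flags = dict(FLAGS, compile=compile_backend)
    for name, value in flags.items():
        args += [f"--{name}", value]
    return args


# Assumption: both models from the title are run with the same flags.
for model in ("falcon-40b", "vicuna-33b-v1.3"):
    print(shlex.join(build_command(model)))
```

This only prints the per-process command; launching it across 4 nodes is left to whatever launcher the internal setup uses.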

Expected behavior

We should be able to run the benchmarking script.

Environment

system.device_product_name: DGXH100
system.gpu_driver_version: 535.129.03
libraries.cuda: 12.6.2.004
libraries.pip.lightning: 2.4.0.dev20240728
libraries.pip.lightning-thunder: 0.2.0.dev0
libraries.pip.lightning-utilities: 0.11.7
libraries.pip.litgpt: 0.4.11
libraries.pip.nvfuser: 0.2.13+git2cee59d
libraries.pip.pytorch-lightning: 2.4.0
libraries.pip.torch: 2.6.0a0+gitc4ae451
libraries.pip.torchmetrics: 1.4.2
libraries.pip.torchvision: 0.19.0a0+d23a6e1

mpatel31415 avatar Oct 01 '24 09:10 mpatel31415

falcon-40b seems to use parallel_residual=True, so I expect this model config to work once https://github.com/Lightning-AI/lightning-thunder/issues/1175 (and https://github.com/Lightning-AI/lightning-thunder/issues/246) are resolved.

IvanYashchuk avatar Oct 02 '24 17:10 IvanYashchuk

We still see OOM for these models. I'm not sure if this is relevant, but right now the OOM is preceded by

[ERROR | nvfuser ]: An error occurred while executing nvFuser FusionDefinition 8.

mpatel31415 avatar Oct 28 '24 16:10 mpatel31415

Yes, it's probably the same OOM error. It now surfaces in a slightly different place in the code, when nvFuser tries to allocate more memory than is available.

IvanYashchuk avatar Oct 30 '24 09:10 IvanYashchuk
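
For context on the memory budget, here is a rough back-of-envelope sketch of the per-GPU footprint of the fully sharded (zero3) model state on 4 nodes with 8 GPUs each. It is only an estimate under stated assumptions, none of which come from the thread: a nominal 40e9 parameters for falcon-40b, and 16 bytes per parameter (fp32 weights, fp32 gradients, and two fp32 Adam-style optimizer states). `sharded_state_gb` is a hypothetical helper name.

```python
def sharded_state_gb(n_params, world_size, bytes_per_param=16):
    """Per-GPU size (in GB, 1e9 bytes) of evenly sharded model state.

    Assumption: 16 bytes/param = 4 (weights) + 4 (gradients) + 8 (two
    optimizer states), all in fp32. Activations and temporarily gathered
    (unsharded) parameters are NOT included.
    """
    return n_params * bytes_per_param / world_size / 1e9


world_size = 4 * 8  # 4 nodes, 8 GPUs each, as in the reproduction setup
estimate = sharded_state_gb(40e9, world_size)  # nominal parameter count
print(f"sharded model state per GPU: {estimate:.1f} GB")
```

Under these assumptions the sharded state comes to roughly 20 GB per GPU. The estimate leaves out activations (checkpointing is disabled in the reproduction command) and any buffers that hold gathered parameters, so it says nothing definite about where the remaining memory goes.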