[Regressions] ThunderFX is slower than 2 weeks ago for 2 models
🐛 Bug
Recently found regressions:
To Reproduce
All parameters to benchmark_litgpt.py are visible in the attached image.
Environment
Tested on pjnl-20241122 (as in the Latest image date in the screenshot).
system.device_product_name: DGXH100
system.gpu_driver_version: 535.129.03
libraries.cuda: 12.6.3.001
libraries.pip.lightning: 2.4.0.dev20240728
libraries.pip.lightning-thunder: 0.2.0.dev0
libraries.pip.lightning-utilities: 0.11.9
libraries.pip.litgpt: 0.4.11
libraries.pip.nvfuser: 0.2.23+gitb5e5182
libraries.pip.pytorch-lightning: 2.4.0
libraries.pip.torch: 2.6.0a0+gitecf3bae
libraries.pip.torchao: 0.6.1
libraries.pip.torchmetrics: 1.6.0
libraries.pip.torchvision: 0.19.0a0+d23a6e1
To my mind, if the batch size needed to be lowered, this is fundamentally a "memory use" regression and not a "compute perf" one.
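One rough way to tell the two cases apart when comparing an older and a newer run is sketched below. It assumes each run can be summarized by the micro batch size that fit in memory and the measured throughput. The `Run` record, its field names, and the 2% tolerance are hypothetical and are not the output format of benchmark_litgpt.py.

```python
from dataclasses import dataclass


@dataclass
class Run:
    # Hypothetical summary of one benchmark run (illustrative field names).
    micro_batch_size: int   # largest micro batch size that fit in memory
    tokens_per_sec: float   # measured training throughput


def classify_regression(old: Run, new: Run, tol: float = 0.02) -> str:
    """Return "memory", "compute", or "none" for a pair of runs.

    Assumption: if the batch size had to be lowered, the throughput drop
    is attributed to memory use first, because the two throughput numbers
    were not measured at the same batch size and are not comparable.
    """
    if new.micro_batch_size < old.micro_batch_size:
        return "memory"
    # Same (or larger) batch size: a throughput drop beyond the tolerance
    # points at compute performance instead.
    if new.tokens_per_sec < old.tokens_per_sec * (1.0 - tol):
        return "compute"
    return "none"


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measurements.
    print(classify_regression(Run(2, 100.0), Run(1, 80.0)))
    print(classify_regression(Run(2, 100.0), Run(2, 90.0)))
```

If a run is classified as "memory", rerunning the older image at the lowered batch size would give throughput numbers that can be compared directly.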