lightning-thunder [Regressions] ThunderFX is slower than 2 weeks ago for 2 models

🐛 Bug

Recently found regressions: [image: Screenshot 2024-11-27 at 10.03.14]

To Reproduce

All parameters to benchmark_litgpt.py are visible in the attached image.

Environment

Tested on pjnl-20241122 (the "Latest image date" shown in the screenshot).

system.device_product_name: DGXH100
system.gpu_driver_version: 535.129.03
libraries.cuda: 12.6.3.001
libraries.pip.lightning: 2.4.0.dev20240728
libraries.pip.lightning-thunder: 0.2.0.dev0
libraries.pip.lightning-utilities: 0.11.9
libraries.pip.litgpt: 0.4.11
libraries.pip.nvfuser: 0.2.23+gitb5e5182
libraries.pip.pytorch-lightning: 2.4.0
libraries.pip.torch: 2.6.0a0+gitecf3bae
libraries.pip.torchao: 0.6.1
libraries.pip.torchmetrics: 1.6.0
libraries.pip.torchvision: 0.19.0a0+d23a6e1

Nov 27 '24 09:11 wprazuch

To my mind, if the batch size had to be lowered, this is fundamentally a "memory use" regression rather than a "compute perf" one.

Dec 02 '24 19:12 t-vi

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Apr 16 '25 05:04 stale[bot]