Thunder and ThunderFX are slower than torch.compile for FP8 training of falcon-7b and other models
🐛 Bug
As can be seen below, Thunder is slower than torch.compile for single-GPU FP8 training of falcon-7b:
Below are results for ThunderFX for multi-GPU training:
The batch sizes and sharding modes don't match, but these are the fastest options for ThunderFX (a hedged example launch command is sketched after this list):
- For the first row, with the same micro batch size as torch.compile (6), throughput is even lower: 46.19.
- For the second row, with micro batch size 7 and sharding mode zero3, throughput is 93.5.
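For reference, a multi-GPU run of this kind would typically be launched with torchrun against the same benchmark script. The invocation below is only a sketch: the --compile value for ThunderFX, --distributed_mode, and --shard_mode are assumptions based on the script's usual options, not values taken from this report.

# Hypothetical multi-GPU ThunderFX run; --compile dynamo_thunder (ThunderFX),
# --distributed_mode fsdp, and --shard_mode zero3 are assumed flag values
torchrun --nproc_per_node=8 /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name falcon-7b \
--compile dynamo_thunder \
--low_precision_mode fp8-delayed-te \
--distributed_mode fsdp \
--shard_mode zero3 \
--micro_batch_size 7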
To Reproduce
Steps to reproduce the behavior:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name falcon-7b \
--compile thunder \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 1
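For comparison, the torch.compile baseline can be produced with the same script by switching the --compile option; the value below ("inductor") is an assumption about the script's option name, not taken from this report.

# Hypothetical torch.compile baseline; the --compile value is an assumption
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name falcon-7b \
--compile inductor \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 1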
Expected behavior
Thunder should be as fast as torch.compile.
Environment
system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.6.2.004
libraries.pip.lightning 2.4.0.dev20240728
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.8
libraries.pip.litgpt 0.4.11
libraries.pip.nvfuser 0.2.20+git85c22a2
libraries.pip.pytorch-lightning 2.4.0
libraries.pip.torch 2.6.0a0+git96b30dc
libraries.pip.torchmetrics 1.5.1
libraries.pip.torchvision 0.19.0a0+d23a6e1
Actually, we see the same results for other models. Is one issue enough to track all of them? Below are the results:
> Is one issue enough to track all of them?
Given the scale and universality, yeah, I think one issue is enough for all models.