Thunder and ThunderFX are slower than torch.compile for FP8 training of falcon-7b and other models
🐛 Bug
As can be seen below, Thunder is slower than torch.compile for single-GPU FP8 training of falcon-7b:
Below are results for ThunderFX for multi-GPU training:
The batch sizes and sharding modes don't match, but these are the fastest options for ThunderFX (a hedged example launch command is sketched after this list):
- For the first row, with the same micro batch size as torch.compile (6), throughput is even lower: 46.19.
- For the second row, with micro batch size 7 and sharding mode zero3, throughput is 93.5.
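For reference, a multi-GPU run of this kind would typically be launched with torchrun against the same benchmark script. The invocation below is only a sketch: the --compile value for ThunderFX, --distributed_mode, and --shard_mode are assumptions based on the script's usual options, not values taken from this report.

# Hypothetical multi-GPU ThunderFX run; --compile dynamo_thunder (ThunderFX),
# --distributed_mode fsdp, and --shard_mode zero3 are assumed flag values
torchrun --nproc_per_node=8 /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name falcon-7b \
--compile dynamo_thunder \
--low_precision_mode fp8-delayed-te \
--distributed_mode fsdp \
--shard_mode zero3 \
--micro_batch_size 7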
To Reproduce
Steps to reproduce the behavior:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name falcon-7b \
--compile thunder \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 1
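For comparison, the torch.compile baseline can be produced with the same script by switching the --compile option; the value below ("inductor") is an assumption about the script's option name, not taken from this report.

# Hypothetical torch.compile baseline; the --compile value is an assumption
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name falcon-7b \
--compile inductor \
--low_precision_mode fp8-delayed-te \
--micro_batch_size 1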
Expected behavior
Thunder should be as fast as torch.compile.
Environment
system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.6.2.004
libraries.pip.lightning 2.4.0.dev20240728
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.8
libraries.pip.litgpt 0.4.11
libraries.pip.nvfuser 0.2.20+git85c22a2
libraries.pip.pytorch-lightning 2.4.0
libraries.pip.torch 2.6.0a0+git96b30dc
libraries.pip.torchmetrics 1.5.1
libraries.pip.torchvision 0.19.0a0+d23a6e1
Actually, we see the same results for other models. Is one issue enough to track all of them? Below are the results:
> Is one issue enough to track all of them?
Given the scale and universality, yeah, I think one issue is enough for all models.