[benchmark] add option to use torchao's float8 with dynamic scaling and FSDP2
This adds an option to use torchao's float8 with dynamic scaling. An example command using float8 with FSDP2:
```sh
torchrun --nproc-per-node 8 --local-ranks-filter 0 --role rank --tee 3 \
  thunder/benchmarks/benchmark_litgpt.py \
  --model_name Llama-2-7b-hf \
  --compile inductor \
  --distributed_mode fsdp2 \
  --shard_mode zero2 \
  --use_torchao_fp8_linear true \
  --use_torchao_fp8_allgather true \
  --use_torchao_fp8_precompute_scale_for_fsdp true
```
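For reference, here is a minimal sketch of what the three `--use_torchao_fp8_*` flags map to on the torchao side, assuming a recent torchao release; the toy model and training step below are illustrative, not the benchmark's actual code:

```python
import torch
import torch.nn as nn
from torchao.float8 import (
    Float8LinearConfig,
    convert_to_float8_training,
    precompute_float8_dynamic_scale_for_fsdp,
)

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).to("cuda")

# --use_torchao_fp8_linear: swap nn.Linear modules for Float8Linear with
# dynamic scaling. --use_torchao_fp8_allgather: keep FSDP2's parameter
# all-gathers in float8 instead of the high-precision dtype.
config = Float8LinearConfig(enable_fsdp_float8_all_gather=True)
convert_to_float8_training(model, config=config)

# FSDP2 wrapping (fully_shard) would go here in the real benchmark.

optimizer = torch.optim.AdamW(model.parameters())
loss = model(torch.randn(8, 4096, device="cuda")).sum()
loss.backward()
optimizer.step()

# --use_torchao_fp8_precompute_scale_for_fsdp: after the optimizer step,
# compute the float8 scales for the next all-gather for all float8
# parameters at once (a no-op unless the model is wrapped with FSDP2).
precompute_float8_dynamic_scale_for_fsdp(model)
```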
- [x] update https://github.com/Lightning-AI/lightning-thunder/blob/d425fe46911753f1a96cf080a2becedb86885d2d/thunder/benchmarks/benchmark_litgpt.py#L635-L661 to report which fp8 options are enabled (see the sketch below)
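A hedged sketch of that printout extension; the `self.*` attribute names below are assumed from the CLI flags and may differ from the benchmark's actual internals:

```python
# Hypothetical addition to the config summary in benchmark_litgpt.py;
# attribute names mirror the CLI flags but are assumptions.
print(f"Use torchao fp8 linear: {self.use_torchao_fp8_linear}")
print(f"Use torchao fp8 all-gather: {self.use_torchao_fp8_allgather}")
print(f"Use torchao fp8 precompute scale for FSDP: {self.use_torchao_fp8_precompute_scale_for_fsdp}")
```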
Llama-2-7b-hf on 8 H100 GPUs, using pjnl-20240819, as of https://github.com/Lightning-AI/lightning-thunder/pull/997/commits/5e5cf387b9d0edd94f5a39c4a7095d324ec9d4f7. When the compiler is torch, FSDP2 is used.
| compiler | executors | batch size | Tokens/s/GPU | Memory Used (GB) |
|---|---|---|---|---|
| torch | fp8: linear, all-gather, & precompute | 1 | 14322.12 | 39.44 |
| torch | fp8: linear, all-gather, & precompute | 2 | 17846.32 | 54.87 |
| torch | fp8: linear & all-gather | 1 | 14576.89 | 34.26 |
| torch | fp8: linear & all-gather | 2 | 18035.65 | 49.70 |
| torch | fp8: linear | 1 | 13869.85 | 40.64 |
| torch | fp8: linear | 2 | 17335.19 | 56.12 |
| torch | thunder_inductor_cat_cudnn_dynamo | 1 | 12579.53 | 40.21 |
| torch | thunder_inductor_cat_cudnn_dynamo | 2 | 13966.41 | 61.82 |
| thunder | inductor_cat_cudnn_transformerengine | 1 | 14265.98 | 52.95 |
| thunder | inductor_cat_cudnn_transformerengine | 2 | 15627.38 | 74.04 |