
[benchmark] add option to use torchao's float8 with dynamic scaling with fsdp2

Open · crcrpar opened this issue · 0 comments

This adds an option to use torchao's float8 (with dynamic scaling). An example command to use float8 with FSDP2:

torchrun --nproc-per-node 8 --local-ranks-filter 0 --role rank --tee 3 thunder/benchmarks/benchmark_litgpt.py --model_name Llama-2-7b-hf --compile inductor --distributed_mode fsdp2 --shard_mode zero2 --use_torchao_fp8_linear true --use_torchao_fp8_allgather true --use_torchao_fp8_precompute_scale_for_fsdp true
  - [x] update https://github.com/Lightning-AI/lightning-thunder/blob/d425fe46911753f1a96cf080a2becedb86885d2d/thunder/benchmarks/benchmark_litgpt.py#L635-L661 to report which fp8 options are used
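As a rough illustration, the three new CLI flags could be translated into a single torchao float8 configuration before model conversion. The helper below is a hedged sketch, not the actual `benchmark_litgpt.py` code; the torchao calls named in the comments (`convert_to_float8_training`, `Float8LinearConfig`, `precompute_float8_dynamic_scale_for_fsdp`) are from `torchao.float8` as of mid-2024 and may differ across versions.

```python
# Illustrative sketch only: map the benchmark's fp8 CLI flags onto a
# torchao float8 configuration dict. The function name and the flag
# dependencies are assumptions, not the benchmark's implementation.

def torchao_fp8_settings(use_fp8_linear: bool,
                         use_fp8_allgather: bool,
                         use_precompute_scale: bool) -> dict:
    """Translate the three --use_torchao_fp8_* flags into one config."""
    if not use_fp8_linear and (use_fp8_allgather or use_precompute_scale):
        # fp8 all-gather and scale precompute only apply once the
        # nn.Linear modules themselves run in fp8
        raise ValueError("fp8 all-gather/precompute require fp8 linear")
    return {
        # swap nn.Linear -> Float8Linear with dynamic scaling, e.g. via
        # torchao.float8.convert_to_float8_training(model, config=...)
        "convert_linear": use_fp8_linear,
        # Float8LinearConfig(enable_fsdp_float8_all_gather=True) makes
        # FSDP2 all-gather sharded weights in fp8 instead of bf16
        "enable_fsdp_float8_all_gather": use_fp8_allgather,
        # call torchao.float8.precompute_float8_dynamic_scale_for_fsdp(
        # model) after each optimizer step to batch scale computation
        "precompute_scale_for_fsdp": use_precompute_scale,
    }

# example: the full command line above enables all three options
print(torchao_fp8_settings(True, True, True))
```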

Llama-2-7b-hf on 8× H100, using pjnl-20240819

as of https://github.com/Lightning-AI/lightning-thunder/pull/997/commits/5e5cf387b9d0edd94f5a39c4a7095d324ec9d4f7

When the compiler is torch, FSDP2 is used.

| compiler | executors | bs | Tokens/s/GPU | Memory Used (GB) |
|---|---|---|---|---|
| torch | fp8: linear, all-gather, & precompute | 1 | 14322.12 | 39.44 |
| torch | fp8: linear, all-gather, & precompute | 2 | 17846.32 | 54.87 |
| torch | fp8: linear & all-gather | 1 | 14576.89 | 34.26 |
| torch | fp8: linear & all-gather | 2 | 18035.65 | 49.70 |
| torch | fp8: linear | 1 | 13869.85 | 40.64 |
| torch | fp8: linear | 2 | 17335.19 | 56.12 |
| torch | thunder_inductor_cat_cudnn_dynamo | 1 | 12579.53 | 40.21 |
| torch | thunder_inductor_cat_cudnn_dynamo | 2 | 13966.41 | 61.82 |
| thunder | inductor_cat_cudnn_transformerengine | 1 | 14265.98 | 52.95 |
| thunder | inductor_cat_cudnn_transformerengine | 2 | 15627.38 | 74.04 |

crcrpar · Aug 19 '24 13:08