
[benchmark] add option to use torchao's float8 with dynamic scaling with fsdp2

Open · crcrpar opened this issue · 0 comments

This adds an option to use torchao's float8 (with dynamic scaling). An example command to use float8 with FSDP2:

torchrun --nproc-per-node 8 --local-ranks-filter 0 --role rank --tee 3 thunder/benchmarks/benchmark_litgpt.py --model_name Llama-2-7b-hf --compile inductor --distributed_mode fsdp2 --shard_mode zero2 --use_torchao_fp8_linear true --use_torchao_fp8_allgather true --use_torchao_fp8_precompute_scale_for_fsdp true
  - [x] update https://github.com/Lightning-AI/lightning-thunder/blob/d425fe46911753f1a96cf080a2becedb86885d2d/thunder/benchmarks/benchmark_litgpt.py#L635-L661 to report which fp8 options are used
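As a rough illustration, the three new CLI flags could be translated into a single torchao float8 configuration before model conversion. The helper below is a hedged sketch, not the actual `benchmark_litgpt.py` code; the torchao calls named in the comments (`convert_to_float8_training`, `Float8LinearConfig`, `precompute_float8_dynamic_scale_for_fsdp`) are from `torchao.float8` as of mid-2024 and may differ across versions.

```python
# Illustrative sketch only: map the benchmark's fp8 CLI flags onto a
# torchao float8 configuration dict. The function name and the flag
# dependencies are assumptions, not the benchmark's implementation.

def torchao_fp8_settings(use_fp8_linear: bool,
                         use_fp8_allgather: bool,
                         use_precompute_scale: bool) -> dict:
    """Translate the three --use_torchao_fp8_* flags into one config."""
    if not use_fp8_linear and (use_fp8_allgather or use_precompute_scale):
        # fp8 all-gather and scale precompute only apply once the
        # nn.Linear modules themselves run in fp8
        raise ValueError("fp8 all-gather/precompute require fp8 linear")
    return {
        # swap nn.Linear -> Float8Linear with dynamic scaling, e.g. via
        # torchao.float8.convert_to_float8_training(model, config=...)
        "convert_linear": use_fp8_linear,
        # Float8LinearConfig(enable_fsdp_float8_all_gather=True) makes
        # FSDP2 all-gather sharded weights in fp8 instead of bf16
        "enable_fsdp_float8_all_gather": use_fp8_allgather,
        # call torchao.float8.precompute_float8_dynamic_scale_for_fsdp(
        # model) after each optimizer step to batch scale computation
        "precompute_scale_for_fsdp": use_precompute_scale,
    }

# example: the full command line above enables all three options
print(torchao_fp8_settings(True, True, True))
```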

Llama-2-7b-hf on 8× H100, using pjnl-20240819

as of https://github.com/Lightning-AI/lightning-thunder/pull/997/commits/5e5cf387b9d0edd94f5a39c4a7095d324ec9d4f7

When the compiler is torch, FSDP2 is used.

| compiler | executors | bs | Tokens/s/GPU | Memory Used (GB) |
|---|---|---|---|---|
| torch | fp8: linear, all-gather, & precompute | 1 | 14322.12 | 39.44 |
| torch | fp8: linear, all-gather, & precompute | 2 | 17846.32 | 54.87 |
| torch | fp8: linear & all-gather | 1 | 14576.89 | 34.26 |
| torch | fp8: linear & all-gather | 2 | 18035.65 | 49.70 |
| torch | fp8: linear | 1 | 13869.85 | 40.64 |
| torch | fp8: linear | 2 | 17335.19 | 56.12 |
| torch | thunder_inductor_cat_cudnn_dynamo | 1 | 12579.53 | 40.21 |
| torch | thunder_inductor_cat_cudnn_dynamo | 2 | 13966.41 | 61.82 |
| thunder | inductor_cat_cudnn_transformerengine | 1 | 14265.98 | 52.95 |
| thunder | inductor_cat_cudnn_transformerengine | 2 | 15627.38 | 74.04 |

crcrpar · Aug 19 '24 13:08