[Benchmark] HF Trainer optimizers (Mar-2023)
This is a rerun of the earlier "Adam torch vs. apex vs. HF vs. adafactor on RTX-3090 and A100" benchmark, but with BNB's 8-bit Adam optimizer added, and the software has probably improved/changed in the 14 months since as well.
Note that this time it was run on a desktop PCIe 80GB A100, so not the same hardware as the previous benchmark, which used an SXM 40GB A100.
I'm using the HF Trainer benchmarking tool that I wrote specifically to make such benchmarks trivial to run and to automatically produce report tables.
So I'm running:
```
CUDA_VISIBLE_DEVICES=0 python scripts/benchmark/trainer-benchmark.py --base-cmd ' \
examples/pytorch/translation/run_translation.py --model_name_or_path t5-base --output_dir output_dir \
--do_train --label_smoothing 0.1 --logging_strategy no --save_strategy no --per_device_train_batch_size 32 \
--max_source_length 512 --max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 20000 --dataloader_num_workers 2 \
' --target-metric-key train_samples_per_second --repeat-times 1 --variations '--optim adamw_torch|--optim adamw_bnb_8bit|--optim adamw_hf|--optim adafactor|--optim adamw_apex_fused' --report-metric-keys train_loss --base-variation '--optim adamw_torch'
```
You can see that I'm telling the tool to compare 5 optimizers: adamw_torch, adamw_bnb_8bit, adamw_hf, adafactor, adamw_apex_fused.
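For reference, the same optimizer selection can also be made programmatically rather than via the CLI flag; here's a minimal sketch, assuming a model and dataset are already set up elsewhere:

```python
from transformers import TrainingArguments

# The string values accepted by `optim` are the same ones the --optim CLI flag takes.
args = TrainingArguments(
    output_dir="output_dir",
    per_device_train_batch_size=32,
    num_train_epochs=1,
    optim="adamw_bnb_8bit",  # or: adamw_torch, adamw_hf, adafactor, adamw_apex_fused
)
```

These args are then handed to `Trainer` as usual; the benchmark tool essentially re-runs the same training command once per `--variations` entry.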
Memory usage-wise, the optimizer state per parameter is:
- 2 bytes: adamw_bnb_8bit
- 4 bytes: adafactor
- 8 bytes: adamw_torch, adamw_hf, adamw_apex_fused
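To put those per-parameter numbers into perspective, here's a back-of-the-envelope sizing of the optimizer state for a t5-base-sized model (the ~220M parameter count is my approximation; model weights and gradients are not included):

```python
n_params = 220_000_000  # rough t5-base parameter count (approximation)

# optimizer-state bytes per parameter, as listed above
state_bytes = {
    "adamw_torch / adamw_hf / adamw_apex_fused": 8,  # two fp32 moments
    "adafactor": 4,
    "adamw_bnb_8bit": 2,  # two int8 moments
}

for name, bpp in state_bytes.items():
    print(f"{name:45s} ~{n_params * bpp / 2**30:.2f} GB")
```

So on this model the 8-bit optimizer saves on the order of 1.2GB of GPU memory compared to the full AdamW variants.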
*** Setup
When publishing benchmarks it's crucial to log the versions of the software used to run them, so here we go:
Datetime : 2023-03-10 20:55:38
Software:
transformers: 4.27.0.dev0
torch : 1.13.1
cuda : 11.7
python : 3.8.15
Hardware:
1 GPUs : NVIDIA A100 80GB PCIe, 79.21GB
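If you want to capture the same software/hardware fingerprint yourself, here's a small sketch using standard torch/transformers introspection (this is not the exact code the benchmark tool runs):

```python
import platform

import torch
import transformers

print(f"transformers: {transformers.__version__}")
print(f"torch       : {torch.__version__}")
print(f"cuda        : {torch.version.cuda}")
print(f"python      : {platform.python_version()}")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"gpu         : {props.name}, {props.total_memory / 2**30:.2f}GB")
```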
*** Results
Last year's benchmark showed that the speed-up percentages were about the same across fp16/bf16/fp32. Let's see what this year brings, plus how the new optimizer fares.
FP32
| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 102.77 | 0 | 2.21 |
| --optim adamw_bnb_8bit | 104.99 | 2 | 2.15 |
| --optim adamw_hf | 103.64 | 1 | 2.21 |
| --optim adafactor | 97.22 | -5 | 2.21 |
| --optim adamw_apex_fused | 106.12 | 3 | 2.21 |
Observations:
- The results are very different from the previous year's benchmark. While Adafactor is still the slowest, the rest are pretty close to each other.
- Very surprisingly, the quantized 8-bit BNB Adam optimizer is faster than PyTorch's 8-bytes-per-parameter Adam optimizer, while using only a quarter of the latter's optimizer-state memory! And its loss is even better!
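Outside of the HF Trainer, the BNB optimizer can also be used directly via the bitsandbytes library; a minimal sketch with a stand-in nn.Linear model (when you pass --optim adamw_bnb_8bit the Trainer does the equivalent wiring for you):

```python
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(1024, 1024).cuda()  # stand-in model; bnb's 8-bit optimizers want CUDA params

# 8-bit Adam stores its two momentum states in int8 (~2 bytes/param)
# instead of fp32 (~8 bytes/param)
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
```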
BF16
(added --bf16 to the base command line)
| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 323.18 | 0 | 2.22 |
| --optim adamw_bnb_8bit | 348.29 | 8 | 2.16 |
| --optim adamw_hf | 333.07 | 3 | 2.22 |
| --optim adafactor | 274.36 | -15 | 2.22 |
| --optim adamw_apex_fused | 359.46 | 11 | 2.22 |
Observations:
- Again BNB beats every other optimizer at loss, while being only second to apex in speed.
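As a reminder, the Diff % column is just the relative change of train_samples_per_second vs. the base variation passed via --base-variation (--optim adamw_torch), rounded to a whole percent; checking the bf16 bnb row by hand:

```python
base = 323.18  # adamw_torch, bf16
bnb  = 348.29  # adamw_bnb_8bit, bf16

print(round(100 * (bnb - base) / base))  # -> 8, matching the table
```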
FP16
(added --fp16 to the base command line)
| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 370.09 | 0 | 2.55 |
| --optim adamw_bnb_8bit | 383.21 | 4 | 2.45 |
| --optim adamw_hf | 373.66 | 1 | 2.55 |
| --optim adafactor | 356.84 | -4 | 2.53 |
| --optim adamw_apex_fused | 380.50 | 3 | 2.55 |
Observations:
- Here BNB even managed to beat apex. But since I ran each variation only once, re-running multiple times might show a slightly different outcome.
- Somehow BF16 appears to be slower than fp16, but it gives a much better loss (the same loss as fp32). I wonder why?!
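One likely factor, offered as a hypothesis rather than something this benchmark verifies: bf16 keeps fp32's exponent range but has fewer mantissa bits, while fp16 has finer precision but a much smaller range, so fp16 needs loss scaling and overflows more easily. torch.finfo shows the tradeoff:

```python
import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    fi = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={fi.max:.3e}  eps={fi.eps:.3e}")

# approximate output:
# torch.float32   max≈3.4e+38  eps≈1.2e-07
# torch.bfloat16  max≈3.4e+38  eps≈7.8e-03
# torch.float16   max≈6.6e+04  eps≈9.8e-04
```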
new addition! --optim adamw_torch_fused
edit: we have since added --optim adamw_torch_fused to the HF Trainer, which runs almost as fast as --optim adamw_apex_fused. This option requires torch>=2.0 for fp32 and bf16, and torch>2.0 for fp16 (some fp16-related bugs in torch==2.0 were fixed only in later versions). E.g. here is the fp16 comparison:
| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch_fused | 387.10 | 3 | 2.66 |
| --optim adamw_torch | 377.61 | 0 | 2.66 |
| --optim adamw_apex_fused | 389.49 | 3 | 2.66 |
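For reference, --optim adamw_torch_fused selects PyTorch's own fused AdamW kernel; a minimal sketch of using it directly, assuming torch>=2.0 and a CUDA-resident stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()  # fused AdamW requires CUDA parameters

# fused=True picks the single-kernel CUDA implementation of AdamW
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)
```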
Could you add some Lion benchmarks?
It's not in the HF Trainer's arsenal of optimizers; if you'd like to make a PR to integrate it, then it can be done.