[Benchmark] HF Trainer optimizers (Mar-2023)
This is a rerun of the earlier "Adam torch vs. apex vs. HF vs. adafactor on RTX-3090 and A100" benchmark, but with BNB's 8-bit Adam optimizer added, and the software has probably improved/changed in the 14 months since as well.
Note that this time it was run on a desktop PCIe 80GB A100, so not the same hardware as the previous benchmark, which used an SXM 40GB A100.
I'm using the HF Trainer benchmarking tool that I wrote specifically to make such benchmarks trivial to run and to automatically produce report tables.
So I'm running:
```
CUDA_VISIBLE_DEVICES=0 python scripts/benchmark/trainer-benchmark.py --base-cmd ' \
examples/pytorch/translation/run_translation.py --model_name_or_path t5-base --output_dir output_dir \
--do_train --label_smoothing 0.1 --logging_strategy no --save_strategy no --per_device_train_batch_size 32 \
--max_source_length 512 --max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 20000 --dataloader_num_workers 2 \
' --target-metric-key train_samples_per_second --repeat-times 1 --variations '--optim adamw_torch|--optim adamw_bnb_8bit|--optim adamw_hf|--optim adafactor|--optim adamw_apex_fused' --report-metric-keys train_loss --base-variation '--optim adamw_torch'
```
You can see that I'm telling the tool to compare 5 optimizers: adamw_torch, adamw_bnb_8bit, adamw_hf, adafactor, adamw_apex_fused.
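For reference, the same optimizer selection can also be made programmatically rather than via the CLI flag; here's a minimal sketch, assuming a model and dataset are already set up elsewhere:

```python
from transformers import TrainingArguments

# The string values accepted by `optim` are the same ones the --optim CLI flag takes.
args = TrainingArguments(
    output_dir="output_dir",
    per_device_train_batch_size=32,
    num_train_epochs=1,
    optim="adamw_bnb_8bit",  # or: adamw_torch, adamw_hf, adafactor, adamw_apex_fused
)
```

These args are then handed to `Trainer` as usual; the benchmark tool essentially re-runs the same training command once per `--variations` entry.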
Memory usage-wise, the optimizer state per parameter is:
- 2 bytes: adamw_bnb_8bit
- 4 bytes: adafactor
- 8 bytes: adamw_torch, adamw_hf, adamw_apex_fused
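To put those per-parameter numbers into perspective, here's a back-of-the-envelope sizing of the optimizer state for a t5-base-sized model (the ~220M parameter count is my approximation; model weights and gradients are not included):

```python
n_params = 220_000_000  # rough t5-base parameter count (approximation)

# optimizer-state bytes per parameter, as listed above
state_bytes = {
    "adamw_torch / adamw_hf / adamw_apex_fused": 8,  # two fp32 moments
    "adafactor": 4,
    "adamw_bnb_8bit": 2,  # two int8 moments
}

for name, bpp in state_bytes.items():
    print(f"{name:45s} ~{n_params * bpp / 2**30:.2f} GB")
```

So on this model the 8-bit optimizer saves on the order of 1.2GB of GPU memory compared to the full AdamW variants.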
*** Setup
When publishing benchmarks it's crucial to log the versions of the software used to run them, so here we go:
Datetime : 2023-03-10 20:55:38
Software:
transformers: 4.27.0.dev0
torch : 1.13.1
cuda : 11.7
python : 3.8.15
Hardware:
1 GPUs : NVIDIA A100 80GB PCIe, 79.21GB
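If you want to capture the same software/hardware fingerprint yourself, here's a small sketch using standard torch/transformers introspection (this is not the exact code the benchmark tool runs):

```python
import platform

import torch
import transformers

print(f"transformers: {transformers.__version__}")
print(f"torch       : {torch.__version__}")
print(f"cuda        : {torch.version.cuda}")
print(f"python      : {platform.python_version()}")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"gpu         : {props.name}, {props.total_memory / 2**30:.2f}GB")
```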
*** Results
Last year's benchmark showed that the speed-up percentages were about the same across fp16/bf16/fp32. Let's see what this year brings, plus how the new optimizer fares.
FP32
| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 102.77 | 0 | 2.21 |
| --optim adamw_bnb_8bit | 104.99 | 2 | 2.15 |
| --optim adamw_hf | 103.64 | 1 | 2.21 |
| --optim adafactor | 97.22 | -5 | 2.21 |
| --optim adamw_apex_fused | 106.12 | 3 | 2.21 |
Observations:
- The results are very different from the previous year's benchmark. While Adafactor is still the slowest, the rest are pretty close to each other.
- Very surprisingly, the quantized 8-bit BNB Adam optimizer is faster than PyTorch's 8-bytes-per-parameter Adam optimizer, while using only a quarter of the latter's optimizer-state memory! And its loss is even better!
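Outside of the HF Trainer, the BNB optimizer can also be used directly via the bitsandbytes library; a minimal sketch with a stand-in nn.Linear model (when you pass --optim adamw_bnb_8bit the Trainer does the equivalent wiring for you):

```python
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(1024, 1024).cuda()  # stand-in model; bnb's 8-bit optimizers want CUDA params

# 8-bit Adam stores its two momentum states in int8 (~2 bytes/param)
# instead of fp32 (~8 bytes/param)
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
```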
BF16
(added --bf16 to the base command line)
| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 323.18 | 0 | 2.22 |
| --optim adamw_bnb_8bit | 348.29 | 8 | 2.16 |
| --optim adamw_hf | 333.07 | 3 | 2.22 |
| --optim adafactor | 274.36 | -15 | 2.22 |
| --optim adamw_apex_fused | 359.46 | 11 | 2.22 |
Observations:
- Again BNB beats every other optimizer at loss, while being only second to apex in speed.
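As a reminder, the Diff % column is just the relative change of train_samples_per_second vs. the base variation passed via --base-variation (--optim adamw_torch), rounded to a whole percent; checking the bf16 bnb row by hand:

```python
base = 323.18  # adamw_torch, bf16
bnb  = 348.29  # adamw_bnb_8bit, bf16

print(round(100 * (bnb - base) / base))  # -> 8, matching the table
```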
FP16
(added --fp16 to the base command line)
| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 370.09 | 0 | 2.55 |
| --optim adamw_bnb_8bit | 383.21 | 4 | 2.45 |
| --optim adamw_hf | 373.66 | 1 | 2.55 |
| --optim adafactor | 356.84 | -4 | 2.53 |
| --optim adamw_apex_fused | 380.50 | 3 | 2.55 |
Observations:
- Here BNB even managed to beat apex. But since I ran each variation only once, re-running multiple times might show a slightly different outcome.
- Somehow BF16 appears to be slower than fp16, but it gives a much better loss (the same loss as fp32). I wonder why?!
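One likely factor, offered as a hypothesis rather than something this benchmark verifies: bf16 keeps fp32's exponent range but has fewer mantissa bits, while fp16 has finer precision but a much smaller range, so fp16 needs loss scaling and overflows more easily. torch.finfo shows the tradeoff:

```python
import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    fi = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={fi.max:.3e}  eps={fi.eps:.3e}")

# approximate output:
# torch.float32   max≈3.4e+38  eps≈1.2e-07
# torch.bfloat16  max≈3.4e+38  eps≈7.8e-03
# torch.float16   max≈6.6e+04  eps≈9.8e-04
```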
new addition! --optim adamw_torch_fused
edit: we have since added --optim adamw_torch_fused to the HF Trainer, which runs almost as fast as --optim adamw_apex_fused. This option requires torch>=2.0 for fp32 and bf16, and torch>2.0 for fp16 (some fp16-related bugs in torch==2.0 were fixed only in later versions). E.g. here is the fp16 comparison:
| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch_fused | 387.10 | 3 | 2.66 |
| --optim adamw_torch | 377.61 | 0 | 2.66 |
| --optim adamw_apex_fused | 389.49 | 3 | 2.66 |
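For reference, --optim adamw_torch_fused selects PyTorch's own fused AdamW kernel; a minimal sketch of using it directly, assuming torch>=2.0 and a CUDA-resident stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()  # fused AdamW requires CUDA parameters

# fused=True picks the single-kernel CUDA implementation of AdamW
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)
```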
Could you add some Lion benchmarks?
It's not in the HF Trainer's arsenal of optimizers; if you'd like to make a PR to integrate it, then it can be done.