torchtitan
Add mxfp8 path
Stacked PRs:
- ->#1190
with-proxy CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --model.print_after_conversion --training.compile --training.steps 50 --model.converters mxfloat8 --float8.recipe_name "mxfp8"
Review highlight
I wish we could do polls in PRs. Please bikeshed all the names, and whether we should add a config manager and JobConfig entry separate from the float8 section.
Also, here is a version that encodes all of it into the recipe name: https://github.com/pytorch/torchtitan/pull/1189
Logs:
[rank0]:/home/drisspg/.conda/envs/nightly/lib/python3.12/site-packages/torch/_inductor/lowering.py:1881: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank0]: warnings.warn(
[rank0]:[titan] 2025-05-13 18:27:58,957 - root - INFO - step: 1 loss: 12.2382 memory: 30.54GiB(17.12%) tps: 691 tflops: 40.00 mfu: 12.82%
[rank0]:[titan] 2025-05-13 18:27:58,957 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-05-13 18:28:05,743 - root - INFO - step: 10 loss: 9.8639 memory: 38.04GiB(21.32%) tps: 10,866 tflops: 629.32 mfu: 201.71%
[rank0]:[titan] 2025-05-13 18:28:13,048 - root - INFO - step: 20 loss: 8.3960 memory: 38.04GiB(21.32%) tps: 11,215 tflops: 649.50 mfu: 208.17%
[rank0]:[titan] 2025-05-13 18:28:20,337 - root - INFO - step: 30 loss: 7.7323 memory: 38.04GiB(21.32%) tps: 11,241 tflops: 651.01 mfu: 208.66%
[rank0]:[titan] 2025-05-13 18:28:27,637 - root - INFO - step: 40 loss: 7.3134 memory: 38.04GiB(21.32%) tps: 11,223 tflops: 649.96 mfu: 208.32%
[rank0]:[titan] 2025-05-13 18:28:34,172 - root - INFO - [GC] Peforming periodical GC collection. 0.04 seconds.
[rank0]:[titan] 2025-05-13 18:28:34,993 - root - INFO - step: 50 loss: 7.0908 memory: 38.04GiB(21.32%) tps: 11,138 tflops: 645.03 mfu: 206.74%
[rank0]:[titan] 2025-05-13 18:28:34,993 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-05-13 18:28:36,994 - root - INFO - Training completed
[rank0]:[titan] 2025-05-13 18:28:39,313 - root - INFO - Process group destroyed.
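The logs above report MFU above 100%, so a quick sanity check on the numbers may help reviewers. The sketch below assumes MFU is reported as achieved TFLOPS divided by an assumed per-device peak (that formula is an assumption here, not something this PR states), and back-computes the peak implied by each logged pair. The `samples` values are copied from the log lines; the helper name `implied_peak_tflops` is hypothetical.

```python
# (step, tflops, mfu_percent) copied from the rank0 log lines above.
samples = [
    (1, 40.00, 12.82),
    (10, 629.32, 201.71),
    (20, 649.50, 208.17),
    (30, 651.01, 208.66),
    (40, 649.96, 208.32),
    (50, 645.03, 206.74),
]


def implied_peak_tflops(tflops: float, mfu_percent: float) -> float:
    """Peak TFLOPS implied by a log line, assuming mfu = tflops / peak."""
    return tflops / (mfu_percent / 100.0)


# Hypothetical helper applied to every logged step.
implied = [implied_peak_tflops(t, m) for _, t, m in samples]

for (step, _, _), peak in zip(samples, implied):
    print(f"step {step:>2}: implied peak = {peak:.1f} TFLOPS")
```

Under that assumption, every step implies a peak of roughly 312 TFLOPS. That figure matches the dense BF16 peak of an A100, so the MFU column may be normalized against a peak that does not correspond to the device or dtype actually used. The hardware is not stated in this PR, so treat this as an observation to verify, not a conclusion; the tps and tflops columns are unaffected either way.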