
Add mxfp8 path

Open drisspg opened this issue 6 months ago • 0 comments

Stacked PRs:

  • ->#1190

Add mxfp8 path

with-proxy CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --model.print_after_conversion --training.compile --training.steps 50 --model.converters mxfloat8 --float8.recipe_name "mxfp8"

Review highlight

I wish we could do polls in PRs. Failing that, please bikeshed all the names, and whether we should add a config manager section and JobConfig entry separate from the float8 section.

Also, here is a version that encodes all of it into the recipe name: https://github.com/pytorch/torchtitan/pull/1189

Logs:

[rank0]:/home/drisspg/.conda/envs/nightly/lib/python3.12/site-packages/torch/_inductor/lowering.py:1881: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank0]:  warnings.warn(
[rank0]:[titan] 2025-05-13 18:27:58,957 - root - INFO - step:  1  loss: 12.2382  memory: 30.54GiB(17.12%)  tps: 691  tflops: 40.00  mfu: 12.82%
[rank0]:[titan] 2025-05-13 18:27:58,957 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-05-13 18:28:05,743 - root - INFO - step: 10  loss:  9.8639  memory: 38.04GiB(21.32%)  tps: 10,866  tflops: 629.32  mfu: 201.71%
[rank0]:[titan] 2025-05-13 18:28:13,048 - root - INFO - step: 20  loss:  8.3960  memory: 38.04GiB(21.32%)  tps: 11,215  tflops: 649.50  mfu: 208.17%
[rank0]:[titan] 2025-05-13 18:28:20,337 - root - INFO - step: 30  loss:  7.7323  memory: 38.04GiB(21.32%)  tps: 11,241  tflops: 651.01  mfu: 208.66%
[rank0]:[titan] 2025-05-13 18:28:27,637 - root - INFO - step: 40  loss:  7.3134  memory: 38.04GiB(21.32%)  tps: 11,223  tflops: 649.96  mfu: 208.32%
[rank0]:[titan] 2025-05-13 18:28:34,172 - root - INFO - [GC] Peforming periodical GC collection. 0.04 seconds.
[rank0]:[titan] 2025-05-13 18:28:34,993 - root - INFO - step: 50  loss:  7.0908  memory: 38.04GiB(21.32%)  tps: 11,138  tflops: 645.03  mfu: 206.74%
[rank0]:[titan] 2025-05-13 18:28:34,993 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-05-13 18:28:36,994 - root - INFO - Training completed
[rank0]:[titan] 2025-05-13 18:28:39,313 - root - INFO - Process group destroyed.
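
For reviewers who want to compare these numbers against another run, here is a minimal sketch that parses the step lines above and summarizes them. It is not part of the PR; the names `LOG`, `STEP_RE`, and `parse_steps` are made up for illustration, and the regular expression assumes the exact log layout shown above. It also back-computes the peak throughput that the reported MFU implies, on the assumption that MFU is reported as achieved TFLOPS divided by a peak TFLOPS figure.

```python
import re

# Step lines copied from the log above, with the rank/timestamp prefix trimmed.
LOG = """\
step:  1  loss: 12.2382  memory: 30.54GiB(17.12%)  tps: 691  tflops: 40.00  mfu: 12.82%
step: 10  loss:  9.8639  memory: 38.04GiB(21.32%)  tps: 10,866  tflops: 629.32  mfu: 201.71%
step: 20  loss:  8.3960  memory: 38.04GiB(21.32%)  tps: 11,215  tflops: 649.50  mfu: 208.17%
step: 30  loss:  7.7323  memory: 38.04GiB(21.32%)  tps: 11,241  tflops: 651.01  mfu: 208.66%
step: 40  loss:  7.3134  memory: 38.04GiB(21.32%)  tps: 11,223  tflops: 649.96  mfu: 208.32%
step: 50  loss:  7.0908  memory: 38.04GiB(21.32%)  tps: 11,138  tflops: 645.03  mfu: 206.74%
"""

# Hypothetical pattern matching the layout of the step lines shown above.
STEP_RE = re.compile(
    r"step:\s*(?P<step>\d+)\s+loss:\s*(?P<loss>[\d.]+).*?"
    r"tps:\s*(?P<tps>[\d,]+)\s+tflops:\s*(?P<tflops>[\d.]+)\s+mfu:\s*(?P<mfu>[\d.]+)%"
)


def parse_steps(text):
    """Return one dict per step line found in `text`."""
    rows = []
    for m in STEP_RE.finditer(text):
        rows.append(
            {
                "step": int(m["step"]),
                "loss": float(m["loss"]),
                "tps": int(m["tps"].replace(",", "")),  # drop thousands separator
                "tflops": float(m["tflops"]),
                "mfu": float(m["mfu"]),  # percent
            }
        )
    return rows


rows = parse_steps(LOG)

# Step 1 includes compilation and warm-up, so exclude it from the average.
steady = [r for r in rows if r["step"] > 1]
avg_tps = sum(r["tps"] for r in steady) / len(steady)

# Assumption: mfu (%) = 100 * tflops / peak_tflops, so peak = tflops / (mfu / 100).
implied_peak = [r["tflops"] / (r["mfu"] / 100) for r in rows]

print(f"average tps after warm-up: {avg_tps:.1f}")
print("implied peak TFLOPS per step:", [round(p, 1) for p in implied_peak])
```

Under that assumption, every step implies a peak of roughly 312 TFLOPS, which is the A100 dense BF16 figure. An MFU above 100% therefore likely means the peak used in the MFU calculation does not match the hardware this run used, so the tps and tflops columns are the more reliable numbers to compare.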

drisspg, May 14 '25 01:05