Fix the outdated end-to-end MoE + TorchTitan training example
As titled. The example here uses some outdated TorchTitan APIs; this PR fixes them and aligns the example with the other TorchTitan end-to-end usages.
```shell
# test fp8_rowwise (the default scaling type)
python ./torchao/prototype/moe_training/examples/simple_moe_layer.py

# test fp8_rowwise, passed explicitly
python ./torchao/prototype/moe_training/examples/simple_moe_layer.py --scaling_type fp8_rowwise

# test mxfp8
python ./torchao/prototype/moe_training/examples/simple_moe_layer.py --scaling_type mxfp8
```
Results:
:x: 1 New Failure

As of commit bf183746b5bd64d8db63a4854457b5a8b6f64cb8 with merge base 03c2d2897502457778cf9ba5cb9a66c6a406d8f3, the following job has failed:

- PR Label Check / Check PR Labels (gh): Process completed with exit code 1.
I made a few changes to set the random seed and to allow passing `scaling_type` from the CLI.
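As a rough illustration of those two changes, a minimal sketch might look like the following. The flag name `--scaling_type` and its two choices come from the usage commands above; the function names and the seed value are hypothetical, and the actual example would seed PyTorch (e.g. via `torch.manual_seed`) rather than only the stdlib RNG.

```python
import argparse
import random


def parse_args(argv=None):
    # Hypothetical CLI sketch: expose the scaling type as a flag,
    # defaulting to fp8_rowwise as described in the PR.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--scaling_type",
        type=str,
        default="fp8_rowwise",
        choices=["fp8_rowwise", "mxfp8"],
    )
    return parser.parse_args(argv)


def set_seed(seed=42):
    # Fix the random seed so the per-step losses are reproducible
    # across runs. The real example would also call torch.manual_seed(seed).
    random.seed(seed)
```

This keeps the example deterministic, so reviewers can compare their per-step losses against the reference output below.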
@danielvegamyhre yes, the output should be:
```
python ./torchao/prototype/moe_training/examples/simple_moe_layer.py
step 0 loss: 2656.0
step 1 loss: 2624.0
step 2 loss: 2592.0
step 3 loss: 2560.0
step 4 loss: 2528.0
step 5 loss: 2512.0
step 6 loss: 2480.0
step 7 loss: 2448.0
step 8 loss: 2432.0
step 9 loss: 2416.0
```
I have already revised the PR message and put the screenshot there.