
fix the outdated end2end training examples of moe+torchtitan

Open rakkit opened this issue 1 month ago • 3 comments

As titled: the example here uses some outdated TorchTitan APIs. This PR fixes them and aligns the example with other torchtitan end-to-end usages.

# usage: test fp8_rowwise (the default when --scaling_type is omitted)
python ./torchao/prototype/moe_training/examples/simple_moe_layer.py
# usage: test fp8_rowwise, passed explicitly
python ./torchao/prototype/moe_training/examples/simple_moe_layer.py --scaling_type fp8_rowwise

# usage: test mxfp8
python ./torchao/prototype/moe_training/examples/simple_moe_layer.py --scaling_type mxfp8

results: image

rakkit avatar Oct 24 '25 16:10 rakkit

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3242

Note: Links to docs will display an error until the docs builds have been completed.

:x: 1 New Failure

As of commit bf183746b5bd64d8db63a4854457b5a8b6f64cb8 with merge base 03c2d2897502457778cf9ba5cb9a66c6a406d8f3 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Oct 24 '25 16:10 pytorch-bot[bot]

I made a few changes to set the random seed and to allow scaling_type to be passed from the CLI.
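
For readers unfamiliar with the pattern, here is a minimal sketch of how a script could accept a scaling type on the command line and fix a seed. It is not the code from this PR: the `--seed` flag, the default seed value, and the helper names `build_parser` and `set_seed` are assumptions made for illustration, and only the `--scaling_type` flag and its two values (`fp8_rowwise`, `mxfp8`) come from the usage commands above.

```python
import argparse
import random

# The two values shown in the usage commands; fp8_rowwise is the stated default.
SCALING_TYPES = ("fp8_rowwise", "mxfp8")


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser (hypothetical helper, not taken from the PR)."""
    parser = argparse.ArgumentParser(
        description="Sketch of CLI handling for an MoE training example"
    )
    parser.add_argument(
        "--scaling_type",
        choices=SCALING_TYPES,
        default="fp8_rowwise",
        help="which scaling recipe to test",
    )
    # Assumption: the real script may hard-code its seed instead of exposing a flag.
    parser.add_argument("--seed", type=int, default=42, help="random seed")
    return parser


def set_seed(seed: int) -> None:
    """Seed Python's RNG so repeated runs draw the same random numbers."""
    random.seed(seed)
    # A PyTorch script would typically also call torch.manual_seed(seed) here.


if __name__ == "__main__":
    args = build_parser().parse_args()
    set_seed(args.seed)
    print(f"scaling_type={args.scaling_type} seed={args.seed}")
```

Restricting the flag with `choices` makes an unsupported scaling type fail at argument parsing rather than partway through training.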

rakkit avatar Oct 27 '25 18:10 rakkit

@danielvegamyhre yes, the output should be

> python ./torchao/prototype/moe_training/examples/simple_moe_layer.py 
step 0 loss: 2656.0
step 1 loss: 2624.0
step 2 loss: 2592.0
step 3 loss: 2560.0
step 4 loss: 2528.0
step 5 loss: 2512.0
step 6 loss: 2480.0
step 7 loss: 2448.0
step 8 loss: 2432.0
step 9 loss: 2416.0

I have already revised the PR message and put the screenshot there.

rakkit avatar Oct 27 '25 18:10 rakkit