Save O_PROJ on Fuji 70B-v2 for TRN2

Open apoorvtintin opened this issue 8 months ago • 5 comments

Saving the out-projection (O_PROJ) improves training throughput while still fitting in the mesh defined by neuron-(trn2|trn2n).48xlarge-64.
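For readers unfamiliar with the mesh name above: it is a regular expression rather than a single instance type. The sketch below shows which selector strings that pattern covers. It assumes the pattern is applied with a full-string match; `matches_mesh_rule` and the example selector strings are hypothetical names for illustration, not axlearn APIs.

```python
import re

# Pattern copied verbatim from the PR description.
MESH_RULE_PATTERN = r"neuron-(trn2|trn2n).48xlarge-64"


def matches_mesh_rule(mesh_selector: str) -> bool:
    """Return True if the whole selector string matches the pattern.

    Assumption: the pattern must match the full string, not a prefix.
    """
    return re.fullmatch(MESH_RULE_PATTERN, mesh_selector) is not None


if __name__ == "__main__":
    # Hypothetical selector strings used only to exercise the pattern.
    for selector in (
        "neuron-trn2.48xlarge-64",
        "neuron-trn2n.48xlarge-64",
        "neuron-trn1.32xlarge-64",
    ):
        print(selector, matches_mesh_rule(selector))
```

Note that the unescaped `.` in the pattern matches any single character, so the pattern is slightly more permissive than a literal dot would be.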

apoorvtintin avatar May 01 '25 22:05 apoorvtintin

Rebased the PR to fix failing CI

apoorvtintin avatar May 13 '25 20:05 apoorvtintin

I see that CI fails on the test TestEvaluateFromFile.test_evaluate_from_eval_set with the error: #22 453.1 axlearn/open_api/common.py:440: KeyError

This failure is unrelated to the changes in this PR, and I have already rebased the PR to May 12. Could I please get some guidance on how to fix it? Thank you!

apoorvtintin avatar May 15 '25 17:05 apoorvtintin

I'm disabling this test here: https://github.com/apple/axlearn/pull/1184 cc @gyin94

markblee avatar May 15 '25 18:05 markblee

Rebased to disable flaky test

apoorvtintin avatar May 19 '25 18:05 apoorvtintin

This pull request has been automatically marked as stale because it has been inactive for 60 days. It will be closed in 7 days if no further activity occurs. If you would like to continue working on this, please remove the stale label or leave a comment.

github-actions[bot] avatar Oct 21 '25 02:10 github-actions[bot]