
Draft PR to add GoodPut Measurement

Open jiya-zhang opened this issue 1 year ago • 0 comments

Adding demo code as a skeleton to integrate GoodPut measurement into AXLearn.

To-dos:

  1. Consider localizing GoodPut measurement to within trainer.py
  2. Make the measurement configurable
  3. Automatically pick up run_name from job config
  4. To obtain GoodPut throughout a training run's lifetime, consider repeatedly calling the GoodPut Calculator and writing the results to a persistent data store
  5. Add tests in AXLearn
  6. Test E2E for multislice jobs and larger single slice jobs
  7. Test E2E for long-running jobs (8hr+)
  8. Test jobs running on v5e and v5p (so far, we have only tested v4-8)
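
For item 4, the periodic measurement could look roughly like the sketch below. It is only an illustration, not code from this PR: `record_goodput`, the `compute_goodput` callable, and the JSON Lines file layout are all hypothetical names chosen here. In a real integration, `compute_goodput` would wrap whatever call the GoodPut Calculator exposes, and the local file would be replaced by an actual persistent store.

```python
import json
import time
from pathlib import Path
from typing import Callable


def record_goodput(
    compute_goodput: Callable[[], float],  # hypothetical: wraps the GoodPut Calculator call
    store_path: Path,                      # hypothetical: stand-in for a persistent data store
    run_name: str,
    *,
    num_samples: int,
    interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Samples GoodPut `num_samples` times and appends each sample as one JSON line."""
    store_path.parent.mkdir(parents=True, exist_ok=True)
    for i in range(num_samples):
        record = {
            "run_name": run_name,
            "timestamp": time.time(),
            "goodput": compute_goodput(),
        }
        # Append mode keeps earlier samples if the job restarts and resumes logging.
        with store_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        # Skip the wait after the final sample.
        if i + 1 < num_samples:
            sleep(interval_seconds)
```

Passing `sleep` as a parameter keeps the loop testable without real delays. Whether this loop should live inside trainer.py (item 1) or in a separate process is left open here.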

jiya-zhang, Mar 14 '24 22:03