axlearn
Draft PR to add GoodPut Measurement
Adding demo code as a skeleton to integrate GoodPut measurement into AXLearn.
To-dos:
- Consider localizing GoodPut measurement to within trainer.py
- Make the measurement configurable
- Automatically pick up run_name from job config
- To track GoodPut over the lifetime of a training run, consider calling the GoodPut Calculator repeatedly and writing the results to a persistent data store
- Add tests in AXLearn
- Test E2E for multislice jobs and larger single slice jobs
- Test E2E for long-running jobs (8hr+)
- Test jobs running on v5e and v5p (so far, we have only tested on v4-8)
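
For the to-do about tracking GoodPut over a run's lifetime, here is a rough sketch of what a periodic recorder might look like. It is not part of this PR. `record_goodput_periodically`, `compute_goodput`, and the JSON Lines output format are hypothetical names and choices made for illustration; `compute_goodput` stands in for whatever call ends up wrapping the GoodPut Calculator, and the local file stands in for a real persistent store.

```python
import json
import time
from pathlib import Path
from typing import Callable


def record_goodput_periodically(
    compute_goodput: Callable[[], float],  # Hypothetical wrapper around the GoodPut Calculator.
    output_path: Path,                     # Stand-in for a persistent data store.
    interval_seconds: float,
    num_samples: int,
    sleep: Callable[[float], None] = time.sleep,  # Injectable so tests need not wait.
) -> None:
    """Samples GoodPut `num_samples` times and appends one JSON line per sample."""
    with output_path.open("a", encoding="utf-8") as f:
        for i in range(num_samples):
            record = {"timestamp": time.time(), "goodput": compute_goodput()}
            f.write(json.dumps(record) + "\n")
            # Flush after every sample so a preempted job loses at most one record.
            f.flush()
            if i < num_samples - 1:
                sleep(interval_seconds)
```

Appending one record per sample keeps earlier measurements intact across restarts, which matters for the long-running and multislice jobs listed above.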