maxtext
[Not for Merge] [POC] Goodput async monitoring and upload to Tensorboard POC
This change adds the following:
- Allows creating a monitor object that spins up a secondary "monitor & upload" thread. The thread queries the job's Goodput using the ml-goodput-measurement pip package and writes it to TB as a scalar metric once every interval period.
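
For reviewers who want a picture of the threading pattern, here is a minimal sketch. It is not the code in this PR: the `GoodputMonitor` class, its method names, and the two injected callables (`get_goodput`, `write_scalar`) are hypothetical stand-ins for the ml-goodput-measurement query and the TensorBoard writer, and the `step` value is assumed to be a simple upload counter.

```python
import threading


class GoodputMonitor:
    """Hypothetical sketch: polls a goodput source on a background thread."""

    def __init__(self, get_goodput, write_scalar, interval_sec=30.0):
        # get_goodput: callable returning the current goodput as a float
        #   (assumed stand-in for the ml-goodput-measurement query).
        # write_scalar: callable(tag, value, step)
        #   (assumed stand-in for a TensorBoard summary writer).
        self._get_goodput = get_goodput
        self._write_scalar = write_scalar
        self._interval_sec = interval_sec
        self._stop_event = threading.Event()
        self._thread = None
        self._upload_count = 0

    def start(self):
        """Starts the monitor & upload thread; a second call is a no-op."""
        if self._thread is not None:
            return
        self._stop_event.clear()
        # Daemon thread so a crashed or finished job is never kept alive by it.
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called,
        # so the loop wakes every interval and exits promptly on stop.
        while not self._stop_event.wait(self._interval_sec):
            try:
                value = self._get_goodput()
                self._write_scalar("goodput", value, self._upload_count)
                self._upload_count += 1
            except Exception as exc:  # keep monitoring despite a failed query
                print(f"Goodput upload failed: {exc}")

    def stop(self):
        """Signals the thread to exit and waits for it to finish."""
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None
```

Waiting on a `threading.Event` rather than calling `time.sleep` lets `stop()` interrupt the wait immediately, so shutdown does not block for a full interval.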
Tested:
- [x] Example run on v4-8 w/ ~180 steps here
Note: This is a POC. The change is intended to move to the cloud-accelerator-diagnostics and goodput packages eventually.