[Not for Merge] [POC] Goodput async monitoring and upload to Tensorboard POC

Open dipannita08 opened this issue 9 months ago • 0 comments

This change adds the following:

  • Allows creating a monitor object that spins up a secondary "monitor & upload" thread, which queries the job's Goodput using the ml-goodput-measurement pip package and writes a scalar metric to TensorBoard (TB) once per interval period.
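
The monitor-thread pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this change: `GoodputMonitor`, `query_goodput`, and `write_scalar` are hypothetical names, and the actual ml-goodput-measurement and TensorBoard writer calls are deliberately left behind plain callables.

```python
import threading
from typing import Callable, Optional


class GoodputMonitor:
    """Hypothetical sketch of a background "monitor & upload" thread.

    query_goodput: assumed callable returning the current job Goodput as a float.
    write_scalar:  assumed callable (tag, value, step) that records a scalar,
                   for example a thin wrapper around a TensorBoard writer.
    """

    def __init__(
        self,
        query_goodput: Callable[[], float],
        write_scalar: Callable[[str, float, int], None],
        interval_seconds: float = 30.0,
    ) -> None:
        self._query_goodput = query_goodput
        self._write_scalar = write_scalar
        self._interval_seconds = interval_seconds
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self._step = 0

    def start(self) -> None:
        """Start the secondary thread; calling start twice is a no-op."""
        if self._thread is not None:
            return
        self._stop_event.clear()
        # Daemon thread so a hung query cannot block interpreter shutdown.
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # Event.wait returns False on timeout, so the loop body runs once per
        # interval and exits promptly when stop() sets the event.
        while not self._stop_event.wait(self._interval_seconds):
            try:
                value = self._query_goodput()
            except Exception:
                # A failed query should not kill the monitor; retry next interval.
                continue
            self._write_scalar("goodput", value, self._step)
            self._step += 1

    def stop(self) -> None:
        """Signal the thread to exit and wait for it to finish."""
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None
```

Keeping the query and the upload behind callables means the training loop only calls `start()` before the first step and `stop()` after the last one, and the thread never touches training state directly.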

Tested:

  • [x] Example run on v4-8 w/ ~180 steps here

Note: This is a POC, and this change is intended to be moved to the cloud-accelerator-diagnostics and goodput package eventually.

dipannita08 avatar May 14 '24 20:05 dipannita08