Add more comprehensive performance metrics

Open allenwang28 opened this issue 4 years ago • 4 comments

E.g.

  • p50, p95, p99 of examples/sec
  • Start up and wall time

allenwang28 avatar May 13 '20 17:05 allenwang28

This repo mainly passes along metrics that the user's test code computes - I don't think there's any way to get examples/sec after the test is over if the test code hasn't written it to Tensorboard. That would be a change to make in the model code.
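For context, a minimal sketch of what that change in model code might look like: a hypothetical helper (not part of this repo) that the training loop could call each step to compute examples/sec before writing it out via a summary writer.

```python
import time

class ThroughputLogger:
    """Hypothetical helper: tracks examples/sec per training step so the
    model code can write the value to Tensorboard via a summary writer."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._last = None  # timestamp of the previous step, if any

    def step(self, now=None):
        """Call once per training step; returns examples/sec for the
        interval since the previous step, or None on the first call."""
        now = time.monotonic() if now is None else now
        rate = None
        if self._last is not None and now > self._last:
            rate = self.batch_size / (now - self._last)
        self._last = now
        return rate
```

The returned rate would then be logged as a scalar each step, which is what makes per-step percentiles recoverable later.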

total_wall_time is already being computed for all the tests; you can see an example here

What would be a good definition for start up time?

zcain117 avatar May 13 '20 17:05 zcain117

Oh, maybe you meant to add support for percentiles for any metric written to Tensorboard, not to try to compute examples/sec. That should be doable.

zcain117 avatar May 13 '20 17:05 zcain117

Yep! I think mostly percentile support is what I had in mind for this feature request.

I think start up time is not as important, but it would be measured from the time the command executes to the time training starts.

I think time to accuracy would be another important statistic as well.

allenwang28 avatar May 13 '20 17:05 allenwang28

time_to_accuracy is also available now. A sample config that includes it: https://github.com/GoogleCloudPlatform/ml-testing-accelerators/tree/master/metrics_handler#metric_collection_config

Start up time is possible, but the user would need to write some event to Tensorboard to indicate that training has started. As a first step, we could just grab the earliest Tensorboard entry of any kind and use the delta between job start time and that earliest entry.
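That first-step heuristic could be sketched roughly like this (assuming events are available as dicts with an epoch-seconds `wall_time` field, which is how Tensorboard records timestamps; the helper name is made up):

```python
def startup_time_seconds(job_start_time, events):
    """Estimate start-up time as the gap between the job's start time and
    the earliest Tensorboard event of any kind. Both timestamps are in
    epoch seconds; returns None if no events were written."""
    if not events:
        return None
    earliest = min(e["wall_time"] for e in events)
    # Clamp to zero in case of clock skew between the job and Tensorboard.
    return max(0.0, earliest - job_start_time)
```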

zcain117 avatar May 13 '20 18:05 zcain117