Add more comprehensive performance metrics

Open allenwang28 opened this issue 4 years ago • 4 comments

E.g.

  • p50, p95, p99 of examples/sec
  • Start up and wall time

allenwang28 avatar May 13 '20 17:05 allenwang28

This repo mainly passes along metrics that the user's test code computes - I don't think there's any way to get examples/sec after the test is over if the test code hasn't written it to Tensorboard. That would be a change to make in the model code.
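For context, a minimal sketch of what that change in model code might look like: a hypothetical helper (not part of this repo) that the training loop could call each step to compute examples/sec before writing it out via a summary writer.

```python
import time

class ThroughputLogger:
    """Hypothetical helper: tracks examples/sec per training step so the
    model code can write the value to Tensorboard via a summary writer."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._last = None  # timestamp of the previous step, if any

    def step(self, now=None):
        """Call once per training step; returns examples/sec for the
        interval since the previous step, or None on the first call."""
        now = time.monotonic() if now is None else now
        rate = None
        if self._last is not None and now > self._last:
            rate = self.batch_size / (now - self._last)
        self._last = now
        return rate
```

The returned rate would then be logged as a scalar each step, which is what makes per-step percentiles recoverable later.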

total_wall_time is already being computed for all the tests; you can see an example here

What would be a good definition for start up time?

zcain117 avatar May 13 '20 17:05 zcain117

Oh, maybe you meant to add support for percentiles for any metric written to Tensorboard, not to try to compute examples/sec. That should be doable.

zcain117 avatar May 13 '20 17:05 zcain117

Yep! I think mostly percentile support is what I had in mind for this feature request.

I think start up time is not as important, but it would be measured from the time the command executes to the time training starts.

I think time to accuracy would be another important statistic as well.

allenwang28 avatar May 13 '20 17:05 allenwang28

time_to_accuracy is also available now. A sample config that includes it: https://github.com/GoogleCloudPlatform/ml-testing-accelerators/tree/master/metrics_handler#metric_collection_config

Start up time is possible, but the user would need to write some event to Tensorboard to indicate that training has started. As a first step, we could just grab the earliest Tensorboard entry of any kind and use the delta between job start time and that earliest entry.
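That first-step heuristic could be sketched roughly like this (assuming events are available as dicts with an epoch-seconds `wall_time` field, which is how Tensorboard records timestamps; the helper name is made up):

```python
def startup_time_seconds(job_start_time, events):
    """Estimate start-up time as the gap between the job's start time and
    the earliest Tensorboard event of any kind. Both timestamps are in
    epoch seconds; returns None if no events were written."""
    if not events:
        return None
    earliest = min(e["wall_time"] for e in events)
    # Clamp to zero in case of clock skew between the job and Tensorboard.
    return max(0.0, earliest - job_start_time)
```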

zcain117 avatar May 13 '20 18:05 zcain117