ml-testing-accelerators
Add more comprehensive performance metrics
E.g.
- p50, p95, p99 of examples/sec
- Start-up time and wall time
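As a rough sketch of the percentile part of this request: given the per-step throughput values a test has written to TensorBoard, p50/p95/p99 are a straightforward aggregation. The values and the small `percentile` helper below are illustrative, not part of this repo's API; the helper uses linear interpolation (the same default method as NumPy's `percentile`).

```python
def percentile(values, pct):
    """Percentile via linear interpolation between closest ranks."""
    xs = sorted(values)
    k = (len(xs) - 1) * pct / 100.0  # fractional rank of the requested percentile
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

# Hypothetical per-step examples/sec values scraped from TensorBoard
# scalar summaries over the course of a test run.
examples_per_sec = [512.0, 498.5, 505.2, 410.0, 520.1, 515.3, 499.9, 503.4]

p50 = percentile(examples_per_sec, 50)
p95 = percentile(examples_per_sec, 95)
p99 = percentile(examples_per_sec, 99)
print(f"p50={p50:.1f} p95={p95:.1f} p99={p99:.1f}")
```

The same aggregation would apply to any scalar metric series, which is why percentile support generalizes beyond examples/sec.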
This repo mainly passes along metrics that the user's test computes - I don't think there's any way to get examples/sec after the test is over if the test code hasn't written it to TensorBoard. That change would need to be made in the model code.
total_wall_time is already being computed for all of the tests - you can see an example here
What would be a good definition for start up time?
Oh, maybe you meant adding support for percentiles of any metric written to TensorBoard, rather than trying to compute examples/sec. That should be doable.
Yep! Percentile support is mostly what I had in mind for this feature request.
I think start-up time is less important, but it would be measured from the time the command executes to the time training starts.
Another important statistic would be time to accuracy.
time_to_accuracy is also available now. A sample config that includes it: https://github.com/GoogleCloudPlatform/ml-testing-accelerators/tree/master/metrics_handler#metric_collection_config
Start-up time is possible, but the user would need to write some event to TensorBoard to indicate that training has started. As a first step, we could just grab the earliest TensorBoard entry of any kind and use the delta between the job start time and that earliest entry.
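That first-step heuristic can be sketched in a few lines: take the minimum wall_time across all recorded TensorBoard events and subtract the job's start timestamp. The timestamps below are illustrative; in practice the event wall times would come from reading the test's event files.

```python
# Approximate start-up time as (earliest TensorBoard event) - (job start).
# Hypothetical Unix timestamps, in seconds.
job_start_time = 1_600_000_000.0          # when the test command was executed
event_wall_times = [1_600_000_095.0,      # wall_time of each recorded event,
                    1_600_000_030.0,      # in no particular order
                    1_600_000_150.0]

startup_time = min(event_wall_times) - job_start_time
print(startup_time)  # → 30.0
```

This overestimates nothing that the test reports, but it can only be as precise as the earliest event the test happens to write, which is why an explicit "training started" event would be the more accurate follow-up.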