zcain117 comments

Results: 4 comments of zcain117

EOFerror at the end of multiprocessing

I get the same error when using multiprocess TPU training (also with the fairseq transformer model).

Add more comprehensive performance metrics

This repo mainly passes metrics that the user computes - I don't think there's any way to get examples/sec after the test is over if the user's test code hasn't...

Add more comprehensive performance metrics

Oh, maybe you meant adding support for percentiles for any metric written to TensorBoard, rather than trying to compute examples/sec. That should be doable.
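
For illustration only, here is a minimal Python sketch of the percentile idea in the comment above. It assumes the scalar values of one metric have already been read out of TensorBoard into a plain list. The names `percentile`, `summarize`, and `step_times` are hypothetical and are not part of the ml-testing-accelerators repo.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0-100) of values, using linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted list.
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return ordered[lower]
    # Interpolate between the two neighbouring samples.
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def summarize(values, percentiles=(50, 90, 99)):
    """Map each requested percentile to its value, e.g. {"p50": ...}."""
    return {f"p{q}": percentile(values, q) for q in percentiles}


if __name__ == "__main__":
    # Hypothetical per-step values for a single metric.
    step_times = [0.9, 1.0, 1.1, 1.2, 5.0]
    print(summarize(step_times))
```

Because the helper only needs a list of numbers, the same summary could in principle be applied to any scalar metric, which is what would make it independent of whether the user's test code computes examples/sec.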

Add more comprehensive performance metrics

`time_to_accuracy` is also available now. A sample config that includes it: https://github.com/GoogleCloudPlatform/ml-testing-accelerators/tree/master/metrics_handler#metric_collection_config

Start-up time is possible, but the user would need to write some event to TensorBoard to indicate...