FR: better handling for out-of-order and duplicate steps
Hi there,
For our particular application, we process metrics in a way that makes it hard to integrate with tensorboard without some finagling. Two features would be rad:
- Out-of-order steps: we process / write metrics asynchronously, meaning that the steps in the associated tf.Event files are written out of order, causing garbled tensorboard plots. It'd be awesome if tensorboard had a way to digest steps out of order, and plot using just step as the x-value, order-independent.
- Overwriting stale steps: we aggregate our metrics from different rounds of computation, meaning a metric written for a particular step (which we're overloading to mean something else for our application) can become stale. We need a way to overwrite this information as it becomes stale.
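To make the requested semantics concrete, here is a minimal sketch in plain Python. It does not use any TensorBoard API. The function name `dedupe_and_sort` and the sample data are hypothetical, and "the last value written for a step wins" is our assumption about how stale steps should be overwritten.

```python
from typing import Dict, Iterable, List, Tuple


def dedupe_and_sort(points: Iterable[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Return one (step, value) pair per step, ordered by step.

    `points` is given in write order. If a step appears more than once,
    the value written last replaces the earlier (stale) one.
    """
    latest: Dict[int, float] = {}
    for step, value in points:
        latest[step] = value  # a later write overwrites the stale value
    # Plot order depends only on the step, not on when it was written.
    return sorted(latest.items())


# Hypothetical write order: steps arrive out of order, and step 1 is rewritten.
raw = [(3, 0.30), (1, 0.90), (2, 0.55), (1, 0.80)]
cleaned = dedupe_and_sort(raw)
print(cleaned)
```

Ideally tensorboard would apply something like this when reading the event files, so that we don't have to buffer and rewrite them ourselves.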
Related to b/72185341 internally.
Any news on this? With the general GPU availability crisis, including on Google Cloud, this kind of use case is very common, as we rely more and more on resuming training from checkpoints.
Any news on this? It is very annoying with Google GKE spot instances: you are constrained to logging only at checkpoint steps.