[BUG] - Measuring performance at the end of each epoch disables training logging
The title says pretty much everything.
If I want to measure the performance of the model only at the end of every epoch, the `ClassifierTrainer` (for example) offers the possibility of setting the logging frequency to a value <= 0.
In this way, I measure the performance at the end of every epoch on the validation set (it works).
However, on TensorBoard I only see the plot of the validation curves; the training curves aren't displayed anymore.
The issue was actually related to `measure_performance_freq=-1`.
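For concreteness, a minimal sketch of the setup being described. Only `measure_performance_freq=-1` comes from this issue; the import paths, the remaining constructor arguments, and their names are assumptions about the `ClassifierTrainer` API, not verified code.

```python
import tensorflow as tf
from ashpy.trainers import ClassifierTrainer  # import path assumed
from ashpy.losses import ClassifierLoss       # import path assumed

# Hypothetical setup: everything except `measure_performance_freq` is an
# assumption about the API, shown only to illustrate the configuration above.
trainer = ClassifierTrainer(
    model=tf.keras.Sequential([tf.keras.layers.Dense(10)]),
    optimizer=tf.keras.optimizers.Adam(),
    loss=ClassifierLoss(
        tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    ),
    epochs=5,
    logdir="logs",
    measure_performance_freq=-1,  # <= 0: measure only at the end of each epoch
)
```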
Let's define a coherent behavior.
The parameter `measure_performance_freq` is responsible for the frequency of performance measurements.
In the current implementation the average loss is measured as a performance metric.
We can also measure the performance on the training set at the end of every epoch (easy).
Regarding the per-step training curve, we have to decide what to do.
> The parameter `measure_performance_freq` is responsible for the frequency of performance measurements.

Performance measurements during both training and validation.

> In the current implementation the average loss is measured as a performance metric.

Exactly, thus it should follow the same logging frequency as any other metric.

> We can also measure the performance on the training set at the end of every epoch (easy).

We definitely should do this; it makes no sense to have a plot like the one I reported.

> Regarding the per-step training curve, we have to decide what to do.

Suggestions?
Let's talk about the Classifier case.
`ClassifierLoss` (the metric) uses the loss defined in the classifier to measure the actual loss.
In the `ClassifierTrainer` the `ClassifierLoss` is added to the set of metrics. The metrics are treated transparently, and the only method to call when there is a need to measure performance is `_measure_performance()`, which loops over the metrics and takes care of everything.
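A rough sketch of what such a metric loop might look like; the function name and signature below are hypothetical, not the actual `_measure_performance` implementation.

```python
import tensorflow as tf

def measure_performance(metrics, dataset, model, step, writer):
    """Hypothetical sketch of a metric loop: accumulate each metric over a
    dataset and log its result to TensorBoard. Not the actual AshPy code."""
    with writer.as_default():
        for metric in metrics:
            metric.reset_states()
            for features, labels in dataset:
                metric.update_state(labels, model(features, training=False))
            tf.summary.scalar(metric.name, metric.result(), step=step)
```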
So, in order to plot the training loss at each step on TensorBoard we can:

1. Directly use the `avg_loss` attribute of the classifier in the training loop. In this case, however, we should take care not to plot the same metric twice.
2. Remove `avg_loss` from the metrics and treat it in a different way.
3. Refactor everything.
4. Directly plot the loss returned from the `train_step`.
5. I don't know.
Point 3: refactor everything.
It often happens that our loss function is a composition of executors (e.g. a sum executor): Term A + Term B.
With the current way of logging, I can only get the plot of "Term A + Term B" and not three plots (see the sketch after the list):
- Term A
- Term B
- Term A + Term B
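A minimal sketch of what per-term logging could look like for a summed loss; the class below is hypothetical and is not the AshPy Executor API.

```python
import tensorflow as tf

class SumLossWithLogging:
    """Hypothetical composed loss: evaluates each term, logs each term and
    their sum to TensorBoard, and returns the total. Not the AshPy Executor."""

    def __init__(self, terms, name="loss"):
        self.terms = terms  # list of (term_name, callable) pairs
        self.name = name

    def __call__(self, y_true, y_pred, step):
        values = {term_name: fn(y_true, y_pred) for term_name, fn in self.terms}
        total = tf.add_n(list(values.values()))
        # One scalar per term plus the total -> three curves for two terms.
        for term_name, value in values.items():
            tf.summary.scalar(f"{self.name}/{term_name}", value, step=step)
        tf.summary.scalar(f"{self.name}/total", total, step=step)
        return total
```

With, for example, an adversarial term and an L1 term, this yields the three curves listed above (each term plus their sum), provided the calls happen under a default summary writer.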
I added the logging of the losses. However, there is one thing to remember: in a distributed execution with, for example, 2 GPUs, we have two calls to the `_train_step` method.
In the associated PR I implemented the TensorBoard logging inside the loss object (executor). However, in a distributed scenario the object is the same, thus we have a concurrent execution of the `loss.log` method, and only the last log is taken into account. Also, in a distributed execution we should reduce the value of each term of the loss across the devices.
This becomes difficult in the current implementation.
Moreover, the loss does not know the strategy at the moment. It could detect the strategy, but in any case the computed tensors are just plain tensors.
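For the reduction part, a minimal sketch of the standard `tf.distribute` pattern: reduce each per-replica loss term in the training loop (outside the replica context) and write a single scalar per term. The `train_step` body and the term names below are illustrative, not AshPy code.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

def train_step(inputs):
    # Stand-in for the real train step: returns the individual loss terms
    # computed on this replica (hypothetical names and values).
    lsgan_term = tf.reduce_mean(tf.square(inputs - 1.0))
    l1_term = tf.reduce_mean(tf.abs(inputs))
    return {"GeneratorLSGAN": lsgan_term, "GeneratorL1": l1_term}

@tf.function
def distributed_step(dist_inputs):
    per_replica_terms = strategy.run(train_step, args=(dist_inputs,))
    # Reduce every term across replicas so a single value per term is logged.
    return {
        name: strategy.reduce(tf.distribute.ReduceOp.MEAN, value, axis=None)
        for name, value in per_replica_terms.items()
    }

writer = tf.summary.create_file_writer("logs")
with writer.as_default():
    terms = distributed_step(tf.ones((8, 4)))
    for name, value in terms.items():
        tf.summary.scalar(f"loss/{name}", value, step=0)
    tf.summary.scalar("loss/total", tf.add_n(list(terms.values())), step=0)
```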
The current implementation is visible at: https://tensorboard.dev/experiment/e4pojFWfQWeksPVMQJLTgA/#scalars
`ashpy/losses/GeneratorLSGAN` and `ashpy/losses/GeneratorL1` are the sublosses of `ashpy/losses/Pix2PixLoss`.