[BUG] - Measuring performance at the end of each epoch disables training logging
The title says pretty much everything.
If I want to measure the performance of the model only at the end of every epoch, the `ClassifierTrainer` (for example) offers the possibility of setting the logging frequency to a value <= 0.
In this way, I measure the performance at the end of every epoch on the validation set (it works).
However, on TensorBoard I only see the plot of the validation curves; the training curves aren't displayed anymore.
The issue was actually related to `measure_performance_freq=-1`.
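For concreteness, a minimal sketch of the setup being described. Only `measure_performance_freq=-1` comes from this issue; the import paths, the remaining constructor arguments, and their names are assumptions about the `ClassifierTrainer` API, not verified code.

```python
import tensorflow as tf
from ashpy.trainers import ClassifierTrainer  # import path assumed
from ashpy.losses import ClassifierLoss       # import path assumed

# Hypothetical setup: everything except `measure_performance_freq` is an
# assumption about the API, shown only to illustrate the configuration above.
trainer = ClassifierTrainer(
    model=tf.keras.Sequential([tf.keras.layers.Dense(10)]),
    optimizer=tf.keras.optimizers.Adam(),
    loss=ClassifierLoss(
        tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    ),
    epochs=5,
    logdir="logs",
    measure_performance_freq=-1,  # <= 0: measure only at the end of each epoch
)
```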
Let's define a coherent behavior.
The parameter `measure_performance_freq` is responsible for the frequency of performance measurements.
In the current implementation the average loss is measured as a performance metric.
We can also measure the performance on the training set at the end of every epoch (easy).
Regarding the per-step training curve, we have to decide what to do.
> The parameter `measure_performance_freq` is responsible for the frequency of performance measurements.

Performance measurements during both training and validation.

> In the current implementation the average loss is measured as a performance metric.

Exactly, thus it should follow the same logging frequency as any other metric.

> We can also measure the performance on the training set at the end of every epoch (easy).

We definitely should do this; it makes no sense to have a plot like the one I reported.

> Regarding the per-step training curve, we have to decide what to do.

Suggestions?
Let's talk about the Classifier case.
`ClassifierLoss` (the metric) uses the loss defined in the classifier to measure the actual loss.
In the `ClassifierTrainer` the `ClassifierLoss` is added to the set of metrics. The metrics are treated transparently, and the only method to call when there is a need to measure performance is `_measure_performance()`, which loops over the metrics and takes care of everything.
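A rough sketch of what such a metric loop might look like; the function name and signature below are hypothetical, not the actual `_measure_performance` implementation.

```python
import tensorflow as tf

def measure_performance(metrics, dataset, model, step, writer):
    """Hypothetical sketch of a metric loop: accumulate each metric over a
    dataset and log its result to TensorBoard. Not the actual AshPy code."""
    with writer.as_default():
        for metric in metrics:
            metric.reset_states()
            for features, labels in dataset:
                metric.update_state(labels, model(features, training=False))
            tf.summary.scalar(metric.name, metric.result(), step=step)
```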
So, in order to plot the training loss at each step on TensorBoard we can:

1. Directly use the `avg_loss` attribute of the classifier in the training loop. In this case, however, we should take care not to plot the same metric twice.
2. Remove `avg_loss` from the metrics and treat it in a different way.
3. Refactor everything.
4. Directly plot the loss returned from the `train_step`.
5. I don't know.
Point 3: refactor everything.
It often happens that our loss function is a composition of executors (e.g. a sum executor): Term A + Term B.
With the current way of logging, I can only get the plot of "Term A + Term B" and not three plots (see the sketch after the list):
- Term A
- Term B
- Term A + Term B
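A minimal sketch of what per-term logging could look like for a summed loss; the class below is hypothetical and is not the AshPy Executor API.

```python
import tensorflow as tf

class SumLossWithLogging:
    """Hypothetical composed loss: evaluates each term, logs each term and
    their sum to TensorBoard, and returns the total. Not the AshPy Executor."""

    def __init__(self, terms, name="loss"):
        self.terms = terms  # list of (term_name, callable) pairs
        self.name = name

    def __call__(self, y_true, y_pred, step):
        values = {term_name: fn(y_true, y_pred) for term_name, fn in self.terms}
        total = tf.add_n(list(values.values()))
        # One scalar per term plus the total -> three curves for two terms.
        for term_name, value in values.items():
            tf.summary.scalar(f"{self.name}/{term_name}", value, step=step)
        tf.summary.scalar(f"{self.name}/total", total, step=step)
        return total
```

With, for example, an adversarial term and an L1 term, this yields the three curves listed above (each term plus their sum), provided the calls happen under a default summary writer.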
I added the logging of the losses. However, there is one thing to remember: in a distributed execution with, for example, 2 GPUs, we have two calls to the `_train_step` method.
In the associated PR I implemented the TensorBoard logging inside the loss object (executor). However, in a distributed scenario the object is the same, thus we have a concurrent execution of the `loss.log` method, and only the last log is taken into account. Also, in a distributed execution we should reduce the value of each term of the loss across the devices.
This becomes difficult in the current implementation.
Moreover, the loss does not know the strategy at the moment. It could detect the strategy, but in any case the computed tensors are just plain tensors.
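For the reduction part, a minimal sketch of the standard `tf.distribute` pattern: reduce each per-replica loss term in the training loop (outside the replica context) and write a single scalar per term. The `train_step` body and the term names below are illustrative, not AshPy code.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

def train_step(inputs):
    # Stand-in for the real train step: returns the individual loss terms
    # computed on this replica (hypothetical names and values).
    lsgan_term = tf.reduce_mean(tf.square(inputs - 1.0))
    l1_term = tf.reduce_mean(tf.abs(inputs))
    return {"GeneratorLSGAN": lsgan_term, "GeneratorL1": l1_term}

@tf.function
def distributed_step(dist_inputs):
    per_replica_terms = strategy.run(train_step, args=(dist_inputs,))
    # Reduce every term across replicas so a single value per term is logged.
    return {
        name: strategy.reduce(tf.distribute.ReduceOp.MEAN, value, axis=None)
        for name, value in per_replica_terms.items()
    }

writer = tf.summary.create_file_writer("logs")
with writer.as_default():
    terms = distributed_step(tf.ones((8, 4)))
    for name, value in terms.items():
        tf.summary.scalar(f"loss/{name}", value, step=0)
    tf.summary.scalar("loss/total", tf.add_n(list(terms.values())), step=0)
```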
The current implementation is visible at: https://tensorboard.dev/experiment/e4pojFWfQWeksPVMQJLTgA/#scalars
`ashpy/losses/GeneratorLSGAN` and `ashpy/losses/GeneratorL1` are the sublosses of `ashpy/losses/Pix2PixLoss`.