Tensorboard Scalars Cleanup
Hi, I recently tried to set up TensorBoard to plot the train/dev scores and errors during training. I got TensorBoard running and could find all the information I need, but the way it is presented looks rather complicated to me.
In the picture you can see how a training with 5 epochs and CE as the only loss currently looks for me. As far as I know, that's also the only way RETURNN displays values in TensorBoard, but please correct me if I am wrong.
The following things I find rather unintuitive:
- There is one run for each dev epoch (I have not yet tested it for eval)
  - Imagine you want to compare 10 different trainings with 50 epochs each. That would result in 10*50 = 500 runs, which prevents any overview.
- The x-axis displays steps and not epochs
  - I think when looking at results you always want them to be calculated across the whole dataset and not small parts of it? If so, there is no value in looking at the results per step.
  - Displaying values per epoch would decrease the number of data points significantly. Looking at the example above, 10 values per plot would suffice for me: 5 train scores, 5 dev scores. Instead I get 3343 values per plot for the training alone ^^
- Too many plots
  - In the picture above there are 7 plots, some of which seem to carry redundant information.
  - These look somewhat intuitive and not redundant: objective/constraints, objective/loss, objective/loss/error_decision, objective/loss/error_output_output_prob, objective/loss/loss_output_output_prob
  - For the last two, I am a bit lost as to what they are good for: objective/loss/objective_loss_output_output_prob, objective/objective
How would the perfect setup look?
- One TensorBoard run for each dataset (train, dev, eval)
  - The train set is already correct
- Each run contains
  - one plot for each score/error
  - one plot for the combined score/(error?)
- The plots have epochs on the x-axis
  - Maybe an option for a step-wise display?
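A rough sketch of how such a layout could be produced with the plain TF1-style summary API (one writer per dataset, the epoch number used as the step). The directory and tag names here are only illustrative, not what RETURNN currently writes:

```python
import tensorflow as tf

# One FileWriter (= one TensorBoard run) per dataset, reused for all epochs.
writers = {name: tf.compat.v1.summary.FileWriter("tf_logdir/%s" % name)
           for name in ("train", "dev", "eval")}

def log_epoch_results(dataset, epoch, values):
    """values: e.g. {"score/output_prob": 1.23, "error/output_prob": 0.15}
    -- one scalar (= one plot) per score/error."""
    summary = tf.compat.v1.Summary(value=[
        tf.compat.v1.Summary.Value(tag=tag, simple_value=float(v))
        for tag, v in values.items()])
    # Using the epoch number as the global step puts epochs on the x-axis.
    writers[dataset].add_summary(summary, global_step=epoch)
    writers[dataset].flush()

log_epoch_results("dev", epoch=5,
                  values={"score/output_prob": 1.23, "error/output_prob": 0.15})
```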
Since RETURNN is a general framework, I am not too sure whether I can draw conclusions from my case to others. So it would be nice to know if others experience the same behavior, or maybe someone who knows the implementation can confirm that this is the only way it looks.
If so, would you agree with the points I am criticizing and the perfect-world setup?
For the last two, I am a bit lost as to what they are good for: objective/loss/objective_loss_output_output_prob, objective/objective
That is something I also did not understand completely. What is the difference between `loss`, `cost`, and `error` in RETURNN?
In general, I think @albertz mentioned that it might make sense to move RETURNN from an epoch-wise model to a step-wise model, i.e. to not have epochs anymore but to calculate the dev score every N steps. However, I do not know if this is on the roadmap right now.
I agree that epoch-wise scores would be nice. I think it might make sense to keep the step-wise plots and add epoch-wise plots in another folder, so that one can load either one or both by setting the TensorBoard logdir appropriately. This makes sense especially if the training is long and the TensorBoard event files become big, i.e. take longer to load.
The logic for the logdir and the TF event file writer is in Runner.run. It's pretty simple currently.
It was a design choice to have a separate logdir per dev/eval run, as a simple way to have each epoch visually separated. But as you point out, this is not really the optimal solution.
We can easily use the same logdir for all dev/eval runs.
The x-axis displays steps and not epochs
Everything in TensorBoard is designed around steps. TensorBoard (and TensorFlow in general) doesn't really have the concept of epochs at all. Instead, there is always one global step index, and that is always the x-axis in TensorBoard. We cannot really change that.
We could somehow "fake" it by writing a separate TF event file where we set step = epoch for the data we insert.
In the picture above there are 7 plots, some of which seem to carry redundant information
It will just log everything you have in your network: every individual loss (value + error), combined losses, constraints (L2 etc.), and the final combined objective (constraints + losses).
As far as I remember, it was easy to filter out some of them.
But some of them you can also simply remove by fixing/cleaning up your config. E.g. your network contains error_decision, even though it is of no use during training (it doesn't define a loss value, only the error, and the error is zero because it calculates the WER from reference to reference). You could maybe use only_on_search on this loss, or just remove this layer when not using search.
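For instance, if error_decision is a "decide" layer with an edit-distance loss (a hypothetical excerpt; layer and target names depend on your config), restricting it to search could look like this:

```python
# Hypothetical excerpt from a RETURNN network config.
network = {
    # ... encoder, output (search) layer, other losses ...
    "error_decision": {
        "class": "decide", "from": ["output"], "target": "classes",
        "loss": "edit_distance",
        # Only construct/evaluate this layer during search, so its always-zero
        # reference-vs-reference WER no longer shows up in the training logs.
        "only_on_search": True,
    },
}
```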
We could also be a bit clever and drop the constraints plot when no constraints are used.
What is the difference between `loss`, `cost`, and `error` in RETURNN?
This should be documented. (If not, open a separate issue about it.)
It's called `cost` for historical reasons. It's the same as `loss` in the TF event files. This is used for training, i.e. to get the gradient to minimize this value. `error` is just additional information which is not used for training but purely for logging. Usually it is the frame error rate, but it could also be something else (e.g. `edit_distance`).
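As a hypothetical example (layer and input names are made up), a softmax output layer with a CE loss produces both values:

```python
# Hypothetical excerpt from a RETURNN network config.
network = {
    "output": {
        "class": "softmax", "from": ["encoder"], "target": "classes",
        # "cost"/"loss": the cross-entropy value, minimized during training.
        "loss": "ce",
        # The corresponding "error" (here: frame error rate) is computed by the
        # same loss, but it is only logged and never enters the gradient.
    },
}
```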
it might make sense to move RETURNN from an epoch-wise model to a step-wise model, i.e. to not have epochs anymore but to calculate the dev score every N steps. However, I do not know if this is on the roadmap right now.
We already have that. Our concept of "epochs" doesn't necessarily correspond to anything anymore. It could be anything (via epoch split, or whatever other fancy dataset constructs you use).
An "epoch" just defines when you do checkpoints, calculate dev, and adopt the learning rate. The dataset could handle this such that you get an epoch just every N steps.
(Although, I think many users currently just use epoch split + maybe some filtering / curriculum learning, so there is still some correspondence for many users.)
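For example, with partition_epoch (a hedged sketch; the exact option set depends on the dataset class you use), one RETURNN "epoch" covers only a fraction of the corpus:

```python
# Hypothetical RETURNN config snippet: split the training corpus into 20 parts,
# so each RETURNN "epoch" (checkpoint, dev evaluation, LR update) sees 1/20 of
# the data -- effectively "an epoch every N steps".
train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],  # illustrative file name
    "partition_epoch": 20,
}
num_epochs = 200  # i.e. 10 full passes over the corpus
```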