No in-built functionality for tracking of metrics during training
This is a feature request bordering on a bug. Right now, flambe does not provide a way to track metrics during training. This, however, is essential for monitoring learning.
One problem I see is that it does not make sense to compute the training metrics only after an entire training epoch, as flambe does for test/eval metrics. Given the size of some datasets, that is not really feasible.
Consequently, the metric interface needs to be able to accommodate incremental computation of the metrics. That, in turn, requires a decision as to how this should be implemented, partly because not every metric supports incremental computation (think: AUC). Unfortunately, incremental computation requires keeping track of previous computations - i.e., we need a state that we update incrementally.
Off the top of my head, these are the choices we have:
First option: make the metrics stateful.
- The metrics would then have to be "reset" at the beginning of each epoch
- An `incremental` method, added to the metric, could be used to update the metric
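A minimal sketch of what such a stateful metric could look like (all names, including `incremental` and `finalize`, are placeholders - nothing here exists in flambe yet):

```python
import torch


class StatefulAccuracy:
    """Hypothetical stateful accuracy metric; not existing flambe API."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        # Called at the beginning of each epoch to clear the accumulated state.
        self.correct = 0
        self.total = 0

    def incremental(self, preds: torch.Tensor, targets: torch.Tensor) -> None:
        # Update the running counts with one batch of predictions.
        self.correct += (preds.argmax(dim=1) == targets).sum().item()
        self.total += targets.size(0)

    def finalize(self) -> float:
        # Compute the final value from the accumulated state.
        return self.correct / max(self.total, 1)
```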
Second option: add a metric-state object.
- Flambe initializes a metric-state object at the beginning of each epoch.
- This metric-state object is passed into each `incremental` call of the metric (and possibly into every other call, to keep a uniform interface)
- Logging can happen automatically in a method of that state object
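A rough sketch of this option, assuming a hypothetical `MetricState` container that the trainer creates at the start of each epoch (again, all names are illustrative only):

```python
from collections import defaultdict
from typing import Callable, Dict

import torch


class MetricState:
    """Hypothetical per-epoch state container, created by the trainer."""

    def __init__(self) -> None:
        # One slot per metric; the metrics themselves never hold state.
        self.slots: Dict[str, dict] = defaultdict(dict)

    def log_all(self, log_fn: Callable[[str, float, int], None], step: int) -> None:
        # Logging lives on the state object, so the trainer makes a single call.
        for name, slot in self.slots.items():
            if 'value' in slot:
                log_fn(name, slot['value'], step)


class Accuracy:
    """Stateless metric: all intermediate values go into the MetricState."""

    name = 'Training/Accuracy'

    def incremental(self, preds: torch.Tensor, targets: torch.Tensor,
                    state: MetricState) -> None:
        slot = state.slots[self.name]
        slot['correct'] = slot.get('correct', 0) + (preds.argmax(dim=1) == targets).sum().item()
        slot['total'] = slot.get('total', 0) + targets.size(0)

    def finalize(self, state: MetricState) -> None:
        slot = state.slots[self.name]
        slot['value'] = slot['correct'] / max(slot['total'], 1)
```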
Third option: add local tracking for each metric (I don't think this is a good option, but I wanted to mention it for completeness)
- This works like the metric state object, but with individual state objects per metric.
We could also consider just computing the metrics on a per-batch level during training and logging that, but then things like dropout will affect the training metrics. That's true in your proposed solutions as well, unless you're thinking of doing this during the eval step?
The problem with the per-batch level is metrics like AUC. If we are using the batch as negatives (as is quite common), computing the AUC per batch will be much less accurate than computing it per epoch (using all samples from an epoch as negatives).
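To make that concrete: an AUC-style metric cannot fold its state into a couple of counters; it has to accumulate the raw scores for the whole epoch and only compute the value at the end. A hedged sketch (class name hypothetical, using `sklearn.metrics.roc_auc_score` for the final computation):

```python
import torch
from sklearn.metrics import roc_auc_score


class EpochAUC:
    """Hypothetical AUC metric: per-batch AUC values cannot simply be
    averaged, so the raw scores for the whole epoch are accumulated."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        self.scores: list = []
        self.targets: list = []

    def incremental(self, scores: torch.Tensor, targets: torch.Tensor) -> None:
        # No partial AUC here; only accumulate the raw values.
        self.scores.append(scores.detach().cpu())
        self.targets.append(targets.detach().cpu())

    def finalize(self) -> float:
        # AUC over the whole epoch, using every sample as a potential negative.
        return roc_auc_score(torch.cat(self.targets).numpy(),
                             torch.cat(self.scores).numpy())
```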
Besides, either approach would allow us to unify this (taken from `_eval_step` in `train.py`):
```python
log(f'{tb_prefix}Validation/Loss', val_loss, self._step)
log(f'{tb_prefix}Validation/{self.metric_fn}', val_metric, self._step)
log(f'{tb_prefix}Best/{self.metric_fn}', self._best_metric, self._step)  # type: ignore
for metric_name, metric in self.extra_validation_metrics.items():
    log(f'{tb_prefix}Validation/{metric_name}',
        metric(preds, targets).item(), self._step)  # type: ignore
```
With either

```python
for metric in self.metrics:
    metric.finalize()
    metric.log(log_func)  # log_func could be any log function, defaulting to the one above
```
Or

```python
for metric in self.metrics:
    metric.finalize(metrics_state)
    metric.log(log_func, metrics_state)
```
That has the additional advantage that we would natively support logging of more complex metrics. Imagine, e.g., a combined recall-precision-fscore metric that could jointly log all three. Or one that computes a conditional metric if, say, you have different types of samples; then it could log things like "accuracy for type1: ..." and "accuracy for type2: ...".
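As a purely illustrative sketch of such a compound metric (names and log keys invented here):

```python
from typing import Callable

import torch


class PrecisionRecallF1:
    """Hypothetical compound metric that logs several values at once."""

    def __init__(self) -> None:
        self.tp = self.fp = self.fn = 0

    def incremental(self, preds: torch.Tensor, targets: torch.Tensor) -> None:
        # preds and targets are assumed to be binary tensors of the same shape.
        self.tp += int(((preds == 1) & (targets == 1)).sum())
        self.fp += int(((preds == 1) & (targets == 0)).sum())
        self.fn += int(((preds == 0) & (targets == 1)).sum())

    def finalize(self) -> None:
        self.precision = self.tp / max(self.tp + self.fp, 1)
        self.recall = self.tp / max(self.tp + self.fn, 1)
        self.fscore = (2 * self.precision * self.recall
                       / max(self.precision + self.recall, 1e-12))

    def log(self, log_fn: Callable[[str, float, int], None], step: int) -> None:
        # One metric object, three logged values.
        log_fn('Training/Precision', self.precision, step)
        log_fn('Training/Recall', self.recall, step)
        log_fn('Training/F1', self.fscore, step)
```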
What do you propose to do with the training and validation loss over the whole dataset? Since people generally use torch loss objects which won't have the `incremental` logic?
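Purely as an illustration of what that incremental logic could look like - not a proposal for flambe's API - a thin wrapper around a torch loss (assuming `reduction='mean'`, the torch default) could keep a sample-weighted running average:

```python
import torch
import torch.nn as nn


class IncrementalLoss:
    """Hypothetical wrapper giving a torch loss an incremental interface."""

    def __init__(self, loss_fn: nn.Module) -> None:
        # Assumes the wrapped loss uses reduction='mean'.
        self.loss_fn = loss_fn
        self.reset()

    def reset(self) -> None:
        self.total = 0.0
        self.count = 0

    def incremental(self, preds: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        loss = self.loss_fn(preds, targets)
        # Weight by batch size so the epoch average equals the full-dataset loss.
        self.total += loss.item() * targets.size(0)
        self.count += targets.size(0)
        return loss  # still differentiable, so training can backprop through it

    def finalize(self) -> float:
        return self.total / max(self.count, 1)
```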