summary: Add option to store **best** value?
By default the values for each metric at the latest step are being saved in the summary. It might be interesting to add an option to additionally save the best value for each metric.

An example use case would be the usual deep learning training loop where the model is being validated at the end of each epoch and the last epoch is not necessarily the one with the best performance. Having the best value saved in the summary could be more useful for comparing experiments (i.e. `dvc metrics diff --targets dvclive.json`).
A potential problem would be how to expose to the user (https://github.com/iterative/dvclive/issues/75#issuecomment-842186329) some options, like whether to use `save_best` or not and what to consider as better (i.e. higher or lower).
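A minimal sketch of what tracking a best value alongside the latest one could look like; `save_best` and `higher_is_better` are illustrative names from this discussion, not part of the current dvclive API:

```python
# Hypothetical sketch: keep both the latest and the best value per metric.
# `save_best` and `higher_is_better` are illustrative names, not dvclive API.
import json

class SummaryTracker:
    def __init__(self, save_best=True, higher_is_better=True):
        self.save_best = save_best
        self.higher_is_better = higher_is_better
        self.latest = {}
        self.best = {}

    def log(self, name, value):
        self.latest[name] = value
        if self.save_best:
            current = self.best.get(name)
            better = max if self.higher_is_better else min
            self.best[name] = value if current is None else better(current, value)

    def dump(self, path="dvclive.json"):
        summary = dict(self.latest)
        if self.save_best:
            summary.update({f"best_{k}": v for k, v in self.best.items()})
        with open(path, "w") as f:
            json.dump(summary, f, indent=4)
```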
As to problems: Related: #81 - both in this issue and here we need to come up with a way to store the data in a structured way. Maybe the summary JSON needs to be separated into "latest", "system" and "best" sections.

As to saving the best, I think we should let users provide a callable taking a list as argument/comparing two values, which by default would be `max(list)` or `max(x1, x2)`. That way users could provide their own methods to find the best.

This kind of solution might be quite short-sighted though; with heavily customized metrics one might have to provide this argument each time they call `log`. This is something that needs to be considered. Maybe we need to be able to configure that.
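A rough sketch of the callable idea described above, assuming a two-value comparison function that defaults to `max`; the `choose_best` argument is purely illustrative, not an existing dvclive argument:

```python
# Illustrative only: a per-metric "best" chooser supplied by the user.
# The `choose_best` argument is hypothetical, not part of dvclive.
def update_best(best_so_far, new_value, choose_best=max):
    """Return the better of the stored best and the new value."""
    if best_so_far is None:
        return new_value
    return choose_best(best_so_far, new_value)

# Default behaviour: higher is better.
assert update_best(0.8, 0.9) == 0.9
# Custom behaviour: lower is better (e.g. for a loss).
assert update_best(0.30, 0.25, choose_best=min) == 0.25
```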
> As to problems: Related: #81 - both in this issue and here we need to come up with a way to store the data in a structured way. Maybe the summary JSON needs to be separated into "latest", "system" and "best" sections.
I think that a structured summary would be a great idea and it would make the summary easier to extend for future ideas/feature requests.
> As to saving the best, I think we should let users provide a callable taking a list as argument/comparing two values, which by default would be `max(list)` or `max(x1, x2)`. That way users could provide their own methods to find the best. This kind of solution might be quite short-sighted though; with heavily customized metrics one might have to provide this argument each time they call `log`. This is something that needs to be considered. Maybe we need to be able to configure that.
I'm not sure if a custom callable would be needed. I searched for similar scenarios (i.e. the tensorflow/keras `ModelCheckpoint`) and they just provide the options `min`, `max` or `auto` for the `mode` argument.

A problem in the `dvclive` scenario would be that the `mode` for selecting the best value needs to be specified for each metric, whereas in the `ModelCheckpoint` only one metric is being monitored.
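A small sketch of how a per-metric `mode` could be resolved, loosely following the keras `auto` heuristic (accuracy-like names treated as higher-is-better, everything else as lower-is-better); the function name and the exact heuristic are assumptions for illustration, not dvclive or keras code:

```python
# Hypothetical helper: pick a comparison for each metric name.
# Loosely mimics keras' "auto" mode; the heuristic is an assumption.
def resolve_mode(metric_name, mode="auto"):
    if mode == "min":
        return min
    if mode == "max":
        return max
    # "auto": accuracy-like metrics are maximized, everything else minimized.
    name = metric_name.lower()
    return max if ("acc" in name or name.startswith("fmeasure")) else min

assert resolve_mode("val_accuracy") is max
assert resolve_mode("val_loss") is min
assert resolve_mode("loss", mode="max") is max
```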
> I think that a structured summary would be a great idea and it would make the summary easier to extend for future ideas/feature requests.
Yeah, but before implementing that we need to consider how we want to handle it on the DVC side - if we include those in the summary JSON, those values will be treated as metrics from DVC's point of view. While it might make sense for "best" and "latest", I don't think it's a viable use case for "system" metrics. On the other hand, I can totally imagine a situation where someone wants to visualize memory usage and plot system-controlled metrics. We need to research and discuss what we actually want from those parameters.
> Yeah, but before implementing that we need to consider how we want to handle it on the DVC side - if we include those in the summary JSON, those values will be treated as metrics from DVC's point of view. While it might make sense for "best" and "latest", I don't think it's a viable use case for "system" metrics. On the other hand, I can totally imagine a situation where someone wants to visualize memory usage and plot system-controlled metrics. We need to research and discuss what we actually want from those parameters.
I see your point. I think that we could make a distinction between:

- live metrics
  - Generated in `dvclive.log`.
  - Stored in `dvclive/metric.tsv`.
  - Intended to be used along with `dvc plots` (i.e. to monitor the train loop by refreshing the `.html`).
- dvc metrics
  - Generated in `dvclive.next_step`.
  - Stored in `dvclive.json`.
  - Values depending on their corresponding `.tsv` files using some sort of "aggregation" (i.e. `latest`, `best`, `mean`, etc.).
  - Intended to be used along with `dvc metrics` (i.e. to compare experiments).

So, for the specific case of `system` metrics (#81), we would have `Memory usage` as a live metric being periodically logged to the `.tsv`, and then some sort of aggregation (i.e. `Mean Memory Usage`) saved in the `.json` to be used as a dvc metric.

In this scenario I think that your original idea of a custom callable would make much more sense for letting the user decide the "aggregation".
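A sketch of how such an "aggregated" summary could be produced from the per-step values, with a structured layout separating "latest" values from aggregated ones; the section names and the `aggregations` argument are assumptions for illustration, not the current dvclive summary format:

```python
# Illustrative sketch: build a structured summary from per-step values.
# The "latest"/"aggregated" layout and `aggregations` argument are assumptions.
import json
import statistics

def make_structured_summary(history, aggregations, path="dvclive.json"):
    """history: {metric_name: [value_at_step_0, value_at_step_1, ...]}"""
    summary = {
        "latest": {name: values[-1] for name, values in history.items()},
        "aggregated": {
            name: agg(history[name]) for name, agg in aggregations.items()
        },
    }
    with open(path, "w") as f:
        json.dump(summary, f, indent=4)
    return summary

history = {"accuracy": [0.71, 0.84, 0.82], "memory_usage": [512, 640, 600]}
# Accuracy: keep the best (max); memory usage: report the mean instead.
make_structured_summary(history, {"accuracy": max, "memory_usage": statistics.mean})
```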
@daavoo Having max/min is a good idea!

Do we need extra metrics/files? Can it be done as a DVC command, like `dvc metrics show --max metrics.json`, or `--show-json` if you need a JSON file?

For system metrics - yes, a separate metric/file is needed.
I was trying an example repo with `dvclive` using `dvc experiments`, where the pipeline has some stages after the `train` stage (where `dvclive` is actually used) that depend on selecting the best checkpoint.

In this `dvc exp` workflow the issue doesn't seem so relevant if you go with the following workaround:

I first run the pipeline and stop after the train stage:

`dvc exp run train`

At this point `dvclive` has created all the checkpoints. So I visualize the table:

`dvc exp show`

And manually (looking at the metrics/params) select and apply the best one:

`dvc exp apply {hash of best checkpoint}`
`git add .`
`git commit -m "Applied {hash of best checkpoint}"`

After that I run the downstream stages that depended on the best checkpoint:

`dvc exp run {stage-after-train} --downstream`

However, I think that there should be a better way to automatically "select and apply" the best checkpoint in order to avoid the manual step and allow the pipeline to run end-to-end.
In this case, you not only want to keep an additional metric for the best value, but you might want to save the best model instead of the latest model, similar to the `restore_best_weights` option in https://keras.io/api/callbacks/early_stopping/. If dvclive could keep track of the best model and save it at each epoch, then it would be trivial to automate that dvc pipeline, and it would save space wasted on worse models.
@dberenbaum wouldn't that mean that we would somehow need to specify what "the best" means?
> In this case, you not only want to keep an additional metric for the best value, but you might want to save the best model instead of the latest model, similar to the `restore_best_weights` option in https://keras.io/api/callbacks/early_stopping/. If dvclive could keep track of the best model and save it at each epoch, then it would be trivial to automate that dvc pipeline, and it would save space wasted on worse models.
If you use `dvclive` alongside `DVC`, it kind of already saves the space if you follow the workflow I described above.

`dvc exp run` saves all models (for example, if the `dvclive.keras` integration is used it will create one checkpoint per epoch) but `dvc exp apply` (user selects which one is the "best") followed by `git commit` will delete all the other models and just keep the selected checkpoint, right?
> @dberenbaum wouldn't that mean that we would somehow need to specify what "the best" means?
Yes. At least, it would probably entail specifying the metric to monitor and whether min/max is best.
> If you use `dvclive` alongside `DVC`, it kind of already saves the space if you follow the workflow I described above. `dvc exp run` saves all models (for example, if the `dvclive.keras` integration is used it will create one checkpoint per epoch) but `dvc exp apply` (user selects which one is the "best") followed by `git commit` will delete all the other models and just keep the selected checkpoint, right?
Actually deleting the other models requires some additional steps (`dvc exp gc` and `dvc gc`), and it doesn't prevent storage from potentially blowing up with poorly performing models in the first place.
Maybe more importantly, keeping only the best model is one way to address the issue you raised:
> However, I think that there should be a better way to automatically "select and apply" the best checkpoint in order to avoid the manual step and allow the pipeline to run end-to-end.
If dvclive had an option similar to `restore_best_weights`, the full pipeline could run automatically since the model file from the latest checkpoint would always be the best. It also preserves the metrics from epochs after performance started to degrade, unlike the current approach where you lose all information about epochs after the best one.

I'm not sure whether this would be a good idea, but curious to get your thoughts on this workflow @daavoo.
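A rough sketch of what that behaviour could look like inside a training loop, where the file on disk at every step is always the best model seen so far; the class, the `monitor` argument and `higher_is_better` flag are hypothetical, not existing dvclive options:

```python
# Hypothetical sketch: always persist the best model seen so far,
# so the latest checkpoint on disk is also the best one.
import shutil

class BestModelKeeper:
    def __init__(self, monitor="val_accuracy", higher_is_better=True):
        self.monitor = monitor
        self.higher_is_better = higher_is_better
        self.best_value = None

    def maybe_update(self, metrics, model_path, best_path="model_best.pt"):
        value = metrics[self.monitor]
        improved = (
            self.best_value is None
            or (value > self.best_value) == self.higher_is_better
        )
        if improved:
            self.best_value = value
            # Copy the freshly written checkpoint over the "best" file,
            # so downstream stages can always depend on `best_path`.
            shutil.copyfile(model_path, best_path)
        return improved
```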
> If dvclive had an option similar to `restore_best_weights`, the full pipeline could run automatically since the model file from the latest checkpoint would always be the best. It also preserves the metrics from epochs after performance started to degrade, unlike the current approach where you lose all information about epochs after the best one.
>
> I'm not sure whether this would be a good idea, but curious to get your thoughts on this workflow @daavoo.
I personally like that workflow more than the checkpoint `apply`, however I'm not sure what would be the best way to implement it.

On the one hand, we could implement all the functionality on the `dvclive` side; meaning that, in addition to taking care of saving the model (#105) for each integration, we would need to:

- Implement a "cross-integration" logic for monitoring the best metric.
- Expose it to the user via configuration, potentially requiring additions on the DVC side (i.e. to the `--live` namespace).

I think this would have the benefit of leaving us with full control over the implementation.

However, we would probably need to consider how easily users could adapt their code to this workflow and how compatible it would be with the existing functionality of each integration. For example, if someone is using `keras` with the `ModelCheckpoint` and `EarlyStopping` callbacks and wants to migrate to our hypothetical workflow, we should probably suggest they remove those callbacks in order to get things working as we intend. This could also mean that, for some integrations, the user will lose functionality, as we might not cover as much as the ML framework itself.

On the other hand, we could rely on somehow reusing the logic for monitoring the best metric already implemented in each integration, similar to what the mlflow<>keras integration does with the `EarlyStopping` callback. This would have the benefit of letting users easily extend their workflow without removing things already implemented in the ML framework. However, this might not be possible in every integration and it would probably significantly increase the effort required to add and maintain integrations.
Great points! Users would probably rather reuse the existing functionality within those ML frameworks, and it fits better with the dvclive ethos of being lightweight and interfering with code as little as possible. It does make the integrations more involved and potentially inconsistent between frameworks, but better to handle this in dvclive than expect users to do it themselves, right?
Existing model checkpoint callbacks already handle storing the best metric value and model weights associated with it:
- https://keras.io/api/callbacks/model_checkpoint/
- https://pytorch-lightning.readthedocs.io/en/latest/extensions/generated/pytorch_lightning.callbacks.ModelCheckpoint.html#pytorch_lightning.callbacks.ModelCheckpoint
Maybe we can focus on building integrations with existing callbacks.
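For example, in keras the existing callback can already keep only the best weights on disk; a DVCLive integration would mainly need to read the monitored value rather than reimplement the comparison. The callback below is standard tf.keras usage, while how DVCLive would hook into it is left open:

```python
# Standard tf.keras callback: only overwrites the file when `val_loss` improves,
# so the checkpoint on disk is always the best model so far.
from tensorflow import keras

checkpoint = keras.callbacks.ModelCheckpoint(
    filepath="model.h5",
    monitor="val_loss",
    mode="min",
    save_best_only=True,
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=10, callbacks=[checkpoint])
```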
I can think of a few specific enhancements we could build around summarizing the best values:
- ML frameworks assume that overwriting the previous models makes them unrecoverable. DVC makes this constraint obsolete since users can checkout the model from any prior checkpoint. We may want to consider how to utilize DVC's strength here in being able to overwrite previous models but still recover them later. This is similar to the mlflow-keras early stopping integration that you mentioned @daavoo, where instead of only restoring the best weights they log one step with the latest metrics and model and another with the restored metrics and model.
- For manual selection of the best model, some advanced sorting or summarization may still be helpful after the fact (see the above suggestion from @dmpetrov). Users can look for best values in multiple ways before deciding which is truly best. Again, DVC's ability to check out any prior checkpoint makes this easy (although downstream stages still require the clunky `--downstream` workflow from above).
wandb now has support for something similar: https://docs.wandb.ai/guides/track/log#customize-the-summary.
DVCLive could support something similar with either:

1. Granular best metrics like `live.log("loss", 0.2, keep="min")`. This could keep only the min loss instead of the latest loss in the summary `metrics.json`. Other summary metrics would be recorded with the latest values (or min or max as defined by their `log` calls).
2. A single defined objective for the summary like `live.make_summary(objective="loss", keep="min")`. The difference here is that DVCLive would find the step with the min loss and keep all metrics from that step in `metrics.json`.

I like the latter since mixing metrics from different steps seems unhelpful to me.

Edit: As part of 2, we could also put these arguments into the init method like `Live(objective="loss", keep="min")` and then pass them to `make_summary`.
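A sketch of what option 2 could do internally: pick the step where the objective is best and write that step's metrics as the summary. `make_summary`, `objective` and `keep` are the proposed names from this thread, not an existing DVCLive API:

```python
# Illustrative sketch of option 2: summarize the step where the objective is best.
# `objective`/`keep` follow the proposal above; nothing here is existing DVCLive API.
import json

def make_summary(history, objective="loss", keep="min", path="metrics.json"):
    """history: list of {metric_name: value} dicts, one per step."""
    choose = min if keep == "min" else max
    best_step = choose(range(len(history)), key=lambda i: history[i][objective])
    summary = dict(history[best_step], step=best_step)
    with open(path, "w") as f:
        json.dump(summary, f, indent=4)
    return summary

steps = [
    {"loss": 0.9, "accuracy": 0.60},
    {"loss": 0.4, "accuracy": 0.81},  # best loss -> all metrics come from this step
    {"loss": 0.5, "accuracy": 0.83},
]
print(make_summary(steps, objective="loss", keep="min"))
```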