summary: Add option to store **best** value?
By default the values for each metric at the latest step are being saved in the summary. It might be interesting to add an option to additionally save the best value for each metric.

An example use case would be the usual deep learning training loop where the model is being validated at the end of each epoch and the last epoch is not necessarily the one with the best performance. Having the best value saved in the summary could be more useful for comparing experiments (i.e. `dvc metrics diff --targets dvclive.json`).
A potential problem would be how to expose to the user (https://github.com/iterative/dvclive/issues/75#issuecomment-842186329) some options, like whether to use `save_best` or not and what to consider as better (i.e. higher or lower).
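A minimal sketch of what tracking a best value alongside the latest one could look like; `save_best` and `higher_is_better` are illustrative names from this discussion, not part of the current dvclive API:

```python
# Hypothetical sketch: keep both the latest and the best value per metric.
# `save_best` and `higher_is_better` are illustrative names, not dvclive API.
import json

class SummaryTracker:
    def __init__(self, save_best=True, higher_is_better=True):
        self.save_best = save_best
        self.higher_is_better = higher_is_better
        self.latest = {}
        self.best = {}

    def log(self, name, value):
        self.latest[name] = value
        if self.save_best:
            current = self.best.get(name)
            better = max if self.higher_is_better else min
            self.best[name] = value if current is None else better(current, value)

    def dump(self, path="dvclive.json"):
        summary = dict(self.latest)
        if self.save_best:
            summary.update({f"best_{k}": v for k, v in self.best.items()})
        with open(path, "w") as f:
            json.dump(summary, f, indent=4)
```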
As to problems: Related: #81 - both in this issue and here we need to come up with a way to store the data in a structured way. Maybe the summary JSON needs to be separated into "latest", "system" and "best" sections.

As to saving the best, I think we should let users provide a callable taking a list as argument/comparing two values, which by default would be `max(list)` or `max(x1, x2)`. That way users could provide their own methods to find the best.

This kind of solution might be quite short-sighted though; with heavily customized metrics one might have to provide this argument each time they call `log`. This is something that needs to be considered. Maybe we need to be able to configure that.
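A rough sketch of the callable idea described above, assuming a two-value comparison function that defaults to `max`; the `choose_best` argument is purely illustrative, not an existing dvclive argument:

```python
# Illustrative only: a per-metric "best" chooser supplied by the user.
# The `choose_best` argument is hypothetical, not part of dvclive.
def update_best(best_so_far, new_value, choose_best=max):
    """Return the better of the stored best and the new value."""
    if best_so_far is None:
        return new_value
    return choose_best(best_so_far, new_value)

# Default behaviour: higher is better.
assert update_best(0.8, 0.9) == 0.9
# Custom behaviour: lower is better (e.g. for a loss).
assert update_best(0.30, 0.25, choose_best=min) == 0.25
```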
> As to problems: Related: #81 - both in this issue and here we need to come up with a way to store the data in a structured way. Maybe the summary JSON needs to be separated into "latest", "system" and "best" sections.
I think that a structured summary would be a great idea and it would make the summary easier to extend for future ideas/feature requests.
> As to saving the best, I think we should let users provide a callable taking a list as argument/comparing two values, which by default would be `max(list)` or `max(x1, x2)`. That way users could provide their own methods to find the best. This kind of solution might be quite short-sighted though; with heavily customized metrics one might have to provide this argument each time they call `log`. This is something that needs to be considered. Maybe we need to be able to configure that.
I'm not sure if a custom callable would be needed. I searched for similar scenarios (i.e. the tensorflow/keras `ModelCheckpoint`) and they just provide the options `min`, `max` or `auto` for the `mode` argument.

A problem in the `dvclive` scenario would be that the `mode` for selecting the best value needs to be specified for each metric, whereas in the `ModelCheckpoint` only one metric is being monitored.
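A small sketch of how a per-metric `mode` could be resolved, loosely following the keras `auto` heuristic (accuracy-like names treated as higher-is-better, everything else as lower-is-better); the function name and the exact heuristic are assumptions for illustration, not dvclive or keras code:

```python
# Hypothetical helper: pick a comparison for each metric name.
# Loosely mimics keras' "auto" mode; the heuristic is an assumption.
def resolve_mode(metric_name, mode="auto"):
    if mode == "min":
        return min
    if mode == "max":
        return max
    # "auto": accuracy-like metrics are maximized, everything else minimized.
    name = metric_name.lower()
    return max if ("acc" in name or name.startswith("fmeasure")) else min

assert resolve_mode("val_accuracy") is max
assert resolve_mode("val_loss") is min
assert resolve_mode("loss", mode="max") is max
```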
> I think that a structured summary would be a great idea and it would make the summary easier to extend for future ideas/feature requests.
Yeah, but before implementing that we need to consider how we want to handle it on the DVC side - if we include those in the summary JSON, those values will be treated as metrics from DVC's point of view. While it might make sense for "best" and "latest", I don't think it's a viable use case for "system" metrics. On the other hand, I can totally imagine a situation where someone wants to visualize memory usage and plot system-controlled metrics. We need to research and discuss what we actually want from those parameters.
> Yeah, but before implementing that we need to consider how we want to handle it on the DVC side - if we include those in the summary JSON, those values will be treated as metrics from DVC's point of view. While it might make sense for "best" and "latest", I don't think it's a viable use case for "system" metrics. On the other hand, I can totally imagine a situation where someone wants to visualize memory usage and plot system-controlled metrics. We need to research and discuss what we actually want from those parameters.
I see your point. I think that we could make a distinction between:

- live metrics
  - Generated in `dvclive.log`.
  - Stored in `dvclive/metric.tsv`.
  - Intended to be used along with `dvc plots` (i.e. to monitor the train loop by refreshing the `.html`).
- dvc metrics
  - Generated in `dvclive.next_step`.
  - Stored in `dvclive.json`.
  - Values depending on their corresponding `.tsv` files using some sort of "aggregation" (i.e. `latest`, `best`, `mean`, etc.).
  - Intended to be used along with `dvc metrics` (i.e. to compare experiments).

So, for the specific case of `system` metrics (#81), we would have `Memory usage` as a live metric being periodically logged to the `.tsv`, and then some sort of aggregation (i.e. `Mean Memory Usage`) saved in the `.json` to be used as a dvc metric.

In this scenario I think that your original idea of a custom callable would make much more sense for letting the user decide the "aggregation".
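A sketch of how such an "aggregated" summary could be produced from the per-step values, with a structured layout separating "latest" values from aggregated ones; the section names and the `aggregations` argument are assumptions for illustration, not the current dvclive summary format:

```python
# Illustrative sketch: build a structured summary from per-step values.
# The "latest"/"aggregated" layout and `aggregations` argument are assumptions.
import json
import statistics

def make_structured_summary(history, aggregations, path="dvclive.json"):
    """history: {metric_name: [value_at_step_0, value_at_step_1, ...]}"""
    summary = {
        "latest": {name: values[-1] for name, values in history.items()},
        "aggregated": {
            name: agg(history[name]) for name, agg in aggregations.items()
        },
    }
    with open(path, "w") as f:
        json.dump(summary, f, indent=4)
    return summary

history = {"accuracy": [0.71, 0.84, 0.82], "memory_usage": [512, 640, 600]}
# Accuracy: keep the best (max); memory usage: report the mean instead.
make_structured_summary(history, {"accuracy": max, "memory_usage": statistics.mean})
```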
@daavoo Having max/min is a good idea!

Do we need extra metrics/files? Can it be done as a DVC command, like `dvc metrics show --max metrics.json`, or `--show-json` if you need a JSON file?

For system metrics - yes, a separate metric/file is needed.
I was trying an example repo with `dvclive` using `dvc experiments`, where the pipeline has some stages after the `train` stage (where `dvclive` is actually used) that depend on selecting the best checkpoint.

In this `dvc exp` workflow the issue doesn't seem so relevant if you go with the following workaround:

I first run the pipeline and stop after the train stage:

`dvc exp run train`

At this point `dvclive` has created all the checkpoints. So I visualize the table:

`dvc exp show`

And manually (looking at the metrics/params) select and apply the best one:

`dvc exp apply {hash of best checkpoint}`
`git add .`
`git commit -m "Applied {hash of best checkpoint}"`

After that I run the downstream stages that depended on the best checkpoint:

`dvc exp run {stage-after-train} --downstream`

However, I think that there should be a better way to automatically "select and apply" the best checkpoint in order to avoid the manual step and allow the pipeline to run end-to-end.
In this case, you not only want to keep an additional metric for the best value, but you might want to save the best model instead of the latest model, similar to the `restore_best_weights` option in https://keras.io/api/callbacks/early_stopping/. If dvclive could keep track of the best model and save it at each epoch, then it would be trivial to automate that dvc pipeline, and it would save space wasted on worse models.
@dberenbaum wouldn't that mean that we would somehow need to specify what "the best" means?
> In this case, you not only want to keep an additional metric for the best value, but you might want to save the best model instead of the latest model, similar to the `restore_best_weights` option in https://keras.io/api/callbacks/early_stopping/. If dvclive could keep track of the best model and save it at each epoch, then it would be trivial to automate that dvc pipeline, and it would save space wasted on worse models.
If you use `dvclive` alongside `DVC`, it kind of already saves the space if you follow the workflow I described above.

`dvc exp run` saves all models (for example, if the `dvclive.keras` integration is used it will create one checkpoint per epoch) but `dvc exp apply` (user selects which one is the "best") followed by `git commit` will delete all the other models and just keep the selected checkpoint, right?
> @dberenbaum wouldn't that mean that we would somehow need to specify what "the best" means?
Yes. At least, it would probably entail specifying the metric to monitor and whether min/max is best.
> If you use `dvclive` alongside `DVC`, it kind of already saves the space if you follow the workflow I described above. `dvc exp run` saves all models (for example, if the `dvclive.keras` integration is used it will create one checkpoint per epoch) but `dvc exp apply` (user selects which one is the "best") followed by `git commit` will delete all the other models and just keep the selected checkpoint, right?
Actually deleting the other models requires some additional steps (`dvc exp gc` and `dvc gc`), and it doesn't prevent storage from potentially blowing up with poorly performing models in the first place.
Maybe more importantly, keeping only the best model is one way to address the issue you raised:
> However, I think that there should be a better way to automatically "select and apply" the best checkpoint in order to avoid the manual step and allow the pipeline to run end-to-end.
If dvclive had an option similar to `restore_best_weights`, the full pipeline could run automatically since the model file from the latest checkpoint would always be the best. It also preserves the metrics from epochs after performance started to degrade, unlike the current approach where you lose all information about epochs after the best one.

I'm not sure whether this would be a good idea, but curious to get your thoughts on this workflow @daavoo.
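A rough sketch of what that behaviour could look like inside a training loop, where the file on disk at every step is always the best model seen so far; the class, the `monitor` argument and `higher_is_better` flag are hypothetical, not existing dvclive options:

```python
# Hypothetical sketch: always persist the best model seen so far,
# so the latest checkpoint on disk is also the best one.
import shutil

class BestModelKeeper:
    def __init__(self, monitor="val_accuracy", higher_is_better=True):
        self.monitor = monitor
        self.higher_is_better = higher_is_better
        self.best_value = None

    def maybe_update(self, metrics, model_path, best_path="model_best.pt"):
        value = metrics[self.monitor]
        improved = (
            self.best_value is None
            or (value > self.best_value) == self.higher_is_better
        )
        if improved:
            self.best_value = value
            # Copy the freshly written checkpoint over the "best" file,
            # so downstream stages can always depend on `best_path`.
            shutil.copyfile(model_path, best_path)
        return improved
```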
> If dvclive had an option similar to `restore_best_weights`, the full pipeline could run automatically since the model file from the latest checkpoint would always be the best. It also preserves the metrics from epochs after performance started to degrade, unlike the current approach where you lose all information about epochs after the best one.
>
> I'm not sure whether this would be a good idea, but curious to get your thoughts on this workflow @daavoo.
I personally like that workflow more than the checkpoint `apply`, however I'm not sure what would be the best way to implement it.

On the one hand, we could implement all the functionality on the `dvclive` side; meaning that, in addition to taking care of saving the model (#105) for each integration, we would need to:

- Implement a "cross-integration" logic for monitoring the best metric.
- Expose it to the user via configuration, potentially requiring additions on the DVC side (i.e. to the `--live` namespace).

I think this would have the benefit of leaving us with full control over the implementation.

However, we would probably need to consider how easily users could adapt their code to this workflow and how compatible it would be with the existing functionality of each integration. For example, if someone is using `keras` with the `ModelCheckpoint` and `EarlyStopping` callbacks and wants to migrate to our hypothetical workflow, we should probably suggest they remove those callbacks in order to get things working as we intend. This could also mean that, for some integrations, the user will lose functionality, as we might not cover as much as the ML framework itself.

On the other hand, we could rely on somehow reusing the logic for monitoring the best metric already implemented in each integration, similar to what the mlflow<>keras integration does with the `EarlyStopping` callback. This would have the benefit of letting users easily extend their workflow without removing things already implemented in the ML framework. However, this might not be possible in every integration and it would probably significantly increase the effort required to add and maintain integrations.
Great points! Users would probably rather reuse the existing functionality within those ML frameworks, and it fits better with the dvclive ethos of being lightweight and interfering with code as little as possible. It does make the integrations more involved and potentially inconsistent between frameworks, but better to handle this in dvclive than expect users to do it themselves, right?
Existing model checkpoint callbacks already handle storing the best metric value and model weights associated with it:
- https://keras.io/api/callbacks/model_checkpoint/
- https://pytorch-lightning.readthedocs.io/en/latest/extensions/generated/pytorch_lightning.callbacks.ModelCheckpoint.html#pytorch_lightning.callbacks.ModelCheckpoint
Maybe we can focus on building integrations with existing callbacks.
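For example, in keras the existing callback can already keep only the best weights on disk; a DVCLive integration would mainly need to read the monitored value rather than reimplement the comparison. The callback below is standard tf.keras usage, while how DVCLive would hook into it is left open:

```python
# Standard tf.keras callback: only overwrites the file when `val_loss` improves,
# so the checkpoint on disk is always the best model so far.
from tensorflow import keras

checkpoint = keras.callbacks.ModelCheckpoint(
    filepath="model.h5",
    monitor="val_loss",
    mode="min",
    save_best_only=True,
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=10, callbacks=[checkpoint])
```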
I can think of a few specific enhancements we could build around summarizing the best values:
- ML frameworks assume that overwriting the previous models makes them unrecoverable. DVC makes this constraint obsolete since users can checkout the model from any prior checkpoint. We may want to consider how to utilize DVC's strength here in being able to overwrite previous models but still recover them later. This is similar to the mlflow-keras early stopping integration that you mentioned @daavoo, where instead of only restoring the best weights they log one step with the latest metrics and model and another with the restored metrics and model.
- For manual selection of the best model, some advanced sorting or summarization may still be helpful after the fact (see the above suggestion from @dmpetrov). Users can look for best values in multiple ways before deciding which is truly best. Again, DVC's ability to check out any prior checkpoint makes this easy (although downstream stages still require the clunky `--downstream` workflow from above).
wandb now has support for something similar: https://docs.wandb.ai/guides/track/log#customize-the-summary.
DVCLive could support something similar with either:

1. Granular best metrics like `live.log("loss", 0.2, keep="min")`. This could keep only the min loss instead of the latest loss in the summary `metrics.json`. Other summary metrics would be recorded with the latest values (or min or max as defined by their `log` calls).
2. A single defined objective for the summary like `live.make_summary(objective="loss", keep="min")`. The difference here is that DVCLive would find the step with the min loss and keep all metrics from that step in `metrics.json`.

I like the latter since mixing metrics from different steps seems unhelpful to me.

Edit: As part of 2, we could also put these arguments into the init method like `Live(objective="loss", keep="min")` and then pass them to `make_summary`.
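A sketch of what option 2 could do internally: pick the step where the objective is best and write that step's metrics as the summary. `make_summary`, `objective` and `keep` are the proposed names from this thread, not an existing DVCLive API:

```python
# Illustrative sketch of option 2: summarize the step where the objective is best.
# `objective`/`keep` follow the proposal above; nothing here is existing DVCLive API.
import json

def make_summary(history, objective="loss", keep="min", path="metrics.json"):
    """history: list of {metric_name: value} dicts, one per step."""
    choose = min if keep == "min" else max
    best_step = choose(range(len(history)), key=lambda i: history[i][objective])
    summary = dict(history[best_step], step=best_step)
    with open(path, "w") as f:
        json.dump(summary, f, indent=4)
    return summary

steps = [
    {"loss": 0.9, "accuracy": 0.60},
    {"loss": 0.4, "accuracy": 0.81},  # best loss -> all metrics come from this step
    {"loss": 0.5, "accuracy": 0.83},
]
print(make_summary(steps, objective="loss", keep="min"))
```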